Predicting 10-year CV risk with Framingham data using logistic regression: The ‘good old’ way vs. LASSO–two competing models

When models compete… it sometimes is just plain ridiculous.

[gif: breakdance fight]

I have recently been thinking a lot about both CV risk modeling (especially given some of the controversy the newest ATP 10-year ASCVD risk calculator has ignited) and regression methods with cross-validation. Putting the particular debate around that newer risk calculator aside, I found myself wondering how different ways of applying regression methods to the same problem can produce models with different predictive performance. Specifically, I want to compare the performance of the ‘good old’ method of model selection (by which I mean the way I was taught to apply it in graduate school) with a ‘newer’ regression shrinkage method: logistic regression with LASSO. By the ‘good old’ way, I mean a procedure that involves: 1) throwing all the features into the model, 2) evaluating the significance of each feature for selection, and 3) re-specifying the model with only those features retained in step 2. And while I call logistic regression with LASSO a ‘newer’ method, I realize that for many readers there is nothing novel about it at all.

One of the main reasons I am developing a healthy skepticism about the traditional method is that it tends to leave you with models containing many features, and therefore with high variance (a sub-optimal bias-variance tradeoff). That makes intuitive sense to me; still, I’d love to observe the difference on a problem we might actually face in the health outcomes field. To that end, I’ve applied both techniques to a cut of Framingham data, with one main question: which model does a better job of predicting 10-year CV risk? So, let’s put both methods to the test. May the best model win!
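For concreteness, here is what the shrinkage actually does. LASSO keeps every candidate feature in the model but penalizes coefficient size: for logistic regression, glmnet with alpha = 1 minimizes the negative log-likelihood plus an L1 penalty on the (non-intercept) coefficients,

$$\hat{\beta} = \underset{\beta_0,\,\beta}{\arg\min}\; -\frac{1}{n}\sum_{i=1}^{n}\Big[\,y_i(\beta_0 + x_i^\top \beta) - \log\big(1 + e^{\beta_0 + x_i^\top \beta}\big)\Big] \;+\; \lambda \sum_{j=1}^{p} |\beta_j|$$

Larger values of lambda shrink more coefficients exactly to zero, trading a little bias for lower variance; cross-validation (below) is what picks lambda.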

# load framingham dataset
options(stringsAsFactors=F)
framingham <- read.csv("~/R_folder/logreg/data/framingham.csv")
# remove observations with missing data for this example 
framingham <- na.omit(framingham)

# create test & training data
set.seed(1001)
testindex <- sample(1:nrow(framingham), round(.15*nrow(framingham), 0), replace=FALSE)
train <- framingham[-testindex, ]
test <-  framingham[testindex, ]

# first plain vanilla logistic regression (retaining only those variables that are statistically significant)
mod1 <- glm(TenYearCHD~., data=train, family="binomial")
summary(mod1) # retain only the features with p <0.05
vanilla <- glm(TenYearCHD~male+age+cigsPerDay+sysBP+glucose, data=train, family="binomial")
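As an aside (my addition, not part of the original workflow), the same p < 0.05 screen can be pulled straight out of the summary() output rather than read off by eye; a minimal sketch:

# coefficient table from the full model: Estimate, Std. Error, z value, Pr(>|z|)
coef.table <- summary(mod1)$coefficients
# keep the feature names (other than the intercept) with p < 0.05
sig.features <- setdiff(rownames(coef.table)[coef.table[, "Pr(>|z|)"] < 0.05], "(Intercept)")
sig.features  # should roughly match the hand-picked list above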

# use cv.glmnet from the glmnet package
# this method uses cross-validation to choose the optimal lambda,
# and therefore how many features to retain and what their coefficients are
library(glmnet)
# glmnet expects a numeric predictor matrix and a response vector
x.matrix <- model.matrix(TenYearCHD ~ ., data=train)[, -1]
y.val <- train$TenYearCHD
cv.lasso <- cv.glmnet(x.matrix, y.val, alpha=1, family="binomial")
plot(cv.lasso)
coef(cv.lasso)
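One detail worth knowing (my aside, not something the original code surfaces): cv.glmnet stores two candidate penalties, and coef() and predict() use lambda.1se by default.

# lambda.min minimizes the cross-validated deviance;
# lambda.1se is the most regularized model within one standard error of that minimum
cv.lasso$lambda.min
cv.lasso$lambda.1se
# request lambda.min explicitly if you prefer the less-shrunk model
coef(cv.lasso, s="lambda.min")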

# assess accuracy of the model derived from the plain vanilla approach
# use your held-out test data
vanilla.prediction <- predict(vanilla, newdata=test, type="response")
# confusion matrix for plain vanilla log reg (0.5 probability threshold)
table(test$TenYearCHD, vanilla.prediction >= 0.5)
# accuracy
mean(test$TenYearCHD == (vanilla.prediction >= 0.5))

# test lasso predictions on the same hold-out set
# glmnet needs the test predictors in matrix form as well
test.x.matrix <- model.matrix(TenYearCHD ~ ., data=test)[, -1]
cv.lasso.prediction <- predict(cv.lasso, newx=test.x.matrix, type="response")
# confusion matrix for lasso log reg with CV-chosen lambda and coefficients
table(test$TenYearCHD, cv.lasso.prediction >= 0.5)
# accuracy
mean(test$TenYearCHD == (cv.lasso.prediction >= 0.5))
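To make the head-to-head comparison explicit, a short wrap-up (my addition) that collects both hold-out accuracies into one small data frame:

# side-by-side accuracy of the two models on the same hold-out set
vanilla.acc <- mean(test$TenYearCHD == (vanilla.prediction >= 0.5))
lasso.acc <- mean(test$TenYearCHD == (cv.lasso.prediction >= 0.5))
data.frame(model=c("plain vanilla glm", "cv.glmnet lasso"), accuracy=c(vanilla.acc, lasso.acc))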

In this particular case, the ‘good old’ way appears to produce a slightly better model (accuracy of 0.842 vs. 0.838 on the hold-out set). I’m not going to assume that this will always be the case, but I had fun putting both models to the test!