I have recently been thinking a lot about both cardiovascular risk modeling (especially given the controversy the newest ATP 10-year ASCVD risk calculator has ignited) and regression methods with cross-validation. Putting the particular debate around that risk calculator aside, I found myself wondering how different ways of applying regression methods to a problem can produce models with different predictive performance. Specifically, I want to compare the performance of the 'good old' method of model selection (by that I mean the way I was taught to apply it in graduate school) to a 'newer' regression shrinkage method: logistic regression with the LASSO. When I say the "good old" way, I'm referring to a procedure that involves: 1) throwing all the features into the model, 2) evaluating the significance of each feature for selection, and 3) re-specifying the model with only those features retained from step 2. And while I might call logistic regression with the LASSO a 'newer' method, I realize that for many readers there would be nothing novel about it at all.
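To make steps 1 through 3 concrete, here is a minimal sketch of that procedure in R. It assumes a data frame called train with a binary outcome TenYearCHD (the same names used in the full example below) and numeric predictors, so that coefficient names match column names; full.mod, pvals, keep, and reduced.mod are hypothetical names of my own choosing.

# step 1: throw all the features into the model
full.mod <- glm(TenYearCHD ~ ., data = train, family = "binomial")

# step 2: evaluate the significance of each feature
pvals <- summary(full.mod)$coefficients[, "Pr(>|z|)"]
keep <- setdiff(names(pvals)[pvals < 0.05], "(Intercept)")

# step 3: re-specify the model with only the retained features
# (reformulate builds the formula TenYearCHD ~ <retained features>)
reduced.mod <- glm(reformulate(keep, response = "TenYearCHD"),
                   data = train, family = "binomial")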
One of the main reasons I am developing a healthy skepticism about the traditional method is that it tends to leave you with models containing many features, and therefore models with high variance (a sub-optimal spot on the bias-variance tradeoff). That makes a lot of intuitive sense to me; still, I'd love to observe the difference on a problem we might actually face in the health outcomes field. To that end, I've decided to apply both techniques to a cut of the Framingham data. The main question: which model does a better job of predicting 10-year CV risk? So, let's put both methods to the test! May the best model win!
# load the glmnet package for the LASSO fit
library(glmnet)

# load framingham dataset
options(stringsAsFactors = FALSE)
framingham <- read.csv("~/R_folder/logreg/data/framingham.csv")

# remove observations with missing data for this example
framingham <- na.omit(framingham)

# create test & training data
set.seed(1001)
testindex <- sample(1:nrow(framingham), round(0.15 * nrow(framingham), 0), replace = FALSE)
train <- framingham[-testindex, ]
test <- framingham[testindex, ]

# first, plain vanilla logistic regression
# (retaining only those variables that are statistically significant)
mod1 <- glm(TenYearCHD ~ ., data = train, family = "binomial")
summary(mod1)

# retain only the features with p < 0.05
vanilla <- glm(TenYearCHD ~ male + age + cigsPerDay + sysBP + glucose,
               data = train, family = "binomial")

# glmnet needs a numeric predictor matrix and a response vector
x.matrix <- model.matrix(TenYearCHD ~ ., data = train)[, -1]
y.val <- train$TenYearCHD

# use cv.glmnet
# this method uses cross-validation to determine the optimal lambda,
# and therefore how many features to retain and what coefficients they get
cv.lasso <- cv.glmnet(x.matrix, y.val, alpha = 1, family = "binomial")
plot(cv.lasso)   # cross-validated deviance as a function of log(lambda)
coef(cv.lasso)

# assess accuracy of the plain vanilla model on the hold-out test data
vanilla.prediction <- predict(vanilla, test, type = "response")

# confusion matrix for plain vanilla log reg
table(test$TenYearCHD, vanilla.prediction >= 0.5)

# accuracy
mean(test$TenYearCHD == (vanilla.prediction >= 0.5))

# test lasso predictions (the test matrix must be built the same way as the training matrix)
test.x.matrix <- model.matrix(TenYearCHD ~ ., data = test)[, -1]
cv.lasso.prediction <- predict(cv.lasso, test.x.matrix, type = "response")

# confusion matrix for lasso log reg with CV to choose lambda and the retained coefficients
table(test$TenYearCHD, cv.lasso.prediction >= 0.5)

# accuracy
mean(test$TenYearCHD == (cv.lasso.prediction >= 0.5))
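One detail worth pausing on: cv.glmnet actually reports two candidate values of lambda. lambda.min minimizes the cross-validated error, while lambda.1se is the most regularized model within one standard error of that minimum, and it is the default that coef() and predict() used above. A quick sketch, using the same cv.lasso object as above, of how to see what each choice retains:

# put the coefficients for both lambda choices side by side
comparison <- cbind(as.matrix(coef(cv.lasso, s = "lambda.min")),
                    as.matrix(coef(cv.lasso, s = "lambda.1se")))
colnames(comparison) <- c("lambda.min", "lambda.1se")
comparison

# count the features each choice retains (excluding the intercept)
sum(as.matrix(coef(cv.lasso, s = "lambda.min"))[-1] != 0)
sum(as.matrix(coef(cv.lasso, s = "lambda.1se"))[-1] != 0)

The more conservative lambda.1se typically zeroes out more features, so it usually gives you the smaller model of the two.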
In this particular case, the old way appears to produce a marginally better model (test-set accuracy of 0.842 vs. 0.838). While I'm not going to assume that this will always be the case, I had fun putting both models to the test!
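One sanity check I would add before declaring a winner (a quick sketch, using the same test split as above): because CHD events are the minority class in this dataset, raw accuracy at a 0.5 cutoff is worth comparing against the no-information baseline of always predicting the majority class.

# accuracy of simply predicting 'no event' for everyone in the test set
mean(test$TenYearCHD == 0)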