What is the true underlying functional form of the relationship between LDL and CV events in secondary prevention?

Take it from a true bonehead (just ask my wife or kids...)

Take it from a true bonehead (just ask my wife or kids…)

As I mentioned in an earlier post, I’ve spent some time thinking about the relationship between low density lipoprotien (LDL)–or bad cholesterol–and CV events.  Now that I’ve got a few minutes to spare and the kids are in bed, it’s about time I organize those thoughts and pen a second post (in a three–or maybe four–post series) about them.

In my previous post, I extracted some data from previously published studies and recreated, what I consider to be, one of the most prolific meta-visualizations of trial evidence in cardiovascular disease. To be perfectly honest, my reasons for doing so were not just to have fun with R (I know–it’s a sickness), but to explore the underlying functional form of the model itself. To be specific, the TNT authors and at least one other analyses of similar data appear to all make the same assumption with respect to the linearity of the relationship between LDL and CV events (i.e., not only is lower better, but lower is better at a constant rate). In my review of the literature–which I will admit, may not be the most extensive–I could find only one report from the Cleveland Clinic Journal of Medicine which clearly suggested a nonlinear relationship between LDL and CV events (albeit across a mixture of primary and secondary prevention studies).

Practically speaking, the assumption of linearity may not result in any issues so long as readers confine their interpretations to a “reasonable” range of LDL levels, and practice caution not to extrapolate beyond the range of evaluated LDL values; however, newer–yet to be marketed–cholesterol lowering agents may make it possible for clinicians to take patient LDL levels to lows not readily achieved with any existing medications. In trying to assess the value of these new cholesterol lowering medications (provided they make it to approval), reimbursement agencies and health technology assessment organizations will have to rely on modeling to evaluate the long term clinical benefits. As a health outcomes researcher, I am acutely aware of how assumptions like the one I mention above can substantially alter the outcome of any modeling exercises. Especially in cases that depend on extraplating beyond the bounds of which you have data, ensuring you have the right model is extremely important.  For this reason, it is vial that we re-consider the implied assumption of linearity and at the very least apply sensitivity analyses when modeling any outcomes dependent upon the relationship of LDL  and CV events.

Since my previous post, I’ve extracted data from a few more secondary prevention trials (PROVE-IT and IDEAL). For the most part, the outcomes extracted include only those that captured CV death or non-fatal MI. You may notice that I did not include AtoZ. This was an intentional omission as the composite end point in this trial included hospital readmission for ACS.

First, take a look at the bivariate plot of LDL to CV events. Through this plot, you will notice that I have drawn both the OLS regression line (black solid) AND the quadratic regression line (red hashed).

file <- "http://dl.dropboxusercontent.com/u/27644144/secprevstatin.csv"
statins <- read.csv(file, header=T) # read raw data from dropbox
statins <- statins[ which(statins$Study != “AtoZ”),] # remove AtoZ
statins$year <- as.numeric(substr(statins$lastrand, 1, 4)) + round(as.numeric(substr(statins$lastrand, 5, 6))/13, 2) # code year -- to be used later as feature

df1 <- statins
yval <- 'eventrate'
pyval <- 'Event (%)'
xval <- 'LDL'
pxval <- 'LDL Cholesterol (mg/dL)'
par(mar=c(5.1, 4.1, 4.1, 12))
df1$pchvec <- ifelse( grepl(“PBO”, df1$Cohort), 1, 19 )
plot( df1[, xval], df1[, yval], type=“n”, ylim=c(0,30), xlim=c(40,210), yaxt='n', xaxt='n', ylab=““, xlab=““)
u <- par(“usr”)
rect(u[1], u[3], u[2], u[4], col = “grey95”, border = “black”)
abline(h = c(0, 5, 10, 15, 20, 25, 30), col='white', lwd=2) ##  draw h lines
abline(v = c(50,70, 90, 110, 130, 150, 170, 190, 210), col='white', lwd=2) ##  draw v lines
plot( df1[, xval], df1[, yval], pch=df1$pchvec, col= cpal , bg= df1$pchfill, cex=1.5  ,yaxt='n', xaxt='n', ylab=““, xlab=““, ylim=c(0,30), xlim=c(40,210), lwd=2)
axis(side=2, at=c(0, 5, 10, 15, 20, 25, 30), labels=c(“0”,”5”,'10', '15', '20', '25', '30'), las=1 )
axis(side=1, at=c(50, 70, 90, 110, 130, 150, 170, 190, 210), labels=c(“50” , “70”, “90”, “110”, “130”, “150”, “170”, “190”, “210”)  )
legend( “topleft”, pch=c(19, 1), legend=c(“Statin”, “Placebo”), cex=1.2, border= “n”, bty='n')
text(df1[, xval], df1[, yval], labels = df1$Study, pos = 3, font=2, col=cpal)
title(main=“Event Rates Plotted against LDL Cholesterol Levels\nduring Statin Therapy in Secondary-Prevention Studies 4S to PROVE-IT.”, ylab=pyval, xlab=pxval)
abline(lm(df1$eventrate~df1$LDL), lwd=2)
poly.plot <- lm(eventrate~poly(LDL, 2), data=df1)
poly.pred <- predict(poly.plot, df1[xval])
preddf <- data.frame(x = df1[,xval], y=poly.pred)
preddf <- preddf[ order(preddf$x),]
lines(x=preddf$x, y=preddf$y, type=“l”, col=“red”, lwd=3, lty=2)

From 4S to PROVE-IT

From 4S to PROVE-IT

Next, let’s estimate the RMSE for the linear and quadratic functions.

lmmod <- lm(eventrate~LDL, data=statins)
poly <- lm(eventrate~poly(LDL, 2), data=statins)

lmpred <- predict(lmmod, statins)
lmquad <- predict(poly, statins)

rmse.linear <- sqrt(mean((statins$eventrate-lmpred)^2))
rmse.quad <- sqrt(mean((statins$eventrate-lmquad )^2))
#[1] 2.339196
#[1] 1.994463

While I walked through the steps, it should come as no surprise that the model with more parameters did a better job fitting the data. The question now is: did the quadratic model overfit the data? Some additional interesting questions might be to consider how valid it might be to include some of the older studies (e.g., 4S) in an analysis like this? When the 4S trial was conducted, many advances in the pharmacologic treatment of high blood pressure and other risk factors known to reduce the risk of a second heart attack were yet to become part of common clinical practice (e.g., post-MI discharge order sets including BB, ACE/ARB, and antiplatelet therapy). Is there a way to use statistics to support removing 4S from our analysis? To take a stab at answering those questions and more, I will walk through some additional analyses in my next 2 or so posts.

Cleaning up unruly data – flattening a delimited column of composite string data

While there may be more  efficient ways of eating watermelon, there are none quite as entertaining...

While there may be more efficient ways of eating watermelon, there are none quite as entertaining…

I found myself dealing with some exceptionally unruly data that was part of a clinical table. There was something about my predicament that had me wondering why I had not stubbed my toe on such an issue earlier. The elegance of the solution–and the knowledge that this may come in handy again some time–prompted me to pen a quick post.

What I had was a table where there was one column that included multiple lookup values and those values were in a continuous string separated by commas. Such data is a nightmare to use in an analysis, but I can easily see how such a table would be created. This kind of a layout (entering string data in one cell, separated by commas) is intuitive to a clinician typing such a table into an excel spreadsheet.

To be specific, in my case, I was dealing with a table of drug and Jcode/HCPCS data where any one drug could correspond to multiple J/HCPCS codes. The column where the Jcodes/HCPCS codes resided contained cells that looked something like this: “ABC123, XYZ456, etc., etc.”. For my end product, I needed a table where each J/HCPCS code lived in it’s own cell next to the individual drug names (which could be listed multiple times).

I found this most excellent post on stackoverflow that led me out of my mess. Below I provide a toy example that illustrates how I implemented the solution myself.

Where the magic happens is in the use of sapply to return the individual lengths of each parsed string and how this vector of lengths is then part of a call to rep() using the vector of drug names. PURE.POETRY. I find it inspiring that just ONE LINE of code has the power to unmangle such a horrible mess. It highlights how useful list structures are and how they can be manipulated.

v1=c("drugA","drugB", "drugC")
v2=c("J1234, S5105", "J8499", "12345, J9999, X5555")
df = data.frame(drugs = v1 , hcpcs.jcodes = v2)

codes <- strsplit(df$hcpcs.jcodes, &quot;, &quot;)
data.frame(drugs = rep(df$drugs, sapply(codes, length)), jcodes.hcpcs = unlist(codes))#... magic happens here!

Recreating one of the most reproduced plots in the history of hypercholesterolemia clinical trails (using R)

The one-and-only Figure 4. from LaRossa et al.

The one-and-only Figure 4. from LaRossa et al.


Statins... Saving lives since 1987.

Statins… Saving lives since 1987.

What you are looking at is quite possibly the most replicated statin study data visualization of all time. If you have been to more than one lecture on either primary-or-secondary prevention with statins, you’ve seen this plot, or some adaptation. It comes from page 1434 of the New England Journal of Medicine volume 352, issue 14. It was used in the discussion section by the authors of the TNT trial (LaRosa et al. N Engl J Med. 2005 Apr 7;352(14):1425-35.) to put the results of TNT into the broader context of the extant secondary prevention literature of the day. I’ve been thinking a lot about these data recently (for reasons which I’ll leave for another post), and wanted to manipulate some of it in R, and as a part of that exercise I decided to re-create (as best I could using only data form the published trials) and perhaps improve upon the plot.

options(stringsAsFactors=F, scipen=999)
file <- "http://dl.dropboxusercontent.com/u/27644144/secprevstatin.csv" # data is here--I extracted it the best I could from the various landmark trials
statins <- read.csv(file, header=T)

##  slightly tweaked (I think this just looks better)

df1 <- statins
yval <- 'eventrate'
pyval <- 'Event (%)'
xval <- 'LDL'
pxval <- 'LDL Cholesterol (mg/dL)'

##  slightly tweaked (I think this just looks better)
par(mar=c(5.1, 4.1, 4.1, 12))
df1$pchvec <- ifelse( grepl("PBO", df1$Cohort), 1, 19 )
plot( df1[, xval], df1[, yval], type="n", ylim=c(0,30), xlim=c(40,210), yaxt='n', xaxt='n', ylab="", xlab="")
u <- par("usr")
rect(u[1], u[3], u[2], u[4], col = "grey95", border = "black")
abline(h = c(0, 5, 10, 15, 20, 25, 30), col='white', lwd=2) ##  draw h lines
abline(v = c(50,70, 90, 110, 130, 150, 170, 190, 210), col='white', lwd=2) ##  draw v lines
plot( df1[, xval], df1[, yval], pch=df1$pchvec, col= cpal , bg= df1$pchfill, cex=1.5  ,yaxt='n', xaxt='n', ylab="", xlab="", ylim=c(0,30), xlim=c(40,210), lwd=2)
axis(side=2, at=c(0, 5, 10, 15, 20, 25, 30), labels=c("0","5",'10', '15', '20', '25', '30'), las=1 )
axis(side=1, at=c(50, 70, 90, 110, 130, 150, 170, 190, 210), labels=c("50" , "70", "90", "110", "130", "150", "170", "190", "210")  )
legend( "topleft", pch=c(19, 1), legend=c("Statin", "Placebo"), cex=1.2, border= "n", bty='n')
text(df1[, xval], df1[, yval], labels = df1$Study, pos = 3, font=2, col=cpal)
title(main="Figure 4. Event Rates Plotted against LDL Cholesterol Levels\nduring Statin Therapy in Secondary-Prevention Studies.", ylab=pyval, xlab=pxval)
abline(lm(df1$eventrate~df1$LDL), lwd=2)
poly.plot <- lm(eventrate~poly(LDL, 2), data=df1)
poly.pred <- predict(poly.plot, df1[xval])
preddf <- data.frame(x = df1[,xval], y=poly.pred)
preddf <- preddf[ order(preddf$x),]
lines(x=preddf$x, y=preddf$y, type="l", col="red", lwd=3, lty=2)

Predicting 10-year CV risk with Framingham data using logistic regression: The ‘good old’ way vs. LASSO–two competing models

When models compete, it can get pretty ugly (or... let's face it, kinda ridiculous actually)

When models compete, it can get pretty ugly (or… let’s face it, kinda ridiculous actually)

I have recently been thinking a lot about both CV risk modeling (esp. given some of the controversy the newest ATP 10-year ASCVD risk calculator has ignited) and regression methods with cross-validation. Now, putting the particular debate around the newer ATP 10-year risk modeler aside, I found myself wondering how different ways of applying regression methods to a problem can result in models with different predictive performance. Specifically, what I’m referring to is comparing the performance of the ‘good old’ method of model selection–and by that I mean the way I was taught to apply it in graduate school–to a ‘newer’ regression shrinkage method (like LR with lasso). Now, when I say the “good old” way, I’m referring to a method that involves: 1) throwing all the features into the model, 2) evaluating the significance of each feature for selection, and 3) re-specifying the model with only those features you retained from step2. Also, while I might call LR with LASSO a ‘newer’ method, I realize that for many, there would be nothing novel about it at all.

One of the main reasons I am starting to develop a healthy skepticism about the traditional methods concerns ending up with models that contain many features, therefore resulting in models with high variance (and a sub-optimal bias-variance tradeoff). That makes a lot of intuitive sense to me; however, I’d love to be able to observe the difference as it would relate to a problem we might face in the health outcomes field. To that end, I’ve decided to apply both techniques to a cut of Framingham data. The main question being: Which model would do a better job at predicting 10-year CV risk? So, let’s put both methods to the test! May the best model win!

# load framingham dataset
framingham <- read.csv("~/R_folder/logreg/data/framingham.csv")
# remove observations with missing data for this example 
framingham <- na.omit(framingham)

# create test & training data
testindex <- sample(1:nrow(framingham), round(.15*nrow(framingham), 0), replace=FALSE)
train <- framingham[-testindex, ]
test <-  framingham[testindex, ]

# first plain vanilla logistic regression (retaining only those variables that are statistically significant)
mod1 <- glm(TenYearCHD~., data=train, family="binomial")
summary(mod1) # retain only the features with p <0.05
vanilla <- glm(TenYearCHD~male+age+cigsPerDay+sysBP+glucose, data=train, family="binomial")

# use cv.lasso
# this method uses cross-validation to determine the optimal lambda 
# and therefore how many features to retain, and what coeffients
cv.lasso <- cv.glmnet(x.matrix, y.val, alpha=1, family="binomial")
plot(cv.lasso, xvar="lambda", label=TRUE)

# assess accuracey of model derived from plain vanilla
# use your hold out test data
vanilla.prediction <- predict(vanilla, test, type="response")
# confusion matrix for plain vanilla log reg
table(test$TenYearCHD, vanilla.prediction >= 0.5)
# accuracy
mean(test$TenYearCHD == (vanilla.prediction >= 0.5))

# test lasso predictions
cv.lasso.prediction <- predict(cv.lasso, test.x.matrix, type="response")
# confusion matrix for lasso log reg with CV to choose lambda and best coefficients
table3 <- table(test$TenYearCHD, cv.lasso.prediction>=0.5 )
# accuaracy
mean(test$TenYearCHD == (cv.lasso.prediction>=0.5))

In this particular case, it appears to me that the old way produces a better model (accuracy of 0.842 vs. 0.838). While I’m not going to assume that this will always be the case, I had fun putting both models to the test!

Cleaning up detailed D.0 compound claim files–many columns of data into one cell

compound_medicinesOne of the burning client issues that I find I am focused on–as of late–is a disturbing increase in the proportion of very high cost compounded drug claims. While I am not going to spend any time debating the clinical value of these claims, I think anyone would agree that a good first step in forming an evidence-based approach to developing an opinion on these claims is to examine them (How many do you see? What do they cost? Does the costing basis make sense to you? What ingredients go into them? etc.,).

In the new D.0 claim format, compounding pharmacies are given the ‘opportunity’ to list each individual ingredient used to make up the compounded product. So one of the things you may want to do is take all of the line-by-line ingredients that make up a claim and create data where each set of ingredients are aggregated into one line. This is a helpful chunk of code that does just that. It makes use of both Hadley’s dcast() function–part of the reshape2 package, and the apply() function (one of a very handy set of _apply functions in R).

#Make some data
ingredient_name <- c("Drug A Phosphate Cap", "Flavor", "*Distilled Water*", 
"Drug X Inj Susp", "Lidocaine", 
"Super Drug HCl Liquid", "Not that great Inj Susp", 
"Antibiotic HCl Cap", "Drug A HCl Liquid", "Antifungal (Bulk)", 
"Table Salt Bicarbonate (Bulk)", "Antifungal (Bulk)", "Table Salt Bicarbonate (Bulk)", 
"Drug A Phosphate Cap", "Flavors", "*Distilled Water*", 
"Drug A Phosphate Cap", "Antifungal (Bulk)", "Table Salt Bicarbonate (Bulk)", 
"Super Drufg Acetonide Inj Susp", "Emollient**", 
"Antifungal (Bulk)", "Table Salt Bicarbonate (Bulk)")
 claim_id <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 8, 8, 9, 
9, 10, 10)
df1 <- data.frame(claim_id, ingredient_name)
#Let's look at what you have
df2 <- dcast(df1, claim_id~ingredient_name)
#NOTE: fantastic side effect of dcast is that it alphabetizes columns by column name
#This ensures that for different claims the ordering of the ingredients follows the same rules!
cols <- names(df2)[-1]
df2$all_ingred <- apply( df2[ , cols ] , 1 , paste , collapse = ", " ) # combine all the columns into one
df2$all_ingred<- gsub("NA, |NA", "", df2$all_ingred)#clean up
df2$all_ingred<- gsub(", $", "", df2$all_ingred)#more clean up
df3 <- df2[c("claim_id", "all_ingred")]#even more clean up
#Now give it a look

A really handy function for cross-validating a set of models from MLR + forward selection

By this it appears that the best model to predict baseball salaries includes 5 features.

By this it appears that the best model to predict baseball salaries includes 5 features.

I’ve been having a lot of fun taking part in a Stanford MOOC called “Introduction to Statistical Learning” offered by Dr. Trevor Hastie and Dr. Robert Tibshirani. The class parallels the book of the same name. The course has been AMAZING so far. And so far we’ve even been blessed with a cameo appearance from another author Daniela Witten! I wonder ,if before it’s over , we’ll hear from Gareth James (another one of the ISLR authors).

Anyhow, the class recently spent some time on model selection. We covered best subset, forward selection, backward selection, ridge regression and the lasso. One of the big takeaways for me has been the value of doing cross-validation (or K-fold cross-validation) to select models (vs. relying on Cp, AIC or BIC). Well, given you can even calculate them (for ridge and lasso, since you don’t really know the number of coefficients, d, you couldn’t even estimate these if you wanted to). That said, Dr. Hastie let the class in on a very handy custom function for cross-validation. I spent some time walking through it and present it (step-by-step) below.

In the example the dependent variable (Salary) is quantitative, and the features (IVs) are of various types. The object is to find the linear model that best captures the relationship b/w the features and Salary. They key here is to avoid over-fitting the model. We seek to optimize the variance-bias tradeoff, and so instead of relying on R-squared or even other corrected/improved indicators like Cp, AIC and BIC, in this case we employ cross-validation to determine the size of the model… I’m becoming a fan of this approach.

First, here’s the chunk all at once:

train=sample(seq(nrow(Hitters)),round(2/3*nrow(Hitters), 0) ,replace=FALSE) # create training sample of size 2/3 of total sample
regfit.fwd=regsubsets(Salary~.,data=Hitters[train,],nvmax=19,method="forward")# fit model on training sample

val.errors=rep(NA,19) # there are 19 subset models due to nvmax = 19 above
x.test=model.matrix(Salary~.,data=Hitters[-train,])# notice the -index!... we are indexing by minus train, so this is our test sammple
for(i in 1:19){
coefi=coef(regfit.fwd,id=i)# returns coeficients only for model model i from 1-19
pred=x.test[,names(coefi)]%*%coefi # names(coefi) pulls only columns from model = i, then matrix multiply by the coeficients from i (coefi)
plot(sqrt(val.errors),ylab="Root MSE",ylim=c(300,400),pch=19,type="b")

Line 1 (Building a place to store stuff)


Line 1 is all about efficiency, what you are doing here is, essentially, building an empty vector of 19 NAs (which will serve as a kind of shelf to store something you will make later on). Those somethings are the 19 test RMSE, that you will produce with your function.

Line 2 (gathering one key ingredient)

x.test=model.matrix(Salary~.,data=Hitters[-train,])# notice the -index!

This is where you create your test set of data. I think the really cool thing going on in this line is how the model.matrix function works. By running the piece of code below, you’ll see that it produces something similar to Hitters[-train,], but by using some additional specification (in the formula Salary~.) you tell it to remove the salary variable and include a vector of 1s which it reserves for an Intercept term (important during the matrix multiplication step).
Try this and you can see how model.matrix does it’s work:


Line 3 (Getting loopy…)

for(i in 1:19){

This line just specifies how you want your loop to run. Remember you are looping from 1-19 because you created 19 distinct models using the nvmax=19 above in regsubsets().

Line 4 (Why looping constructs are cool)


Here you are putting the index (i) that the loop is looping through to good use. As the loop does it’s thing it will use this piece of code to create 19 different vectors, one for each model returned by regsubsets above. Don’t take my word for it, try this:


Line 5 (The magic of matrix multiplication)


Lots of good stuff going on here, one of which being the use of the names() function to ensure that for each model you are only doing multiplication on the correct number of columns, but here was one of my big A-HAs. What you have to remember (or realize) here is that the vector coefi (which you create above) is multiplied by the matrix x.test (which is of dimensions nxp). This will return a vector of nX1. Where each scalar is the prediction value for the model of interest. At the end of the day pred will contain all of your individual yhats. Don’t believe me, try this:


Line 6 (Time to calculate your individual test errors and put them somewhere)


Here is the line where you are putting all of the individual ingredients, making your final product, and putting it away for safe keeping (remember that vector of NA you built earlier). In this piece:


..you are calculating your MSE using your yhats (pred) and your actual ys (Hitters$Salary[-train]). You use the index from your loop to put your MSE away in its respective spot in the vector of NAs with val.errors[i].

The loop will iterate over all your i (1-19), and that’s how it all comes together! Very cool indeed. While you can certainly don’t need to take a look under the hood, I find that to really “get” the pleasure, it helps to stop and think about the complexity. (I think I’m really butchering a Feynman quote there).

The rest of the code just does some plotting so you can visualize at which model size you reach the minimum of RMSE. The overlay of Rsq is a nice touch as well.

Just in case you were curious, the 5 coefficient model happens to be this one (it is important to note that Salary is in $K):

Salary = 145.54 + 6.27*Walks + 1.19*CRuns – 0.803*CWalks – 159.82*DivisionW + 0.372*PutOuts

I thought it was strange that it doesn’t appear to matter in what league you played, but it did matter what division you played in (most likely an AL East “effect”). Anyhow, to learn more about the individual features of the model you can see further documentation on the Hitters dataset here.

Curating data to eliminate partial quarters, months (of even years if that’s your game)…

Step 1: Insert data, Step2: collect money... If it were only that simple...

Step 1: Insert data, Step2: collect money… If it were only that simple…

Often, I’m pulling data in to R from datasets that are updated monthly; however, there are many cases where I am interested in aggregating data by quarters. In these cases I need to make sure that in each aggregation bin, the data represent full quarters (and full months as well–but I tacked that in an earlier post).

Most of the time I use RODBC to bring my data from our warehouse into R, and I suppose you could implement the technique below as part of the data import step, but in this particular case, I implemented the code below after the data had been imported into r.

## simulate some data

n <- 100
date <- seq(as.Date('2011-01-01'),as.Date('2012-04-30'),by = 1)
values <- rnorm(length(date), n, 25)
df1 <- data.frame(date, values)
df1$yrqtr <- as.yearqtr(df1$date)
df1$yrmon <- as.yearmon(df1$date)

tapply(df1$values, df1$yrqtr, sum) # beware the last quarter is incomplete
range(df1$yrqtr) #you can't tell by looking here
range(df1$yrmon) #but you can tell by looking here

# this code fixes your problem
test1 <- as.yearmon(as.Date(max(df1$yrqtr), frac=1)) ## returns the last month of the last QTR that we have data for
result1 <- as.yearmon(as.Date(max(df1$yrqtr), frac=0)) ##  returns the FIRST month ofthe last QTR what we have data for
#if max yearmon in data is not equal to what should be the last yearmon for the last qtr in data
# cut  data to last full quarter 
if(max(df1$yrmon) != test1)
  df1 <- df1[ df1$yrmon < result1 , ]

tapply(df1$values, df1$yrqtr, sum) # this is more like it!