A really handy function for cross-validating a set of models from MLR + forward selection

By this it appears that the best model to predict baseball salaries includes 5 features.

By this it appears that the best model to predict baseball salaries includes 5 features.

I’ve been having a lot of fun taking part in a Stanford MOOC called “Introduction to Statistical Learning” offered by Dr. Trevor Hastie and Dr. Robert Tibshirani. The class parallels the book of the same name. The course has been AMAZING so far. And so far we’ve even been blessed with a cameo appearance from another author Daniela Witten! I wonder ,if before it’s over , we’ll hear from Gareth James (another one of the ISLR authors).

Anyhow, the class recently spent some time on model selection. We covered best subset, forward selection, backward selection, ridge regression and the lasso. One of the big takeaways for me has been the value of doing cross-validation (or K-fold cross-validation) to select models (vs. relying on Cp, AIC or BIC). Well, given you can even calculate them (for ridge and lasso, since you don’t really know the number of coefficients, d, you couldn’t even estimate these if you wanted to). That said, Dr. Hastie let the class in on a very handy custom function for cross-validation. I spent some time walking through it and present it (step-by-step) below.

In the example the dependent variable (Salary) is quantitative, and the features (IVs) are of various types. The object is to find the linear model that best captures the relationship b/w the features and Salary. They key here is to avoid over-fitting the model. We seek to optimize the variance-bias tradeoff, and so instead of relying on R-squared or even other corrected/improved indicators like Cp, AIC and BIC, in this case we employ cross-validation to determine the size of the model… I’m becoming a fan of this approach.

First, here’s the chunk all at once:

train=sample(seq(nrow(Hitters)),round(2/3*nrow(Hitters), 0) ,replace=FALSE) # create training sample of size 2/3 of total sample
regfit.fwd=regsubsets(Salary~.,data=Hitters[train,],nvmax=19,method="forward")# fit model on training sample

val.errors=rep(NA,19) # there are 19 subset models due to nvmax = 19 above
x.test=model.matrix(Salary~.,data=Hitters[-train,])# notice the -index!... we are indexing by minus train, so this is our test sammple
for(i in 1:19){
coefi=coef(regfit.fwd,id=i)# returns coeficients only for model model i from 1-19
pred=x.test[,names(coefi)]%*%coefi # names(coefi) pulls only columns from model = i, then matrix multiply by the coeficients from i (coefi)
plot(sqrt(val.errors),ylab="Root MSE",ylim=c(300,400),pch=19,type="b")

Line 1 (Building a place to store stuff)


Line 1 is all about efficiency, what you are doing here is, essentially, building an empty vector of 19 NAs (which will serve as a kind of shelf to store something you will make later on). Those somethings are the 19 test RMSE, that you will produce with your function.

Line 2 (gathering one key ingredient)

x.test=model.matrix(Salary~.,data=Hitters[-train,])# notice the -index!

This is where you create your test set of data. I think the really cool thing going on in this line is how the model.matrix function works. By running the piece of code below, you’ll see that it produces something similar to Hitters[-train,], but by using some additional specification (in the formula Salary~.) you tell it to remove the salary variable and include a vector of 1s which it reserves for an Intercept term (important during the matrix multiplication step).
Try this and you can see how model.matrix does it’s work:


Line 3 (Getting loopy…)

for(i in 1:19){

This line just specifies how you want your loop to run. Remember you are looping from 1-19 because you created 19 distinct models using the nvmax=19 above in regsubsets().

Line 4 (Why looping constructs are cool)


Here you are putting the index (i) that the loop is looping through to good use. As the loop does it’s thing it will use this piece of code to create 19 different vectors, one for each model returned by regsubsets above. Don’t take my word for it, try this:


Line 5 (The magic of matrix multiplication)


Lots of good stuff going on here, one of which being the use of the names() function to ensure that for each model you are only doing multiplication on the correct number of columns, but here was one of my big A-HAs. What you have to remember (or realize) here is that the vector coefi (which you create above) is multiplied by the matrix x.test (which is of dimensions nxp). This will return a vector of nX1. Where each scalar is the prediction value for the model of interest. At the end of the day pred will contain all of your individual yhats. Don’t believe me, try this:


Line 6 (Time to calculate your individual test errors and put them somewhere)


Here is the line where you are putting all of the individual ingredients, making your final product, and putting it away for safe keeping (remember that vector of NA you built earlier). In this piece:


..you are calculating your MSE using your yhats (pred) and your actual ys (Hitters$Salary[-train]). You use the index from your loop to put your MSE away in its respective spot in the vector of NAs with val.errors[i].

The loop will iterate over all your i (1-19), and that’s how it all comes together! Very cool indeed. While you can certainly don’t need to take a look under the hood, I find that to really “get” the pleasure, it helps to stop and think about the complexity. (I think I’m really butchering a Feynman quote there).

The rest of the code just does some plotting so you can visualize at which model size you reach the minimum of RMSE. The overlay of Rsq is a nice touch as well.

Just in case you were curious, the 5 coefficient model happens to be this one (it is important to note that Salary is in $K):

Salary = 145.54 + 6.27*Walks + 1.19*CRuns – 0.803*CWalks – 159.82*DivisionW + 0.372*PutOuts

I thought it was strange that it doesn’t appear to matter in what league you played, but it did matter what division you played in (most likely an AL East “effect”). Anyhow, to learn more about the individual features of the model you can see further documentation on the Hitters dataset here.

Curating data to eliminate partial quarters, months (of even years if that’s your game)…

Step 1: Insert data, Step2: collect money... If it were only that simple...

Step 1: Insert data, Step2: collect money… If it were only that simple…

Often, I’m pulling data in to R from datasets that are updated monthly; however, there are many cases where I am interested in aggregating data by quarters. In these cases I need to make sure that in each aggregation bin, the data represent full quarters (and full months as well–but I tacked that in an earlier post).

Most of the time I use RODBC to bring my data from our warehouse into R, and I suppose you could implement the technique below as part of the data import step, but in this particular case, I implemented the code below after the data had been imported into r.

## simulate some data

n <- 100
date <- seq(as.Date('2011-01-01'),as.Date('2012-04-30'),by = 1)
values <- rnorm(length(date), n, 25)
df1 <- data.frame(date, values)
df1$yrqtr <- as.yearqtr(df1$date)
df1$yrmon <- as.yearmon(df1$date)

tapply(df1$values, df1$yrqtr, sum) # beware the last quarter is incomplete
range(df1$yrqtr) #you can't tell by looking here
range(df1$yrmon) #but you can tell by looking here

# this code fixes your problem
test1 <- as.yearmon(as.Date(max(df1$yrqtr), frac=1)) ## returns the last month of the last QTR that we have data for
result1 <- as.yearmon(as.Date(max(df1$yrqtr), frac=0)) ##  returns the FIRST month ofthe last QTR what we have data for
#if max yearmon in data is not equal to what should be the last yearmon for the last qtr in data
# cut  data to last full quarter 
if(max(df1$yrmon) != test1)
  df1 <- df1[ df1$yrmon < result1 , ]

tapply(df1$values, df1$yrqtr, sum) # this is more like it!

Evaluating across many columns within the same row to extract the first occurrence of a string

Does your job make you feel like this sometimes?? Yup, me too!

Does your job make you feel like this sometimes?? Yup, me too!

Some time ago I was tasked with looking at a set of claims data to pull only claim lines that had a certain diagnosis, and even more importantly, if there were claims where multiple diagnoses were listed, I needed to pull only the “most important” (let’s just call it that). It is not the purpose of this post to debate the consideration of attributing a weighting to the importance of a diagnoses based on the order of appearance in the claim files after the first position, so let’s not even go there. This is just more about applying a method that may have application in a more broader context.

For those of you who may not be familiar, the claim files that I had listed (for each Patient and Service Date) a series of fields indicating different diagnoses (e.g., Principal Diagnosis, Diagnosis Code 2, Diagnosis Code 3). The condition that I was tasked with looking into was made up of a composite of diagnosis codes, and in some cases a patient may have one of the diagnoses codes (making up the composite) in position #1, and another in position #2 (or even position #3). However, we did not want to count patients more than once for each claim line. Rather, we wanted to pull forward the first diagnosis (matching any in the composite set) as we “looked” at the set of diagnoses code columns, working our way down Diagnosis Code Principal, then Diagnosis code #2, and #3.

Enough of the background, it is better to just show you what is going on with some code…

One last note, I used a looping construct in this version of my solution. Recently, I think I’ve stumbled on a way to do the same with one of the apply family of functions.


# simulate some data
Diagnosis_Code_Principal <- sample(c("heart attack", "unstable angina", "broken arm", "stomach ache", "stubbed toe", "ingrown hair", "tooth ache"), 1000, replace=TRUE)
Diagnosis_Code_02 <- sample(c("heart attack", "unstable angina", "broken arm", "stomach ache", "stubbed toe", "ingrown hair", "tooth ache"), 1000, replace=TRUE)
Diagnosis_Code_03 <- sample(c("heart attack", "unstable angina", "broken arm", "stomach ache", "stubbed toe", "ingrown hair", "tooth ache"), 1000, replace=TRUE)
Service_Date_MMDDYYYY <- sample( seq.Date(as.Date("2011/1/1"), as.Date("2011/12/31"), by="day"), 1000, replace=TRUE)
Person_ID <- sample( c("Person1", "Person2", "Person3", "Person4"),1000, replace=TRUE)
eventdata <- data.frame(Person_ID, Service_Date_MMDDYYYY,Diagnosis_Code_Principal, Diagnosis_Code_02, Diagnosis_Code_03)
eventdata <- eventdata[ order(eventdata$Person_ID, eventdata$Service_Date_MMDDYYYY), ]

keydx <- c("heart attack|unstable angina") # these are the "codes" that make up our composite 

eventdata <- unique(eventdata)## duplicate lines are uninformative

eventdata$keep <- grepl((keydx), eventdata$Diagnosis_Code_Principal) | grepl((keydx), eventdata$Diagnosis_Code_02) | grepl((keydx), eventdata$Diagnosis_Code_03) ## more slimming down of the data
eventdata <- eventdata[ eventdata$keep == TRUE, ] 

##  creates a vector where the first match is returned from a set of columns string
firstdx <- rep(0, nrow(eventdata))
for(i in 1:nrow(eventdata)){
  a <- rep(0,3)
  a <- c(eventdata[i, "Diagnosis_Code_Principal"], eventdata[i, "Diagnosis_Code_02"], eventdata[i, "Diagnosis_Code_03"])## you can list as many columns as you like.
  if (any(grepl(keydx, a)))
    b <- a[grep(keydx, a)[1]] # keydx is listing of ICD9s that match dx grouping you are interested in
  if (!any(grepl(keydx, a))) ##  takes care of any cases where there is no match
    b <- NA      
  firstdx[i] <- b
eventdata$firstdx <- firstdx ## add the firstdx column to the dataset

Hacking base R plot code to do what facet does in ggplot2… Going about things the hard way

Another demonstration of doing things the hard way

Another demonstration of doing things the hard way

I have waffled back-and-forth from base plotting to ggplot2 back to base plotting for my every day plotting needs. Mainly, this is because when it comes to customization of all aspects of the plot (esp. the legend) I feel more more in command with the base R plotting code. That said, one of the great benefits of ggplot is efficiency and how the package allows users to do quite a lot with very few lines. I certainly still find ggplot2 to be a very handy package!
One of these killer features is the facet option. To achieve something similar with base R takes quite a bit of code to achieve (as far as I can tell), and while I have managed to hackishly create my own base plot work-around, it certainly is far from elegant.

df1 <- structure(list(yearmon = structure(c(1962.66666666667, 1962.75, 
1962.83333333333, 1962.91666666667, 1963, 1963.08333333333, 1963.16666666667, 
1963.25, 1963.33333333333, 1963.41666666667, 1963.5, 1963.58333333333, 
1962.66666666667, 1962.75, 1962.83333333333, 1962.91666666667, 
1963, 1963.08333333333, 1963.16666666667, 1963.25, 1963.33333333333, 
1963.41666666667, 1963.5, 1963.58333333333, 1962.66666666667, 
1962.75, 1962.83333333333, 1962.91666666667, 1963, 1963.08333333333, 
1963.16666666667, 1963.25, 1963.33333333333, 1963.41666666667, 
1963.5, 1963.58333333333), class = "yearmon"), Drug_Name = c("Agent 1", 
"Agent 1", "Agent 1", "Agent 1", "Agent 1", "Agent 1", "Agent 1", 
"Agent 1", "Agent 1", "Agent 1", "Agent 1", "Agent 1", "Agent 2", 
"Agent 2", "Agent 2", "Agent 2", "Agent 2", "Agent 2", "Agent 2", 
"Agent 2", "Agent 2", "Agent 2", "Agent 2", "Agent 2", "Agent 3", 
"Agent 3", "Agent 3", "Agent 3", "Agent 3", "Agent 3", "Agent 3", 
"Agent 3", "Agent 3", "Agent 3", "Agent 3", "Agent 3"), adjrx = c(18143.5783886275, 
38325.3886392513, 28947.4502791512, 48214.462366663, 43333.2885400775, 
33764.6938232197, 35212.886019669, 36189.6070599246, 28200.3430203372, 
43933.5384644003, 46732.6291571359, 60815.5882493688, 15712.9069922491, 
19251.420642945, 25798.4830512904, 33358.078739438, 44149.0834359141, 
43398.7462134831, 54262.7250247334, 66436.6057335244, 69902.3540414917, 
65782.8992544251, 80473.8038710182, 77450.9502630631, 54513.3449101778, 
69888.3308038326, 73786.2648409879, 108656.505665252, 179029.671628446, 
139676.077252012, 188805.180975972, 199308.502689428, 216174.290372019, 
249180.973882092, 189528.429468574, 261748.967406539)), .Names = c("yearmon", 
"Drug_Name", "adjrx"), class = "data.frame", row.names = c(36L, 
39L, 45L, 38L, 41L, 37L, 42L, 34L, 44L, 40L, 35L, 43L, 20L, 17L, 
18L, 15L, 16L, 12L, 14L, 13L, 19L, 21L, 11L, 10L, 33L, 25L, 24L, 
32L, 23L, 22L, 28L, 26L, 30L, 27L, 31L, 29L))

yrange <- paste('from', paste(range(df1$yearmon), collapse="-"))

yval <- 'adjrx'
loopcol <- 'Drug_Name'
xval <- 'yearmon'
ylabtxt <- 'ADJRx'
xlabtxt <- 'Months'
titletxt <- paste(client, ptargetclass, 'Adjusted Rx by Drug Name by Month\n from', yrange)

# ppi <- 300
# png(paste(client, targetclass, 'Drug_Name_adjrx_bymo.png', sep="_"), width=10*ppi, height=6*ppi, res=ppi)
par(mar=c(5.1, 4.1, 4.1, 12.2))
ymax <-  max( df1[c(yval)])+(0.1* max( df1[c(yval)]))
ymin <-  min( df1[c(yval)])-(0.1* min( df1[c(yval)]))
xmax <- max(df1[,c(xval)])
xmin <- min(df1[,c(xval)])
loopvec <- unique(df1[,loopcol])

cpal <- brewer.pal(length(loopvec), 'Set2')
plot( df1[,xval], df1[,yval],yaxt='n', xaxt='n', ylab="", xlab="", ylim=c(ymin,ymax))
u <- par("usr")
rect(u[1], u[3], u[2], u[4], col = "gray88", border = "black")
abline(h = pretty(ymin:ymax), col='white') ##  draw h lines
abline(v = (unique(df1[,xval])), col='white') ##  draw v lines
for (i in 1:length(loopvec)){
  loopi <- loopvec[i] ##  calls variable to be plotted
  sgi <- df1[ df1[,c(loopcol)] == loopi, ]
  sgi <- sgi[order(sgi[,c(xval)]),]
  plot( sgi[,xval], sgi[, yval], type="o", col=cpal[i], lwd=2, lty=i,yaxt='n', cex.axis=.7, cex.sub=.8, cex.lab=.8, xlab= "", ylab="", ylim=c(ymin,ymax), xlim=c(xmin, xmax)  ,sub=paste('' ), cex.sub=.8)
  if (i < length(loopvec))
##draw OLS for total

axis(side=2, at= pretty(range(ymin, ymax)), labels=pretty(range(ymin, ymax)), cex.axis=.75, )
mtext(ylabtxt, side=2, line=2, cex.lab=1,las=2,  las=0, font=2)
mtext(xlabtxt, side=1, line=2, cex.lab=1,las=2,  las=0, font=2)
mtext( titletxt, side=3, line=1, font=2, cex=1.2)
legend(xpd=TRUE,'right', inset=c(-0.30,0), legend=loopvec, lwd=rep(2, length(loopvec)), pch=rep(1, length(loopvec)), col=cpal, lty= 1:length(loopvec) ,title="LEGEND", bty='n' , cex=.8)
# dev.off()

Forecasting Atmospheric CO2 levels out to 2023 with R and the forecast package

globalwarming_noticeCO2ts_forecast CO2ts_decompose Some time ago, I stumbled upon Rob Hyndman and George Athana¬≠sopou¬≠los’s online text “Forecasting: principles and practice”, and I’ve been really excited to put Rob Hyndman’s forecast package to use, which I did with some client data that I had lying around. Since I could not post the details of that excercise here, I went in search of some publicly available data that I could showcase, and I stumbled upon the Mauna Loa observatory atmospheric CO2 readings. These data describe average atmospheric CO2 readings that go back as far as March of 1958. After decomposing the time series with STL decomposition, I used forecast to forecast what the next 10 years may have in store, given the current trends and pattern of seasonality. There is nothing new or interesting about the trend (check out the trend portion of the decomposition, we’ve all seen these data already reported in one form or another), and initially, while I thought that the pattern of the RATIO of “high-side” remainders to “low-side” remainders appeared to have changed in the early vs. the later years (check out the remainder portion of the decomposition). Where, in the early years, the “high side” remainders were more frequent and of greater relative magnitude then the “low-side” remainders. I did some back-of-the-envelope analysis of the remainders and found that this initial thought doesn’t appear to hold up (code below).


file <- "ftp://ftp.cmdl.noaa.gov/ccg/co2/trends/co2_mm_mlo.txt"
xnames <- c("year", "month", "dedcimaldate", "average", "interpolated", "trend_sc", "days_data")

raw_data <- read.table(file, header=FALSE, sep="", skip=72, col.names=xnames)

mintvec <- c( as.numeric((raw_data$year[1])), as.numeric((raw_data$month[1])))
maxtvec <- c( as.numeric((raw_data$year[nrow(raw_data)])), as.numeric((raw_data$month[nrow(raw_data)])))
CO2ts <- ts(raw_data$interpolated, start=(mintvec), end=(maxtvec), frequency=12)

startdate <- as.yearmon(paste(start(CO2ts)[1], start(CO2ts)[2], sep="-"))
enddate <- as.yearmon(paste(end(CO2ts)[1], end(CO2ts)[2], sep="-"))
daterange <- paste(startdate, enddate, sep=" to ")
plot(CO2ts,xlab="Year", ylab= "Atmospheric CO2 (ppm)" ,main=paste("Average Atmospheric CO2 Mauna Loa Observatory", daterange))

fit <- stl(CO2ts, s.window=12, robust=TRUE)

plot(fit, main="Decomposition of Average Atmospheric CO2")

fcast <- forecast(fit, h=120,level=c(50,95), fan=TRUE)
plot(fcast, xlab="Year", ylab= "Atmospheric CO2 (ppm)" ,main=paste("Forecasted Average Atmospheric CO2 Mauna Loa Observatory", daterange))
u <- par("usr")
rect(u[1], u[3], u[2], u[4], col = "gray88", border = "black")
abline(v = pretty(pretty(time(CO2ts))), col='white') ## draw h lines
xrange <- c(range(CO2ts), range(fcast$mean))
xrange <- range(xrange)
abline(h = (pretty(xrange)), col='white') ## draw v lines
plot(fcast, xlab="Year", ylab= "Atmospheric CO2 (ppm)" ,main=paste("Forecasted Average Atmospheric CO2 Mauna Loa Observatory", daterange))

V1 = seq(from=as.yearmon("Mar 1958"), to=as.yearmon("Aug 2013"), by=1/12)
dfrem <- data.frame( yearmon = V1, remainder = fcast$residuals)
plot(dfrem$remainder, dfrem$V1)
abline(lm(dfrem$remainder~dfrem$yearmon), col='black', lty=3, lwd=2) 

Creating a vector of string or numeric from a factor vector using the factors levels: a transformation of sorts


I’ve used this handy bit of code quite a bit. It comes in handy when I want to use something that has been coerced to type=factor. I run into this quite a bit after a table() call when I want to do something more interesting with the result (e.g., convert to a data.frame and sort or select rows by the bin name).

vec <- round(12 + rnorm(100, 5, 7), 0)
t1 <- table(vec)
df1 <- data.frame(t1)
df1$vec <- as.numeric(levels(df1$vec))[df1$vec]

Histogram with density plot overlay (and fancy ggplot-esque background + summary data where legend goes)

testplot_2013_09_13I recently had to visualize some data for a client that involved identifying the number of members that were under the age of 18. Thanks to some help from Robert Kabacoff’s Quick-R site, I put together a histogram with density plot overlay. This is how I did it:

agedata <- round(20 + rnorm(1000, 25, 20),0)

xrange <- "Jan-2013 to Jun-2013"
par(mar=c(5.1, 4.1, 4.1, 11.5))
x1 <- hist( agedata, breaks = 100, xlim=range(0, 100), lwd=2,yaxt='n', xaxt='n', ylab="", xlab="", main="")
hist( agedata, breaks = 100, xlim=range(0, 100), lwd=2,yaxt='n', xaxt='n', ylab="", xlab="", main="")
u <- par("usr")
rect(u[1], u[3], u[2], u[4], col = "gray88", border = "black")
abline(h = pretty(x1$counts), col='white') ## draw h lines
abline(v = (pretty(range(1,100))), col='white') ## draw v lines
hist(agedata, breaks=100, xlim=range(0, 100), ylim=range(u[3:4]), col="yellow", xlab="Age", main=paste("Drug Users by Age fills from", xrange) )
xx1 <- data.frame(table(agedata)) ## I will use this next chunk of code to "color in" only those bars that are below 18years to highlight them
xx1$agedata <- as.numeric(levels(xx1$agedata))[xx1$agedata]
xx1 <- xx1[order(xx1$agedata),]
xx1 <- xx1[xx1$agedata < 18,]
vec <- NULL
for( i in 1:nrow(xx1)){ ## this is where I create my new data
run <- rep(xx1$agedata[i], xx1$Freq[i])
vec <- c(vec, run)
hist(vec, breaks=c(x1$breaks[x1$breaks<18]) ,xlim=range(0, 100), ylim=range(u[3:4]), col="darksalmon", xlab="", main= "" ) # notice I have to use breaks from the previous hist call to ensure that the breaks for the new hist line up with the previous.
abline(v=17, lwd=2, lty=2, col="red" )
count <- nrow(genage)
avage <- round(mean(agedata),2)
medage <- round(median(agedata), 2)
sdage <- round(sd(agedata),2)
# text(xpd=TRUE, x= 35, y=35, offset=-.2, labels=paste("Count of Users:", count, "\n", "Mean Age:", avage, "\n", "Median Age:", medage, "\n", "St. Dev.:", sdage), font=2)
legend(xpd=TRUE,'topright', inset=c(-0.3,0), legend=paste(" Count of Users:", count, "\n", "Mean Age:", avage, "\n", "Median Age:", medage, "\n", "St. Dev.:", sdage), pch=NA, ,title="Summary", bty='n' , cex=1) ## you are going to have to create the strings that go into the legend youself, I used some that I had previously created
d <- density(agedata)
plot(d, main="", xlim=range(0,100), xlab="", ylab="", xaxt='n', yaxt="n")