Why you should try to use tapply, vs. summaryBy

I was reviewing a helpful blog post regarding tapply (http://www.sigmafield.org/2009/09/20/r-function-of-the-day-tapply/), when I got hit with a thought: Should I be using tapply in cases where I use summaryBy?

In much of my data manipulation and munging, I use the summaryBy function from the doBy package quite a bit. One downside of summaryBy, I find, is that it can be quite slow. That brings me back to my thought on tapply. What I have come to realize is that, generally, the closer you stick to base functions, the faster your code will run. Since tapply is a base R function, I ran a speed test on a data frame with roughly 1.6 million observations to see what the performance difference might be. Below are my results:

system.time(tapply(elig$Eligibility.Date, elig[c("Eligibility.Date", "form")], FUN=length))
user system elapsed
45.34 0.00 45.39
system.time(summaryBy(~Eligibility.Date+form, data=elig, FUN=length))
user system elapsed
151.97 0.03 152.33

As you can see, tapply runs over 3x as fast (about 45 seconds elapsed versus about 152)! In addition, the format of the output is better suited to my particular application of these data. So, bottom line: try to stick with base functions when possible, and before reaching for summaryBy, try tapply.
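For anyone who wants to try the tapply call without my data, here is a minimal sketch on a made-up data frame. The elig data below is synthetic (only the column names match mine), and the conversion to a long data frame with as.table is just one way to get one row per group; it was not part of the timing test above.

```r
# Synthetic stand-in for the elig data frame: the column names are borrowed
# from the post, but every value here is made up for illustration.
set.seed(1)
month_starts <- seq(as.Date("2009-01-01"), by = "month", length.out = 12)
elig <- data.frame(
  Eligibility.Date = sample(month_starts, 1000, replace = TRUE),
  form = sample(c("A", "B", "C"), 1000, replace = TRUE)
)

# Same call shape as in the timing test: count rows per date/form combination.
# The result is a 2-d array with dates as rows and forms as columns;
# combinations that never occur come back as NA.
counts <- tapply(elig$Eligibility.Date,
                 elig[c("Eligibility.Date", "form")],
                 FUN = length)

# If one row per group is preferred, the array can be flattened
# into a long data frame (the count column is named Freq).
counts_long <- as.data.frame(as.table(counts))

print(counts)
print(head(counts_long))
```

Wrapping the tapply call in system.time(), as above, is an easy way to check whether the speed difference holds on your own data.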


Some links to useful plot() tips

I’ve been working with some basic plots to visualize time series data in one of my projects. I’ve always had a hard time recalling exactly how to take things beyond the simple plot default settings, so here are some links that I found quite helpful.

First, the tried and true Quick-R site, for creating a y-axis on the right side of your plot (using side=):


Next, this R-bloggers post on creating multiple y-axes:


Then there is this link from Revolutions that contains some very useful information about resolution for output:


Lastly, synthesizing information from the R-bloggers link, I have found the pretty function VERY useful for customizing the tick marks on any axis (with the help of ylim as well). I’ll cut and paste some code directly, and hopefully you can get the gist (understanding that the code snippet is very much out of context):

par(mar=c(5, 5, 4, 4) + 0.1)

ylimit = c(0, (max(eventbymon$eventcount.sum))+0.25*(max(eventbymon$eventcount.sum)))

plot(eventbymon$Eligibility.Date, eventbymon$eventcount.sum, type="l", yaxt="n", ylab="", lty=3, ylim=ylimit, xlab="Month", cex.lab=1)
axis(4, at=pretty(ylimit))
mtext("Events(n)", side=4, line=3, cex=1, las=0)

title(main=paste(pers,"Events by Month for", coh, "for",year), cex.main=1.1)


ylimit = c(min(eligbymon$Eligibility.Date.length)/1.05, max(eligbymon$Eligibility.Date.length))
plot(eligbymon$Eligibility.Date, eligbymon$Eligibility.Date.length, type="l", ylim=ylimit, ylab="Eligibility (n)", xlab="", cex.lab=1)

legend("bottomright",legend=c("Eligibility","Events"),lty=c(1,3), cex=0.85, inset=.025)
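Since the snippet above depends on my own data frames, here is a self-contained sketch of the same idea that should run on its own. The months, events and elig_n vectors are made-up placeholders, and the par(new=TRUE) call is an assumption about how two plots like these get overlaid on one device; treat it as an illustration rather than the exact code from my project.

```r
# Made-up monthly data; these names and numbers are placeholders only.
months <- seq(as.Date("2009-01-01"), by = "month", length.out = 12)
events <- c(12, 18, 9, 22, 30, 27, 15, 11, 19, 25, 21, 14)
elig_n <- c(980, 1010, 1005, 1040, 1100, 1080, 1060, 1055, 1090, 1120, 1115, 1130)

# Leave extra room on the right for the second y-axis and its label.
par(mar = c(5, 5, 4, 4) + 0.1)

# First series: pad the upper limit by 25%, suppress the default y-axis,
# then draw the axis on the right with tick positions chosen by pretty().
ylim_events <- c(0, 1.25 * max(events))
ticks_events <- pretty(ylim_events)
plot(months, events, type = "l", lty = 3, yaxt = "n", ylab = "",
     xlab = "Month", ylim = ylim_events)
axis(4, at = ticks_events)
mtext("Events (n)", side = 4, line = 3)

# Second series drawn over the first (assumption: par(new = TRUE) overlay).
# The x-axis is suppressed here so it is not drawn twice.
par(new = TRUE)
ylim_elig <- c(min(elig_n) / 1.05, max(elig_n))
plot(months, elig_n, type = "l", ylim = ylim_elig, xaxt = "n",
     ylab = "Eligibility (n)", xlab = "")

legend("bottomright", legend = c("Eligibility", "Events"),
       lty = c(1, 3), cex = 0.85, inset = 0.025)
```

The point of passing the same limits to both ylim and pretty() is that the tick marks are computed from the padded range rather than from the raw data, so the right-hand axis stays in step with the plotting region.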