I was reading a helpful blog post about tapply (http://www.sigmafield.org/2009/09/20/r-function-of-the-day-tapply/) when a thought hit me: should I be using tapply in cases where I currently use summaryBy?
In much of my data manipulation and munging, I use the summaryBy function from the doBy package quite a bit. One downside of summaryBy, I find, is that it can be quite slow. So, coming back to my thought on tapply: what I have come to realize is that, generally, the closer you stick to base functions, the faster your code will run. Knowing that tapply is a base R function, I decided to run a speed test on a data frame with roughly 1.6 million observations to see what the performance difference might be. Below are my results:
system.time(tapply(elig$Eligibility.Date, elig[c("Eligibility.Date", "form")], FUN=length))
user system elapsed
45.34 0.00 45.39
system.time(summaryBy(~Eligibility.Date+form, data=elig, FUN=length))
user system elapsed
151.97 0.03 152.33
As you can see, tapply runs over 3x as fast (about 45 seconds elapsed versus about 152)! In addition, the format of the output is better suited to my particular application of these data. So, bottom line: try to stick with base functions when possible, and before reaching for summaryBy, try tapply.
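If you want to try the tapply side of this without my data, here is a minimal sketch. The elig data frame below is a small synthetic stand-in that I made up for illustration (random monthly dates and three form codes), so it is an assumption and not the real 1.6 million row dataset. Timings will differ by machine and data size. The cross-check against table() is only a sanity test, not part of the original comparison.

```r
# Synthetic stand-in for the elig data frame (hypothetical values,
# much smaller than the real data used in the post).
set.seed(1)
n <- 10000
dates <- seq(as.Date("2009-01-01"), by = "month", length.out = 12)
elig <- data.frame(
  Eligibility.Date = sample(dates, n, replace = TRUE),
  form = sample(c("A", "B", "C"), n, replace = TRUE)
)

# Same call shape as in the post: count rows for each
# Eligibility.Date / form combination, and time it.
timing <- system.time(
  counts <- tapply(elig$Eligibility.Date,
                   elig[c("Eligibility.Date", "form")],
                   FUN = length)
)
print(timing)

# tapply returns a matrix with one row per date and one column per form.
print(dim(counts))

# Sanity check: the counts should match a plain cross-tabulation.
check <- table(elig$Eligibility.Date, elig$form)
print(all(counts == check))
```

The matrix layout (dates down the side, forms across the top) is the output format I mentioned preferring. To compare against summaryBy on your own machine, wrap that call in system.time() the same way.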