Why you should try tapply before summaryBy

I was reviewing a helpful blog post on tapply (http://www.sigmafield.org/2009/09/20/r-function-of-the-day-tapply/) when a thought hit me: should I be using tapply in cases where I currently use summaryBy?

In much of my data manipulation and munging, I use the summaryBy function from the doBy package quite a bit. One downside of summaryBy, I find, is that it can be quite slow. That brings me back to tapply. I have come to realize that, generally, the closer you stick to base functions, the faster your code will run. Since tapply is a base R function, I ran a speed test on a data frame with roughly 1.6 million observations to see what the performance difference might be. Below are my results:

> system.time(tapply(elig$Eligibility.Date, elig[c("Eligibility.Date", "form")], FUN=length))
   user  system elapsed
  45.34    0.00   45.39
> system.time(summaryBy(~Eligibility.Date+form, data=elig, FUN=length))
   user  system elapsed
 151.97    0.03  152.33

As you can see, tapply runs over 3x as fast (about 45 seconds elapsed versus about 152)! In addition, the format of the output better suits my particular application of these data. So, bottom line: try to stick with base functions when possible, and before reaching for summaryBy, try tapply.
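
If you want to try the tapply pattern without the original data, here is a minimal sketch on synthetic data. The data frame elig_demo, its size, and its values are made up for illustration; only the column names and the shape of the tapply call follow the test above. Timings on your machine will differ from mine, so treat this as a way to see the output format rather than as a benchmark.

```r
# Synthetic stand-in for the `elig` data frame (hypothetical data, not the original).
set.seed(1)
n <- 10000
elig_demo <- data.frame(
  Eligibility.Date = sample(
    seq(as.Date("2009-01-01"), by = "month", length.out = 12),
    n, replace = TRUE
  ),
  form = sample(c("A", "B", "C"), n, replace = TRUE)
)

# Count rows in each (Eligibility.Date, form) cell, same call shape as above.
counts <- tapply(
  elig_demo$Eligibility.Date,
  elig_demo[c("Eligibility.Date", "form")],
  FUN = length
)

# The result is a matrix: one row per date, one column per form.
print(dim(counts))
print(head(counts))

# Time the same call on the synthetic data.
timing <- system.time(
  tapply(elig_demo$Eligibility.Date,
         elig_demo[c("Eligibility.Date", "form")],
         FUN = length)
)
print(timing)
```

Note that tapply returns a wide matrix of counts, with NA (not 0) in any date/form combination that has no rows, whereas summaryBy returns a long data frame. Which shape is more convenient depends on what you do with the result next.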

