It doesn’t take long for someone coming to R to realize that one of the jaw-dropping advantages of the R system is the wealth of user-developed packages that simplify complex tasks and make statistical analysis possible for us mere mortals. For the most part, from what I can gather, package developers do what they do without any direct remuneration, so they deserve an enormous amount of respect and admiration. My favorite such person this week is Hadley Wickham, the developer of the plyr package (among many others; ggplot2 and stringr come to mind), who is also on the faculty of the department of statistics at Rice University. If I could buy Hadley a beer (or other beverage of his choice), I most certainly would (perhaps there is the seed of a startup here?), for it is because of plyr that I was able to make quick work this week of a data munging task that would otherwise have required quite a bit more coding than it eventually did. First I’ll provide a brief description of my problem, then I’ll outline how I sought my solution and how I was able to use plyr to get me to the Promised Land.
The problem:
I was working with a dataset that contained multiple observations per subject (medication refills), and I had to perform an analysis (by subject) where I took a value associated with the earliest record (by date) for each subject and compared it to the same data element for the latest record (by date) for the same subject. So, for example, let’s say that for one subject in the data set (with 15 records spanning days d1 through dn), I had to take the amount of $ paid for an item on d1 and compare it to the $ paid for an item on dn. This is a task made simple by the plyr package, which is great for breaking apart a dataset (by one or more grouping variables), doing something to each piece, then recombining the results. Before I walk through how I used plyr to perform my task, below you’ll find the Stack Overflow post that was the real key that unlocked my personal understanding of how to get plyr to attack my problem:
Now my code:
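# For one subject's records, return the first and last fill dates
# along with the copay and quantity recorded on each of those dates.
Memstats <- function(x) {
  MinFD    <- min(x$FILL_DATE)
  MaxFD    <- max(x$FILL_DATE)
  MinFDcop <- x[which.min(x$FILL_DATE), "COPAY"]
  MaxFDcop <- x[which.max(x$FILL_DATE), "newcopay"]
  MinQTY   <- x[which.min(x$FILL_DATE), "QTY_Updated"]
  MaxQTY   <- x[which.max(x$FILL_DATE), "QTY_Updated"]
  # data.frame() keeps each column's own type; going through cbind() first
  # would coerce the dates and the numbers to a single common type.
  return(data.frame(MinFD, MaxFD, MinFDcop, MaxFDcop, MinQTY, MaxQTY))
}

library(plyr)

# Split `data` by subject, apply Memstats to each piece, recombine into one data frame.
copayfirlastbymem <- ddply(data, 'Unique_Pt_ID', Memstats)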
An important thing you need to understand (made obvious when examining the first section of code) is that the essence of plyr’s utility is that you feed it functions, which it applies to your chopped-up data. While base functions are nice, you really start to unlock the power of plyr when you feed it user-defined functions. When you do this, you also need to understand that if your function is going to return more than one value, it should return them as a data frame. You will also see that I use indexing within the function to return the values of interest (corresponding to the minimum and the maximum dates by subject) using the which.min and which.max functions, each of which gives the row position of the earliest or latest date.
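To make the indexing trick concrete, here is a minimal sketch on a single made-up subject. The data values and the names one_subject, first_row, last_row and first_last are my own inventions for illustration; only FILL_DATE and COPAY echo the column names used above.

```r
# Toy refill records for one subject; the values are invented for illustration.
one_subject <- data.frame(
  FILL_DATE = as.Date(c("2012-03-10", "2012-01-05", "2012-06-20")),
  COPAY     = c(15, 10, 25)
)

# which.min()/which.max() return the row position of the earliest/latest date,
# which can then be used to pull any other column from that same row.
first_row <- which.min(one_subject$FILL_DATE)
last_row  <- which.max(one_subject$FILL_DATE)

# Return everything as a one-row data frame so plyr can stack the pieces later.
first_last <- data.frame(
  MinFD    = one_subject$FILL_DATE[first_row],
  MaxFD    = one_subject$FILL_DATE[last_row],
  MinFDcop = one_subject[first_row, "COPAY"],
  MaxFDcop = one_subject[last_row, "COPAY"]
)
print(first_last)
```

Note that the sketch builds its result with data.frame() directly, which keeps the dates as dates; passing mixed columns through cbind() first would coerce them to one common type.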
Now for the punchline: the plyr syntax. What is really cool about plyr is that it does what it does with just one line of code. The “dd” in “ddply” indicates that the input is a data frame and that the output I desire is also a data frame (there are other options here, e.g., dlply, which takes a data frame and returns a list). I indicate my “source” data frame (“data”), then I indicate what my grouping variable is (in this case ‘Unique_Pt_ID’), and lastly I feed ddply a function (Memstats). Beautiful, simple…
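For anyone who wants to try the call without my data, here is a small sketch on a made-up two-subject dataset. The refills data frame, its values, and the helper first_last_copay are hypothetical stand-ins rather than my real code; ddply and dlply are the actual plyr functions.

```r
library(plyr)

# Made-up refill data for two subjects (values are illustrative only).
refills <- data.frame(
  Unique_Pt_ID = c("A", "A", "A", "B", "B"),
  FILL_DATE    = as.Date(c("2012-01-05", "2012-03-10", "2012-06-20",
                           "2012-02-01", "2012-04-15")),
  COPAY        = c(10, 15, 25, 30, 20)
)

# Hypothetical helper: copay on the earliest and latest fill for one subject.
first_last_copay <- function(x) {
  data.frame(
    first_copay = x[which.min(x$FILL_DATE), "COPAY"],
    last_copay  = x[which.max(x$FILL_DATE), "COPAY"]
  )
}

# dd: data frame in, data frame out (one row per subject here).
by_subject_df <- ddply(refills, "Unique_Pt_ID", first_last_copay)

# dl: data frame in, list out (one element per subject).
by_subject_list <- dlply(refills, "Unique_Pt_ID", first_last_copay)

print(by_subject_df)
```

The only thing that changes between the two calls is the second letter of the function name, which is what makes the naming scheme so easy to remember.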
Thanks Hadley!