Using data.table to make quick work of limiting a dataset by the first observation of a set of values by subject…

meep meep this!!!!!!

meep meep this!!!!!!

Today I ran into an issue where I had to select the first instance of an observation where the counting of a first instance depended upon >1 variable. It was essentially a play on an issue that came up during a previous post (Picking one observation per ‘subject’ based on max (or min)), but in the previous case each instance could be selected by examining one variable.

As an example, lets assume that you’ve got a medical claims data set where there are multiple lines per claim and multiple claims per patient; however, you wanted to retain only the first complete claim line for each unique Patient ID and Claim ID combination.

The solution is implemented using Matthew Dowle’s data.table package. I’m trying to make a concerted effort to replace (where I can) data.tables for data.frame data structures (when I remember to do so). This is mainly to the speed advantage (upon data manipulation, sorting, merging) you can obtain by doing this. Also, I think a mastery of the data.table syntax will lend itself to very efficient and compact code for data manipulation. In particular there are handy functions that are part of this syntax (like the .SD or subset of data.table), that can do nifty things in one line that would otherwise take many lines to accomplish.

DT = data.table(Patient=c(5:1, 5:1, 5, 5),Claim=c(10:1, 10, 8), data=rnorm(12))
DT # before 
setkey(DT, Patient)          # re-orders table and marks it sorted.
DT # after
tables()              # KEY column reports the key'd columns
keycols = c("Patient","Claim", "data") ## now data are sorted by Patient, Claim then data
DT # after (notice it's easy to see now how Patient 5 has multiple lines for Claim 10, you only want one)
tables()              # KEY column reports the key'd columns
DT1 <- DT[ , .SD[which.min(Claim), ], by=list(Patient, Claim)]  
DT1 # now you only have one row for each unique Claim, Patient combo (its the also the min of the data var by PatientID+ClaimID)

As a rif on this theme I also stumbled on some interesting alternatives that may actually run even faster in this post.


An excellent place for quotes about statistics/data science

This is how I picked up my wife... guaranteed to work 60% of the time every time!

This is how I picked up my wife… guaranteed to work 60% of the time every time!

I recently had the occasion to apply George E.P. Box’s famous quote “…all models are wrong, but some are useful.” in a work discussion, and it got me thinking about all those other great and insightful quotes that come in handy (E-mails, presentations, etc.,). So I went on a short search and stumbled on this most excellent list from a stackexchange post. I also stumbled on this list (by following a link from R-bloggers).