Using data.table to make quick work of limiting a dataset by the first observation of a set of values by subject…

meep meep this!!!!!!

meep meep this!!!!!!

Today I ran into an issue where I had to select the first instance of an observation where the counting of a first instance depended upon >1 variable. It was essentially a play on an issue that came up during a previous post (Picking one observation per ‘subject’ based on max (or min)), but in the previous case each instance could be selected by examining one variable.

As an example, lets assume that you’ve got a medical claims data set where there are multiple lines per claim and multiple claims per patient; however, you wanted to retain only the first complete claim line for each unique Patient ID and Claim ID combination.

The solution is implemented using Matthew Dowle’s data.table package. I’m trying to make a concerted effort to replace (where I can) data.tables for data.frame data structures (when I remember to do so). This is mainly to the speed advantage (upon data manipulation, sorting, merging) you can obtain by doing this. Also, I think a mastery of the data.table syntax will lend itself to very efficient and compact code for data manipulation. In particular there are handy functions that are part of this syntax (like the .SD or subset of data.table), that can do nifty things in one line that would otherwise take many lines to accomplish.

DT = data.table(Patient=c(5:1, 5:1, 5, 5),Claim=c(10:1, 10, 8), data=rnorm(12))
DT # before 
setkey(DT, Patient)          # re-orders table and marks it sorted.
DT # after
tables()              # KEY column reports the key'd columns
keycols = c("Patient","Claim", "data") ## now data are sorted by Patient, Claim then data
DT # after (notice it's easy to see now how Patient 5 has multiple lines for Claim 10, you only want one)
tables()              # KEY column reports the key'd columns
DT1 <- DT[ , .SD[which.min(Claim), ], by=list(Patient, Claim)]  
DT1 # now you only have one row for each unique Claim, Patient combo (its the also the min of the data var by PatientID+ClaimID)

As a rif on this theme I also stumbled on some interesting alternatives that may actually run even faster in this post.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s