Using data.table to make quick work of limiting a dataset by the first observation of a set of values by subject…

Today I ran into an issue where I had to select the first instance of an observation where the counting of a first instance depended upon >1 variable. It was essentially a play on an issue that came up during a previous post (Picking one observation per ‘subject’ based on max (or min)), but in the previous case each instance could be selected by examining one variable.

As an example, lets assume that you’ve got a medical claims data set where there are multiple lines per claim and multiple claims per patient; however, you wanted to retain only the first complete claim line for each unique Patient ID and Claim ID combination.

The solution is implemented using Matthew Dowle’s data.table package. I’m trying to make a concerted effort to replace (where I can) data.tables for data.frame data structures (when I remember to do so). This is mainly to the speed advantage (upon data manipulation, sorting, merging) you can obtain by doing this. Also, I think a mastery of the data.table syntax will lend itself to very efficient and compact code for data manipulation. In particular there are handy functions that are part of this syntax (like the .SD or subset of data.table), that can do nifty things in one line that would otherwise take many lines to accomplish.

library(data.table)
DT = data.table(Patient=c(5:1, 5:1, 5, 5),Claim=c(10:1, 10, 8), data=rnorm(12))
DT # before 
setkey(DT, Patient)          # re-orders table and marks it sorted.
DT # after
tables()              # KEY column reports the key'd columns
key(DT)
keycols = c("Patient","Claim", "data") ## now data are sorted by Patient, Claim then data
setkeyv(DT,keycols) 
DT # after (notice it's easy to see now how Patient 5 has multiple lines for Claim 10, you only want one)
tables()              # KEY column reports the key'd columns
key(DT)
DT1 <- DT[ , .SD[which.min(Claim), ], by=list(Patient, Claim)]  
DT1 # now you only have one row for each unique Claim, Patient combo (its the also the min of the data var by PatientID+ClaimID)

As a rif on this theme I also stumbled on some interesting alternatives that may actually run even faster in this post.

An excellent place for quotes about statistics/data science

I recently had the occasion to apply George E.P. Box’s famous quote “…all models are wrong, but some are useful.” in a work discussion, and it got me thinking about all those other great and insightful quotes that come in handy (E-mails, presentations, etc.,). So I went on a short search and stumbled on this most excellent list from a stackexchange post. I also stumbled on this list (by following a link from R-bloggers).

R trivia question of the day: What is the default axis buffer R places on y- and x-axes in base scatterplots?

When creating more complex scatter plots (especially when building multiple plots on the same set of axes using par(new=T)), I often set the ylim myself to some custom value; however, you will find that even in these cases base plot in R will add a “buffer” to the upper and lower limits of your y-axes. I’ve always wondered just EXACTLY what this buffer was, and after a little bit of searching, I found this very informative (and useful–due to the nifty yaxs=’i’ trick) post. Anyhow, the answer, it seems, is 4%.

You shall reap what you sew… on reproducible research and taking the plunge with S-weave and knitr

After reading a lot about reproducible research with R and kntir (and Sweave/LaTeX and R Markdown), I have finally made the commitment to buckle my chin strap and get after it. Knitting my first “minimal” *.Rnw file was easy (thanks to the help of Riccardo Klinger’s most excellent example). In addition, to get quickly up to speed on the coding conventions of LaTeX, Tobi Oetikers document found here will give you just enough to be dangerous–in a good way).

After producing my first report, I quickly ran up against an issue that proved to be a major stumbling block for me. In my work, I find that I often have to produce reports for multiple clients (from the same db) and in those reports, I’m looping over many variables to produce the same plot (e.g., for example displaying trend of units sold by month over all the products in a clients inventory). I did what I normally do in those cases, produce a toy example of my problem and take it to stack overflow, and without fail, my solution (actually 2) were contributed kindly by Ben and Brian Diggs. The recipients of my virtual beer (or other beverage of their choice award) of the week!

As I had the most success with Brian’s code, I offer it up below! Cheers to you both, Brian and Ben!

##  Example *.Rnw File
\documentclass[10pt]{article}
\usepackage[margin=1.15 in]{geometry}
<<loaddata, echo=FALSE, message=FALSE>>=
subgroup <- df[ df$Hospital == hosp,]
@

\begin{document}
<<setup, echo=FALSE >>=
  opts_chunk$set(fig.path = paste("test", hosp , sep=""))
@

Some infomative text about hospital \Sexpr{hosp}

<<plots, echo=FALSE >>=
  for(ward in unique(subgroup$Ward)){
    subgroup2 <- subgroup[subgroup$Ward == ward,]
    #     subgroup2 <- subgroup2[ order(subgroup2$Month),]
    savename <- paste(hosp, ward)
    plot(subgroup2$Month, subgroup2$Outcomes, type="o", main=paste("Trend plot for", savename))
  }
@
\end{document}

## example R file that processes the *.Rnw
##  make my data
Hospital <- c(rep("A", 20), rep("B", 20))
Ward <- rep(c(rep("ICU", 10), rep("Medicine", 10)), 2)
Month <- rep(seq(1:10), 4)
Outcomes <- rnorm(40, 20, 5)
df <- data.frame(Hospital, Ward, Month, Outcomes)

## knitr loop
library("knitr")
for (hosp in unique(df$Hospital)){
  knit2pdf("testingloops.Rnw", output=paste0('report_', hosp, '.tex'))
}


gdata for importing .xlsx files on 64 bit Win running 64 bit R

While in most instances it is a much better bet to convert excel data into a *.csv file and use read.csv, there are the rare times when it may actually make sense to grab the data directly out of excel.  This is the case (for example) when you have a client who sends you multiple excel files (with multiple sheets per file) that need to be imported.

There are many different packages that do this; however, I kept running into issues as certain packages are only compatible with 32 bit R (or 32 bit Windows), and I run a 64-64 set up.  My solution, the gdata package.  While it is not very fast (esp for large excel files), you can accomplish the above task with gdata’s read.xls function.  In my case the code looks like this:

library(gdata)
DF <- read.xls(xls="~/yourpath/filename.xlsx",sheet=1)
View(DF)

I plan on using a loop (with assign()) to pull in data from 3 sheets in each of many xls files and create the correct names for all of them in R. A word of warning. Each of my sheets contained around 60K rows, and read.xls took quite a bit of time to run an import, so plan accordingly if you are dealing with big files. In my case, system.time() told me that my import of ~65K rows took around 3.5 minutes.

BTW, while this does not warrant a full post, I just recently discovered the function file.choose() for having R create the file path for you (if you happen to be too lazy to type in an actual file name)… or, to put my code above another way, (and if don’t care about looping) you can do something like this:

library(gdata)
DF <- read.xls(xls=file.choose(),sheet=1)
View(DF)

Take it to the limit…. with usr()

Recently I was running a script that produced a time series plot and needed to draw some horizontal reference lines that highlighted when certain programs were implemented. Usually this is a task handled quite efficiently with the abline() function. The problem was that, due to some custom formatting (user generated legend that was outside of plotting area by using xpd=TRUE in par()), when I called abline, the plotting device would draw the line through the entire plotting surface (e.g., the line was not confined to the plotting area).

So what I had to do was use the segment() function to draw my lines. The only problem was that I needed the precise upper and lower coordinates to feed to the segment command. Even though in my case I used a user defined ylim (max and min of y), the plot adds a little bit of a fudge-factor to these upper and lower limits. Enter the usr command. My answer came in the form of this very useful R help posting. The usr option to par() returns a vector of the exact corner limit values of your plotting area (very handy!).

Here is another nice post about the ‘usr’ option to par.

Here’s how it all came together.


##  model of my problem
par(xpd=TRUE)
x <- stats::runif(12); y <- stats::rnorm(12)
i <- order(x, y); x <- x[i]; y <- y[i]
plot(x, y, main = "Notice how the line extends beyond plot margins")
abline(v=0.4)

##  my solution
par(xpd=TRUE)
x <- stats::runif(12); y <- stats::rnorm(12)
i <- order(x, y); x <- x[i]; y <- y[i]
plot(x, y, main = "Notice how the line extends beyond plot margins")
u <- par("usr")
segments(0.4, u[3], 0.4, u[4])

Picking one observation per ‘subject’ based on max (or min)

Today, I came across a post from the ‘What you’re doing is rather desperate’ blog that dealt with a common issue, and something that I deal with on (almost) a daily basis. It is, in fact, so common an issue that I have a script that does all the work for me and it was good diving back in for a refresh of something I wrote quite a bit ago.

N.Saunders posts a much cleaner solution than mine, but mine avoids this issues that can arise when you have non-unique values as maximums (or minimums). Plus my solution avoids the use of the merge() function which, in my experience can sometimes be a memory and time hog. See below for my take on solving his issue.

## First lets create some data (and inject some gremlins)
df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = c(1:10), obs2 = c(11:20))
df.orig <- rbind(df.orig, data.frame(vars = 'A', obs1 = '6', obs2 = '15'))  ##  create some ties
df.orig <- rbind(df.orig, data.frame(vars = 'A', obs1 = '6', obs2 = '16'))  ##  more ties


df.orig <- df.orig[order(df.orig$vars, df.orig$obs1, df.orig$obs2),]  ##  my solution requires that you order your data first
row.names(df.orig) <- seq(1,nrow(df.orig))  ##  since the row.names get scrambled by the order() function we need to re-establish some neatness
x1 <- match(df.orig$vars, df.orig$vars)
index <- as.numeric(tapply(row.names(df.orig), x1, FUN=tail, n=1)) ## here's where the magic happens
df.max <- df.orig[index,]