Today, I came across a post from the ‘What you’re doing is rather desperate’ blog that dealt with a common issue, and something I deal with on an (almost) daily basis. It is, in fact, such a common issue that I have a script that does all the work for me, and it was good to dive back in for a refresh of something I wrote quite a while ago.
N. Saunders posts a much cleaner solution than mine, but mine avoids the issues that can arise when you have non-unique values as maximums (or minimums). My solution also avoids the merge() function, which, in my experience, can be a memory and time hog. See below for my take on solving his issue.
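To illustrate the point about non-unique maximums, here is a minimal sketch (my own toy example, not code from either post) of the kind of aggregate()-plus-merge() pattern being compared against: when a group's maximum is tied, the merge returns every tied row, so that group appears more than once in the result.

## Illustrative only -- not from the original post
df <- data.frame(vars = c('A', 'A', 'B'), obs1 = c(6, 6, 7))  ## note the tie in group A
maxes <- aggregate(obs1 ~ vars, data = df, FUN = max)         ## one maximum per group
merged <- merge(df, maxes)                                    ## keeps BOTH tied 'A' rows
merged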
## First let's create some data (and inject some gremlins)
df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = c(1:10), obs2 = c(11:20))
df.orig <- rbind(df.orig, data.frame(vars = 'A', obs1 = 6, obs2 = 15)) ## create some ties
df.orig <- rbind(df.orig, data.frame(vars = 'A', obs1 = 6, obs2 = 16)) ## more ties
## my solution requires that you order your data first
df.orig <- df.orig[order(df.orig$vars, df.orig$obs1, df.orig$obs2), ]
## the row names get scrambled by order(), so we need to re-establish some neatness
row.names(df.orig) <- seq(1, nrow(df.orig))
x1 <- match(df.orig$vars, df.orig$vars)
index <- as.numeric(tapply(row.names(df.orig), x1, FUN = tail, n = 1)) ## here's where the magic happens
df.max <- df.orig[index, ]
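As a quick sanity check (my addition, not part of the original post), each row of df.max should carry its group's maximum of obs1, which you can confirm against a plain tapply():

## Not from the original post: verify that df.max holds the group maxima of obs1
group.max <- tapply(df.orig$obs1, df.orig$vars, max)      ## named vector, one max per group
all(df.max$obs1 == group.max[as.character(df.max$vars)])  ## should print TRUE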