Evaluating data by group and in rows (up/down vs. side/side) without looping: lapply is your friend

I recently had to deal with a “switch analysis” that involved identifying observations (within subject) where there was a change in the kind of medication a patient was taking, by class. To use a simple non-drug example, let’s say you were interested in purchasing habits, and you wanted to identify the exact transaction at which someone switched consumption habits (e.g., stopped buying blue widgets and started buying red ones). Things get even more interesting when you are really more interested in switching within sub-groups of widgets (e.g., within all red widgets, when did the consumer switch from red-small widgets to red-large widgets). This is a fairly trivial task if your transactions are laid out in columns, but my data set had its transactions laid out in rows (each row corresponding to a separate transaction). This required evaluating data across rows (a slightly more complex task).

I didn’t want to deal with a looping construct and the time disadvantage, so I had to find a way to vectorize the solution, and I did so using lapply. Then after putting something useful into a list, I used some code to recreate a dataframe from the lapply output (see here for a great primer on the apply set of functions). I’m sure that there is a more elegant solution to doing the last bit, and I’m open to any suggestions.

In this example I leave the steps pretty well laid out, and I have not done much to clean things up, to make things a bit easier to follow.

Note that in this dummy example I use GPI4 (high-level classifier) and GPI10 (lower-level classifier), and my intent is to isolate when a subject switches to a different GPI10-level agent within a GPI4. These are remnants of my particular use-case, as they are drug classifiers:


# make my data
data <- structure(list(ptid = c(101L, 101L, 101L, 101L, 101L, 102L, 102L,
102L, 102L, 103L, 103L, 103L, 103L), gpi4 = structure(c(2L, 2L,
2L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("", "A",
"B"), class = "factor"), gpi10 = c(1L, 1L, 2L, 1L, 2L, 1L, 1L,
2L, 2L, 2L, 1L, 1L, 2L), filldate = structure(c(2L, 3L, 4L, 5L,
6L, 2L, 3L, 4L, 5L, 2L, 3L, 4L, 5L), .Label = c("", "1/1/2012",
"2/1/2012", "3/1/2012", "4/1/2012", "5/1/2012"), class = "factor")), .Names = c("ptid",
"gpi4", "gpi10", "filldate"), row.names = c(NA, 13L), class = "data.frame")

# create a splitting factor
data$splitfac <- factor(paste(data$ptid, data$gpi4, sep="")) # create a factor to split-by
data <- data[ order(data$splitfac, data$filldate),]
# here is where the magic happens: for each ptid/gpi4 group, lapply creates a vector
# 'dragging down' the previous observation's gpi10 (NA for the first row of the group)
test <- lapply(split(data, data$splitfac), function(x) c(NA, x$gpi10[-nrow(x)]))
# this and the next few lines of code deal with re-creating a dataframe
test2 <- unlist(test, use.names = FALSE)
# repeat each group label once per row; trimming the index that unlist() appends to the
# names breaks for single-row groups (no index appended) and for groups with 10+ rows
test3 <- data.frame(splitfac2 = rep(names(test), sapply(test, length)), gpi10.m1 = test2)
data <- cbind(data, test3)
index <- which(data$gpi10 != data$gpi10.m1) # this gives the indices of the switches
data[index, ] # these are the switches
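
On the open question of a tidier way to get from the per-group lag back to a data frame: one possibility is base R's ave(), which applies a function within each group and returns a vector already aligned with the original rows, so the unlist/cbind step is not needed. The following is only a sketch under a few assumptions of mine, not the code used above: the data frame is rebuilt with plain character and Date columns instead of factors, the dates are assumed to be month/day/year, and the names grp and switches are made up for illustration.

```r
# Sketch of an alternative using ave(); column types and the date format are assumptions.
data <- data.frame(
  ptid  = c(101, 101, 101, 101, 101, 102, 102, 102, 102, 103, 103, 103, 103),
  gpi4  = c("A", "A", "A", "B", "B", "A", "A", "A", "A", "A", "B", "B", "B"),
  gpi10 = c(1, 1, 2, 1, 2, 1, 1, 2, 2, 2, 1, 1, 2),
  filldate = as.Date(c("1/1/2012", "2/1/2012", "3/1/2012", "4/1/2012", "5/1/2012",
                       "1/1/2012", "2/1/2012", "3/1/2012", "4/1/2012",
                       "1/1/2012", "2/1/2012", "3/1/2012", "4/1/2012"),
                     format = "%m/%d/%Y")  # assumed month/day/year
)

# sort so that "previous row" means "previous fill" within each ptid/gpi4 group
data <- data[order(data$ptid, data$gpi4, data$filldate), ]

# one grouping factor per ptid/gpi4 combination (hypothetical helper name),
# dropping combinations that never occur
grp <- interaction(data$ptid, data$gpi4, drop = TRUE)

# lag gpi10 by one row within each group; the first row of a group gets NA
data$gpi10.m1 <- ave(data$gpi10, grp, FUN = function(x) c(NA, head(x, -1)))

# rows where the current gpi10 differs from the previous one are the switches;
# which() silently drops the NA comparisons from each group's first row
switches <- data[which(data$gpi10 != data$gpi10.m1), ]
switches
```

Because ave() writes its result straight back into row order, there is no need to rebuild group labels from the names of the unlisted vector.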