# Shifting fill date to account for overlapping days supply for persistence analysis (looping if statement with vector indexing)

I’ve been haunted by a problem for some time now, and I knew it was only a matter of time before my developing awareness of the power of R data programming would reveal a steps towards a solution.  One troublesom intermediate step in the over-arching problem involves shifting the fill date for medications to account for overlapping fill date + days supply for previous medication fills.  It is an important intermediate step for the estimation of persistence, and I’ve been sitting on the SAS code that describes this intermediate for some time (thanks to Scott Leslie’s SUGI paper found here (http://wuss.org/proceedings08/08WUSS%20Proceedings/papers/anl/anl09.pdf).

As you can see, the SAS code makes use of a loop, and my problem with R was that my comfort for coding loops in R had not evolved to the point that I was able to work out the R-like translation….

My solution utilizes both a looping if statement and vector indexing.  The other important thing to mention is, to get the code to work properly over a complete set of claims data (by subject) you need to wrap the function within a plyr statement (to run the transformation by subject… I suppose you might also be able to wrap the function within a doBy function, I think there may be a transformBy????).  Anyhow, here is the code:

shiftday <- function(FD,DS){

x <- rep(0, length(FD))

VA <- FD

VB <- DS

for (i in 1:length(VA))

if (i == 1)

x[i] <- VA[i]

else {

if ((((x[i-1] + VB[i-1])) – VA[i]) <= 0)

x[i] <- VA[i]

if ((((x[i-1] + VB[i-1])) – VA[i]) > 0)

x[i] <- ((x[i-1] + VB[i-1])) – VA[i] +VA[i]

}

x

}

# Busy week with work travel, but I’ve got a lot on my mind that I want to blog about..

Just wanted to put make some observatons and put a pin into some things:

First, I heard Louise Yamada managing director of  Louise Yamada Technical Research Advisors LLC, speak on Bloomberg today (5/21/2012) about what might be some weakening in equities.  She sees this shortened week as one where there will be a bit of a rally off of last weeks losses, but by the end of the week, she doesn’t anticipate traders will want to hold US equities over the long weekend.  I thought it would be interested to watch and see if what she anticipates comes to fruition.  The main driver appears to be Euro weakening.  On another note, I heard also (same show) that Mohammed El-Erian, sees Greece’s departure from the Euro zone, which (in the short term) might feed into Louise’s prediction.

Now for some things I’ve been tinkering with in R.  I have been making some strides with using loops to handle an interesting task (creating a series of insightful plots based on a pre-determined vector of identifiers).  Within the piece of looping code, I have some interesting things going on with respect to customized x-axis labels and I’m working on customized export/naming of each plot as a png file.

Lastly, I did some reading in Adler’s R Cookbook (the machine learning section), this combined with my recent purchase of Conway and White’s Machine Learning for Hackers (http://www.amazon.com/Machine-Learning-Hackers-Drew-Conway/dp/1449303714), and I’m really building a great deal of interest in fining ways to apply ML techniques (specifically, MARS) to fit models in health care.  Simultaneously I’m still trying to work through the Matloffs, the Art of R Programming text (which still continues to demonstrate it’s worth every penny).

# Automating barplot formation for factors and character class variables in a dataframe

While analyzing data sets, I sometimes stop and wonder how a user-defined function might automate a task that I see myself carrying out frequently.  An example came up recently that involved something that I used to do with every new data set back when I used SPSS (I think it was the “Explore” function in SPSS…): visualize all the categorical data using barplots and the continuous data using histograms.

To that end, I started to wonder if I could use for to loop over an index within a dataframe to carry this out.  So I gave writing the loop a shot and failed miserably; however, I think I had enough of a well-formed attempt to post on stack overflow (http://stackoverflow.com/questions/10643841/simple-barplot-of-variables-in-df-using-a-loop).

My attempt looked like this:

y <- LETTERS[as.integer(rnorm(100, mean=5, sd=1))]

z <- LETTERS[as.integer(rnorm(100, mean=10, sd=1))]

x <- round(rnorm(100, mean=5, sd=2.5),2) data <- as.data.frame(cbind(x,y,z))

A<- do.call(“cbind”, lapply(data, class))

B<- as.vector( A[1,])

C <- grep(“character|factor”, B)

for (i in 1:length(C)) {   x <- C[i]   counti <- table(data[,x])   y <- barplot(counti, main=paste(“Barplot for var”, x))   return(y)}

Here’s hoping that those much smarter than me can come to my rescue…

# Failure to account for the “unknown”/hubris/lack of respect for error variance…It will get you every time, 99.9% of the time:JP Morgan Chase edition.

Recently an acquaintance posted out of frustration something that went like this: “How do you get people to understand that they don’t know what they don’t know…”  It comes back to me now as I heard this AM that Jamie Dimon, the banker that saved Wall Street (http://www.thedailybeast.com/newsweek/2009/09/10/the-banker-who-saved-wall-street.html), announced that his firm suffered a \$2 Billion dollar loss on synthetic derivatives, and could suffer an additional \$1 Billion loss in the next quarter or two (http://www.bloomberg.com/news/2012-05-11/jpmorgan-loses-2-billion-as-mistakes-trounce-hedges.html).

Personally, I don’t see this as an opportunity to vilify Mr. Dimon or the “too big to fail” banks.  I see this more as a cautionary tale about the misadventures that ensue when very smart people (perhaps the smartest on the planet), start to believe the stories other people write about them.  It’s about what happens when leaders substitute healthy skepticism and good judgement for overconfidence and fail to realize that self-serving bias is pouring sweet honey into thier ears(http://www.psychwiki.com/wiki/Self-serving_bias).  Or, maybe it is a tragic story about how unmanageable leviathans, like JP Morgan Chase, are?  Can any one man oversee and be held accountable for all the activity that takes place at a complex corporate giant?  Either way, I suppose the the moral is the same.

# Hadley Wickham, making the world a better place… one package at a time, plyr edition

It doesn’t take long for someone coming to R to realize that one of the jaw-dropping advantages to the R system is the wealth of user-developed packages that help simplify complex tasks and make statistical analysis possible for us mere mortals.  For the most part, from what I can gather, package developers do what they do without regard for any direct remuneration, therefore they deserve an enormous amount of respect and admiration.  My favorite such person this week is Hadley Wickham, the developer of the plyr package (among many others….ggplot2, stringr..  come to mind…), and is also on faculty with the department of statistics at Rice university.  If I could buy Hadley a beer (or other beverage of his choice), I most certainly would (perhaps there is the seed of a startup here?) for it is because of plyr that I was able to make quick work this week of a data munging task that would have otherwise required quite a bit more coding that it eventually did.  First I’ll provide a brief description of my problem, then I’ll outline how I sought my solution and how I was able to use plyr to get me to the Promised Land.

The problem:

I was working with a dataset that contained multiple observations per subject (medication refills), and I had to perform an analysis (by subject) where I took a value associated with the earliest record (by date) for each subject and compare it to the save data element for the latest record (by date) for the same subject.  So, for example, let’s say that for one subject in the data set (with 15 records spanning d1-dn days), I had to take the amount of \$ paid for an item in d1 and compare it to the \$ paid for an item in dn.  A task made simple with the application of the plyr package, which is great for breaking apart a dataset (by a grouping variable(s)) doing something to that data, then recombining the data.  Before I walk through how I used plyr performed my task, below you’ll find the stack overflow post that was the real key that unlocked my personal understanding of how to get plyr to attack my problem:

http://stackoverflow.com/questions/7985154/working-with-the-output-of-tapply-vs-ddply-plyr-in-r-subsets-of-unequal-le

Now my code:

```Memstats <- function(x) {

MinFD       = min(x\$FILL_DATE)

MaxFD       = max(x\$FILL_DATE)

MinFDcop    = x[which.min(x\$FILL_DATE), "COPAY"]

MaxFDcop    = x[which.max(x\$FILL_DATE), "newcopay"]

MinQTY      = x[which.min(x\$FILL_DATE), "QTY_Updated"]

MaxQTY      = x[which.max(x\$FILL_DATE), "QTY_Updated"]

return(as.data.frame(cbind(MinFD, MaxFD, MinFDcop, MaxFDcop, MinQTY, MaxQTY)))

}

library(plyr)

copayfirlastbymem <- ddply(data, 'Unique_Pt_ID', Memstats)

```

An important thing you need to understand (made obvious when examining the first section of code), is that the essence of the utility of plyr is that you feed it functions, which it applies to your chopped up data.  While base functions are nice, you really start to unlock the power of plyr when you feed it user defined functions.  When you do this, you need to also then understand that if your function is going to return >1 value, you need to make sure that it returns as class=data.frame.  You will see also that I use indexing within the function to return the values of interest (corresponding to the minimum and the maximum dates by subject) using the which.min and which.max function.

Now for the punchline: the plyr syntax.  What is really cool about the plyr syntax is that plyr does what it does with just one line of code.  The dd in “ddply” indicates that the input is a dataframe and the output I desire is also a data frame (there are other options here, e.g., dlplyr), I indicate my “source” data frame (“data”), then I indicate what my grouping variable is (in this case ‘Unique_Pt_ID’), and lastly I feed ddply a function (Memstats).  Beautiful, simple…

# A second note on loops… more looping with lists

As I continue to plod through Matloff’s very appropriate (for where I personally happen to be in my journey with R) text (The Art of R Programming: A Tour of Statistical Software Design), I came upon another piece of looping code.  I wanted to take some time to quickly jot down my thoughts and parse my way through each step.

This code does something that I imagine would be an important data munging step for a spam filtering application or any application that does text analysis.  It takes text input (that you need to first convert into a vector of strings) and then loops over said vector to count the occurrences of each word.  It is important to note, again, that I do realize that any experienced and proficient R programmer develops ways to tackle tasks like this without the use of looping constructs.  This is, for me, purely an academic exercise.

First, as always, the code (adapted From Chapter 3 of Matloff’s text):

findwords <- function(tf) {  # read in the words from the file

wl <- list()

for (i in 1:length(tf)) {

wrd <- txt[i]  # ith word in input file

wl[[wrd]] <- c(wl[[wrd]],i)

}

return(wl)

}

Let’s assume we are going to run the function on this string:

txt <- c(“now”, “is”, “the”, “the”, “time”, “or”, “time”, “now”, “is”, “all”)

The real interesting stuff happens, in my opinion, after the line of code with the for loop.  Let’s start at the beginning (wrd <- txt[i]), and do a bit of slow cognitive mastication.  So the for loop above starts with 1, and then loops from 1 to the last indexed item in the text string (length(tf)).  So, for example, during the first iteration wrd will hold the value of the 1st item in the character vector (in txt it wold be “now”).

Holding that thought, we move to the next line of code (wl[[wrd]] <- c(wl[[wrd]],i) ).  This is where the magic happens.  We are telling the loop here to fill our list (wl) that we created above with stuff.  The first interesting thing going on–to the left of the assignment operator “<-“– is that we are creating names for items in the list, and the names for the items are going to come from the text vector we are looping over.  In the case of the first iteration we are holding wrd <- “txt” so the first item in the list will be called “txt”.   The second interesting thing going on–to the right of the assignment operator “<-“– is a concatenation that “builds” a vector adding an item for each iteration of the loop.  What exactly is being added is the index (within txt) of the value being actively held in the loop (in the case of iteration #1 = “now”).  So for iteration #1, the loop creates the list item wl\$now, and fills it with NULL, 1, which ends up creating a single item vector wl\$now <- 1.

The other cool thing that you need to keep in mind is that when the loop encounters a text value that it has already created a list item name for (for example take item #7 in txt, another instance of “now”), it concatenates the vector wl\$now that you already created (in the first iteration of the loop).  So, for example, after the 7th iteration wl\$now <- c(1,7)!

The of course you need to have the loop spit out the list you created wl (return(wl)).

Very cool!

# A novice’s first note on loops in R

As someone who was trained in the point-and-click ways of SPSS, and is very much still new to R data programming, looping concepts in general and more specifically in R have always been something I’ve approached with equal parts fear and anxiety.  That said, I must also add that I have become very much aware that an aversion to looping constructs, and skill at substituting vectorized code for such, is something that any seasoned R programmer develops a proficiency for.  Even still, I feel that it is an important part of my development as a data programmer, that I understand loops and more importantly rid myself of the irrational fear I have of them.

Just in case any others out there may come from the same place as I have, I wanted to take an example loop that I recently came across as I was reading Matloff’s The Art of R Programing (http://www.amazon.com/The-Art-Programming-Statistical-Software/dp/1593273843) and walk through it (if you haven’t yet purchased the text, I would highly recommend you do so).  I will say that Matloff does briefly walk through the code in the text; however, I found that I needed to be spoon fed more directly what the for loop was actually doing.  Lastly, I want to make it clear that I am not a programmer by training, so forgive any improper usage of nomenclature(you have been warned).  First here is the entire piece of code (the dataset can be obtained from the publishers website):

names(aba) <- c(“Gender”,”Length”,”Diameter”,”Height”,”WholeWt”,”ShuckedWt”,”ViscWt”,”ShellWt”,”Rings”)

grps <- list()

for (gen in c(“M”,”F”)) grps[[gen]] <- which(aba[,1]==gen)

abam <- aba[grps\$M,]

abaf <- aba[grps\$F,]

plot(abam\$Length,abam\$Diameter) plot(abaf\$Length,abaf\$Diameter,pch=”x”,new=FALSE)

Let’s just focus on the piece of code that contains the actual for loop:

for (gen in c(“M”,”F”)) grps[[gen]] <- which(aba[,1]==gen)

1) Let’s talk about ( for (gen in c(“M”,”F”))  ).  For a novice like me, I found it pretty cool that R is looping here over a character vector ( c(“M”, “F”)), so essentially what the for loop is doing here is it assigns an object gen and loops gen over 2 values “M” and “F”.  So, spoon feeding this, gen takes on two values over the course of the loop, first “M” then “F”.

2) Next, we are describing what we want the for loop to do, as it is, well, looping.   That is what we see in the next statement.  Here we see that we are giving the loop the list grps and within grps we are asking the loop to fill the list with stuff (that stuff being 2 vectors of indices that will be generated by the which() function).  What is cool about this is that we are taking advantage of the character vector we are looping over to create the names of the vectors in the list, specifically “M”, then “F”.  We use list indexing with the object gen (grps[[gen]]) to create as many items in the list as there are values we are looping over (in this case 2).

3) Lastly, and really sequentially following step 2 above,  the loop fills the list item you just created with stuff: a vector of indices.

VIOLA!  That’s it.

The only thing that I’ll have to add is that after studying the code, I wonder why Matloff just didn’t do it this way:

abaM <- aba[aba\$Gender == “M”] ## and then

abaF <- aba[aba\$Gender == “F”]

avoiding the loop entirely?