Evaluating across many columns within the same row to extract the first occurrence of a string

Does your job make you feel like this sometimes?? Yup, me too!

Does your job make you feel like this sometimes?? Yup, me too!

Some time ago I was tasked with looking at a set of claims data to pull only claim lines that had a certain diagnosis, and even more importantly, if there were claims where multiple diagnoses were listed, I needed to pull only the “most important” (let’s just call it that). It is not the purpose of this post to debate the consideration of attributing a weighting to the importance of a diagnoses based on the order of appearance in the claim files after the first position, so let’s not even go there. This is just more about applying a method that may have application in a more broader context.

For those of you who may not be familiar, the claim files that I had listed (for each Patient and Service Date) a series of fields indicating different diagnoses (e.g., Principal Diagnosis, Diagnosis Code 2, Diagnosis Code 3). The condition that I was tasked with looking into was made up of a composite of diagnosis codes, and in some cases a patient may have one of the diagnoses codes (making up the composite) in position #1, and another in position #2 (or even position #3). However, we did not want to count patients more than once for each claim line. Rather, we wanted to pull forward the first diagnosis (matching any in the composite set) as we “looked” at the set of diagnoses code columns, working our way down Diagnosis Code Principal, then Diagnosis code #2, and #3.

Enough of the background, it is better to just show you what is going on with some code…

One last note, I used a looping construct in this version of my solution. Recently, I think I’ve stumbled on a way to do the same with one of the apply family of functions.

set.seed(1234)
options(stringsAsFactors=FALSE)

# simulate some data
Diagnosis_Code_Principal <- sample(c("heart attack", "unstable angina", "broken arm", "stomach ache", "stubbed toe", "ingrown hair", "tooth ache"), 1000, replace=TRUE)
Diagnosis_Code_02 <- sample(c("heart attack", "unstable angina", "broken arm", "stomach ache", "stubbed toe", "ingrown hair", "tooth ache"), 1000, replace=TRUE)
Diagnosis_Code_03 <- sample(c("heart attack", "unstable angina", "broken arm", "stomach ache", "stubbed toe", "ingrown hair", "tooth ache"), 1000, replace=TRUE)
Service_Date_MMDDYYYY <- sample( seq.Date(as.Date("2011/1/1"), as.Date("2011/12/31"), by="day"), 1000, replace=TRUE)
Person_ID <- sample( c("Person1", "Person2", "Person3", "Person4"),1000, replace=TRUE)
eventdata <- data.frame(Person_ID, Service_Date_MMDDYYYY,Diagnosis_Code_Principal, Diagnosis_Code_02, Diagnosis_Code_03)
eventdata <- eventdata[ order(eventdata$Person_ID, eventdata$Service_Date_MMDDYYYY), ]

keydx <- c("heart attack|unstable angina") # these are the "codes" that make up our composite 

eventdata <- unique(eventdata)## duplicate lines are uninformative

eventdata$keep <- grepl((keydx), eventdata$Diagnosis_Code_Principal) | grepl((keydx), eventdata$Diagnosis_Code_02) | grepl((keydx), eventdata$Diagnosis_Code_03) ## more slimming down of the data
eventdata <- eventdata[ eventdata$keep == TRUE, ] 


##  creates a vector where the first match is returned from a set of columns string
firstdx <- rep(0, nrow(eventdata))
for(i in 1:nrow(eventdata)){
  a <- rep(0,3)
  a <- c(eventdata[i, "Diagnosis_Code_Principal"], eventdata[i, "Diagnosis_Code_02"], eventdata[i, "Diagnosis_Code_03"])## you can list as many columns as you like.
  if (any(grepl(keydx, a)))
    b <- a[grep(keydx, a)[1]] # keydx is listing of ICD9s that match dx grouping you are interested in
  if (!any(grepl(keydx, a))) ##  takes care of any cases where there is no match
    b <- NA      
  firstdx[i] <- b
}
eventdata$firstdx <- firstdx ## add the firstdx column to the dataset


Leave a comment