Cleaning up unruly data – flattening a delimited column of composite string data

While there may be more  efficient ways of eating watermelon, there are none quite as entertaining...

While there may be more efficient ways of eating watermelon, there are none quite as entertaining…

I found myself dealing with some exceptionally unruly data that was part of a clinical table. There was something about my predicament that had me wondering why I had not stubbed my toe on such an issue earlier. The elegance of the solution–and the knowledge that this may come in handy again some time–prompted me to pen a quick post.

What I had was a table where there was one column that included multiple lookup values and those values were in a continuous string separated by commas. Such data is a nightmare to use in an analysis, but I can easily see how such a table would be created. This kind of a layout (entering string data in one cell, separated by commas) is intuitive to a clinician typing such a table into an excel spreadsheet.

To be specific, in my case, I was dealing with a table of drug and Jcode/HCPCS data where any one drug could correspond to multiple J/HCPCS codes. The column where the Jcodes/HCPCS codes resided contained cells that looked something like this: “ABC123, XYZ456, etc., etc.”. For my end product, I needed a table where each J/HCPCS code lived in it’s own cell next to the individual drug names (which could be listed multiple times).

I found this most excellent post on stackoverflow that led me out of my mess. Below I provide a toy example that illustrates how I implemented the solution myself.

Where the magic happens is in the use of sapply to return the individual lengths of each parsed string and how this vector of lengths is then part of a call to rep() using the vector of drug names. PURE.POETRY. I find it inspiring that just ONE LINE of code has the power to unmangle such a horrible mess. It highlights how useful list structures are and how they can be manipulated.

v1=c("drugA","drugB", "drugC")
v2=c("J1234, S5105", "J8499", "12345, J9999, X5555")
df = data.frame(drugs = v1 , hcpcs.jcodes = v2)

codes <- strsplit(df$hcpcs.jcodes, &quot;, &quot;)
data.frame(drugs = rep(df$drugs, sapply(codes, length)), jcodes.hcpcs = unlist(codes))#... magic happens here!