I found myself dealing with some exceptionally unruly data that was part of a clinical table. There was something about my predicament that had me wondering why I had not stubbed my toe on such an issue earlier. The elegance of the solution–and the knowledge that this may come in handy again some time–prompted me to pen a quick post.
What I had was a table where there was one column that included multiple lookup values and those values were in a continuous string separated by commas. Such data is a nightmare to use in an analysis, but I can easily see how such a table would be created. This kind of a layout (entering string data in one cell, separated by commas) is intuitive to a clinician typing such a table into an excel spreadsheet.
To be specific, in my case, I was dealing with a table of drug and Jcode/HCPCS data. With such data any one drug may correspond to multiple J/HCPCS codes. The column with Jcodes/HCPCS codes contained cells that looked something like this: “ABC123, XYZ456, etc., etc.”. For my end product, I needed a table where each individual drug could be listed multiple times, but I only needed ONE Jcode or HCPCS code per line.
I found this most excellent post on stackoverflow that led me out of my mess. Below I provide a toy example that illustrates how I implemented the solution myself.
Where the magic happens is in the use of sapply to return the individual lengths of each parsed string and how this vector of lengths is then part of a call to rep() using the vector of drug names. PURE.POETRY. I find it inspiring that just ONE LINE of code has the power to unmangle such a horrible mess. It highlights how much I have yet to learn about how list structures can be manipulated.
v1=c("drugA","drugB", "drugC") v2=c("J1234, S5105", "J8499", "12345, J9999, X5555") df = data.frame(drugs = v1 , hcpcs.jcodes = v2) df codes <- strsplit(df$hcpcs.jcodes, ", ") data.frame(drugs = rep(df$drugs, sapply(codes, length)), jcodes.hcpcs = unlist(codes))