“Munging” (rhymes with “lunging”) is the data professional’s slang for cleaning and reshaping data. Data munging is a broad subject, but here we’re talking about combining data.

Data combination can take several forms:

  • Merging – when you have some identifier(s) like MRN or subject ID that are common to multiple datasets, like “patient_insurance_info”, “patient_vitals”, and “patient_meds”.

  • Column binding – when you have new data that already lines up row-to-row and you need to just slide those new columns into place beside existing columns.

  • Row binding – when you have new data that already lines up column-to-column and you need to append data to the bottom of an existing dataset.

Let’s take each one in turn. You can follow along below, or download the complete code.

Merging

Merging data is like a Venn Diagram. Let’s take the example of our patient data. We have information about various patients in three tables (data frames in R):

  • patient_insurance_info
  • patient_vitals
  • patient_meds

You can have various levels of overlaps – some patients will exist in only one of these, others in two, others in all three. You can choose which overlap you want. Maybe it looks something like this?

venn

Let’s practice. First, I’ll generate some fake data fields. Don’t worry about this code, unless you want to learn how to generate fake data quickly!

mrns <- seq(11000, 99000, by=1787)  # gives us 50 fake MRNS
bp_systolic <- seq(80,190, by=5) # gives us fake systolic bps
bp_diastolic_diff <- seq(-50, -10, by = 2) # gives us how much less diastolic will be
temp <- seq(97,101, by = 0.1) # gives us fake temps
insurance <- c("uninsured", "Aetna", "Medicaid", "IBX", "Cigna") # gives us fake insurance
meds <- c("albuterol", "buspirone", "oxycodone", "atomoxetine", "amoxicillin", "ibuprofen", "acetaminophen", "cetirizine",
          "lansoprazole", "paroxetine", "methylphenidate", "glipizide") # gives us fake meds

And now some fake datasets. We have 50 MRNS, so I’ll put 40 of each in the three data frames. That ensures a bit of overlap. Again, this is not something you have to understand 100%, I’m just generating some data frames that contain fabricated data.

set.seed(42)
patient_insurance_info <- data.frame(mrn=sample(mrns, 40, replace=FALSE), 
                                     insurance=sample(insurance,40, replace=TRUE))
patient_vitals <- data.frame(mrn=sample(mrns, 40, replace=FALSE), 
                             systolic = sample(bp_systolic,40, replace=TRUE),
                             diastolic = NA_integer_,
                             temp = sample(temp, 40, replace=TRUE))
patient_vitals$diastolic <- patient_vitals$systolic + sample(bp_diastolic_diff, 40, replace = TRUE)
patient_meds <- data.frame(mrn=sample(mrns, 40, replace=FALSE),
                           meds = replicate(40, paste0(sample(meds, sample(0:7,1), replace=FALSE), collapse = ", ")))

Let’s peek:

head(patient_insurance_info)
##     mrn insurance
## 1 91415     Aetna
## 2 98563  Medicaid
## 3 34231 uninsured
## 4 80693     Cigna
## 5 62823  Medicaid
## 6 52101     Cigna
head(patient_vitals)
##     mrn systolic diastolic  temp
## 1 62823      120       106 100.8
## 2 23509      125        95  99.2
## 3 41379      145       129  99.4
## 4 64610      145       113  97.8
## 5 73545      160       116  99.1
## 6 55675      125        93  97.7
head(patient_meds)
##     mrn                                                          meds
## 1 16361                              albuterol, oxycodone, paroxetine
## 2 59249                                          oxycodone, ibuprofen
## 3 98563                                                              
## 4 27083                                         albuterol, paroxetine
## 5 55675 atomoxetine, amoxicillin, glipizide, albuterol, acetaminophen
## 6 48527                 buspirone, albuterol, amoxicillin, paroxetine

Doing the merge

To merge data, you need one or more columns that act as the “pivot” or “match” to line up the rest of the data. In our case, we just have one identifying column, the MRN. In other use cases, you might want to line up data on both the MRN and the encounter ID, or something like that.

You also need to know what overlap of data you want. Do you want only the patients that have data in all three tables? Or all patients who have any data at all? Or patients that have both vitals and meds, but insurance is optional?

We merge one table in at a time, so you can choose the overlap each time. Let’s start by merging so that we end up only with the patients that have all three kinds of data. That’s the default behavior of merge, so we don’t have to add any extra fields to indicate that.

merged_1 <- merge(patient_insurance_info, patient_meds)
merged_1 <- merge(merged_1, patient_vitals)

How many rows does merged_1 have? What does it look like?

nrow(merged_1)
## [1] 26
head(merged_1)
##     mrn insurance
## 1 11000 uninsured
## 2 16361 uninsured
## 3 18148     Cigna
## 4 21722     Aetna
## 5 27083       IBX
## 6 28870  Medicaid
##                                                                   meds
## 1                                 buspirone, lansoprazole, atomoxetine
## 2                                     albuterol, oxycodone, paroxetine
## 3                          albuterol, glipizide, ibuprofen, paroxetine
## 4                                           methylphenidate, glipizide
## 5                                                albuterol, paroxetine
## 6 lansoprazole, glipizide, paroxetine, albuterol, buspirone, oxycodone
##   systolic diastolic  temp
## 1      145       121  98.6
## 2      190       160  98.2
## 3       80        66  97.4
## 4       85        41 100.5
## 5      155       121  99.9
## 6      185       175  98.8

What if we wanted everyone, regardless of any missing data? I can set all=TRUE to say “hang on to all data, even if there’s no overlap”.

merged_2 <- merge(patient_insurance_info, patient_meds, all = TRUE)
merged_2 <- merge(merged_2, patient_vitals, all=TRUE)

How many rows does merged_2 have? What does it look like?

nrow(merged_2)
## [1] 49
head(merged_2)
##     mrn insurance                                        meds systolic
## 1 11000 uninsured        buspirone, lansoprazole, atomoxetine      145
## 2 14574     Cigna                                                   NA
## 3 16361 uninsured            albuterol, oxycodone, paroxetine      190
## 4 18148     Cigna albuterol, glipizide, ibuprofen, paroxetine       80
## 5 19935       IBX          glipizide, lansoprazole, ibuprofen       NA
## 6 21722     Aetna                  methylphenidate, glipizide       85
##   diastolic  temp
## 1       121  98.6
## 2        NA    NA
## 3       160  98.2
## 4        66  97.4
## 5        NA    NA
## 6        41 100.5

The “NA” here indicates missing data!

Finally, what if we want to say that patients have to have meds, but not vitals or insurance, to be included? Here we’ll set one dataset to be “x”, another to be “y”, and we can say that all.x=TRUE (keep all the x’s, but only keep y’s that also exist in x), or all.y=TRUE (similar idea).

merged_3 <- merge(x=patient_insurance_info, y=patient_meds, all.y = TRUE)
merged_3 <- merge(merged_3, patient_vitals, all=TRUE)

How many rows does merged_3 have? What does it look like?

nrow(merged_3)
## [1] 48
head(merged_3)
##     mrn insurance                                        meds systolic
## 1 11000 uninsured        buspirone, lansoprazole, atomoxetine      145
## 2 14574     Cigna                                                   NA
## 3 16361 uninsured            albuterol, oxycodone, paroxetine      190
## 4 18148     Cigna albuterol, glipizide, ibuprofen, paroxetine       80
## 5 19935       IBX          glipizide, lansoprazole, ibuprofen       NA
## 6 21722     Aetna                  methylphenidate, glipizide       85
##   diastolic  temp
## 1       121  98.6
## 2        NA    NA
## 3       160  98.2
## 4        66  97.4
## 5        NA    NA
## 6        41 100.5

Column binding

Let’s take one of our combined datasets, say merged_1 (26 patients), and column bind another set of columns to it. Let’s say we have a dataframe that includes a randomly assigned group, for a clinical trial, as well as some tracking numbers that will be assigned to each patient and included in barcodes on drug and biosamples. We don’t have to line this up with any field in the merged_1 dataset.

set.seed(42)
randomization <- data.frame(group = sample(c("Group 1", "Group 2", "Group 3"), 26, replace = TRUE),
                            drug_tracking_number = sample(c(2453:5213), 26, replace = FALSE),
                            biosample_tracking_number = sample(c(34523412:52341297), 26, replace = FALSE))
head(randomization)
##     group drug_tracking_number biosample_tracking_number
## 1 Group 3                 3530                  41623579
## 2 Group 3                 4952                  48504978
## 3 Group 1                 3686                  35217177
## 4 Group 3                 4758                  47865362
## 5 Group 2                 4486                  46591053
## 6 Group 2                 4688                  37574980

We can just line these up, side by side, like this:

[A] [B] -> [AB]

col_bound <- cbind(merged_1, randomization)
head(col_bound)
##     mrn insurance
## 1 11000 uninsured
## 2 16361 uninsured
## 3 18148     Cigna
## 4 21722     Aetna
## 5 27083       IBX
## 6 28870  Medicaid
##                                                                   meds
## 1                                 buspirone, lansoprazole, atomoxetine
## 2                                     albuterol, oxycodone, paroxetine
## 3                          albuterol, glipizide, ibuprofen, paroxetine
## 4                                           methylphenidate, glipizide
## 5                                                albuterol, paroxetine
## 6 lansoprazole, glipizide, paroxetine, albuterol, buspirone, oxycodone
##   systolic diastolic  temp   group drug_tracking_number
## 1      145       121  98.6 Group 3                 3530
## 2      190       160  98.2 Group 3                 4952
## 3       80        66  97.4 Group 1                 3686
## 4       85        41 100.5 Group 3                 4758
## 5      155       121  99.9 Group 2                 4486
## 6      185       175  98.8 Group 2                 4688
##   biosample_tracking_number
## 1                  41623579
## 2                  48504978
## 3                  35217177
## 4                  47865362
## 5                  46591053
## 6                  37574980

Row binding

What if we’re combining data, say, from different groups or contributors (like a multi-site study)? It’s easy to just “stack” data!

Let’s take our patient_vitals data frame and row bind it with a vitals_by_pt data frame taken from somewhere else (maybe another researcher at another hospital?). Note that the column names have to match!

vitals_by_pt <- data.frame(mrn = sample(c(200000:3000000),25, replace=FALSE),
                           systolic = sample(bp_systolic, 25, replace=TRUE),
                           diastolic = NA_integer_,
                           temp = sample(temp, 25, replace=TRUE))
vitals_by_pt$diastolic <- vitals_by_pt$systolic + sample(bp_diastolic_diff, 25, replace = TRUE)
all_vitals <- rbind(patient_vitals, vitals_by_pt)

How many rows does all_vitals have? What does it look like?

nrow(all_vitals)
## [1] 65
head(all_vitals)
##     mrn systolic diastolic  temp
## 1 62823      120       106 100.8
## 2 23509      125        95  99.2
## 3 41379      145       129  99.4
## 4 64610      145       113  97.8
## 5 73545      160       116  99.1
## 6 55675      125        93  97.7
tail(all_vitals)
##        mrn systolic diastolic  temp
## 60 1642168      145       111 100.8
## 61 2283114      145       115  98.2
## 62 1933633      160       126  97.6
## 63 1953473      125        79  99.9
## 64  808036      185       169  98.3
## 65  806383      190       164 100.1

Next steps

When you’re combining data, you’re unlikely to have such clean-cut examples of exact column name matches. Data is notoriously messy and you might need to do additional cleaning and preparation prior to combining data and afterwards. This kind of work, too, falls under the “munging” umbrella, and we’ll continue working on that in future posts!