```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE) ``` "Munging" (rhymes with "lunging") is the data professional's slang for cleaning and reshaping data. Data munging is a broad subject, but here we're talking about __combining__ data. Data combination can take several forms: * Merging -- when you have some identifier(s) like MRN or subject ID, and multiple datasets, like "patient_insurance_info", "patient_vitals", and "patient_meds". * Column binding -- when you have new data that already lines up row-to-row and you need to just slide those new columns into place beside existing columns * Row binding -- when you have new data that already lines up column-to-column and you need to append data to the bottom of an existing dataset Let's take each one in turn. ## Merging Merging data is like a Venn Diagram. Let's take the example of our patient data. We have information about various patients in three tables (data frames in R): * patient_insurance_info * patient_vitals * patient_meds You can have various levels of overlaps -- some patients will exist in only one of these, others in two, others in all three. You can choose which overlap you want. Maybe it looks something like this? ```{r echo = FALSE} library(VennDiagram) # A more complicated diagram venn.plot <- draw.triple.venn( area1 = 40, area2 = 40, area3 = 40, n12 = 30, n23 = 30, n13 = 30, n123 = 20, category = c("patient_insurance_info", "patient_vitals", "patient_meds"), fill = c("light blue", "light green", "light yellow"), lty = "blank", margin = 0.2, cex = 0, cat.cex = 1 ); grid.draw(venn.plot); grid.newpage(); ``` Let's practice. First, I'll generate some fake data fields. Don't worry about this code, unless you want to learn how to generate fake data quickly! ```{r} mrns <- seq(11000, 99000, by=1787) # gives us 50 fake MRNS bp_systolic <- seq(80,190, by=5) # gives us fake systolic bps bp_diastolic_diff <- seq(-50, -10, by = 2) # gives us how much less diastolic will be temp <- seq(97,101, by = 0.1) # gives us fake temps insurance <- c("uninsured", "Aetna", "Medicaid", "IBX", "Cigna") # gives us fake insurance meds <- c("albuterol", "buspirone", "oxycodone", "atomoxetine", "amoxicillin", "ibuprofen", "acetaminophen", "cetirizine", "lansoprazole", "paroxetine", "methylphenidate", "glipizide") # gives us fake meds ``` And now some fake datasets. We have 50 MRNS, so I'll put 40 of each in the three data frames. That ensures a bit of overlap. Again, this is not something you have to understand 100%, I'm just generating some data frames that contain fabricated data. ```{r} set.seed(42) patient_insurance_info <- data.frame(mrn=sample(mrns, 40, replace=FALSE), insurance=sample(insurance,40, replace=TRUE)) patient_vitals <- data.frame(mrn=sample(mrns, 40, replace=FALSE), systolic = sample(bp_systolic,40, replace=TRUE), diastolic = NA_integer_, temp = sample(temp, 40, replace=TRUE)) patient_vitals$diastolic <- patient_vitals$systolic + sample(bp_diastolic_diff, 40, replace = TRUE) patient_meds <- data.frame(mrn=sample(mrns, 40, replace=FALSE), meds = replicate(40, paste0(sample(meds, sample(0:7,1), replace=FALSE), collapse = ", "))) ``` Let's peek: ```{r} head(patient_insurance_info) head(patient_vitals) head(patient_meds) ``` ### Doing the merge To merge data, you need one or more columns that act as the "pivot" or "match" to line up the rest of the data. In our case, we just have one identifying column, the MRN. In other use cases, you might want to line up data on both the MRN and the encounter ID, or something like that. You also need to know what overlap of data you want. Do you want only the patients that have data in all three tables? Or all patients who have any data at all? Or patients that have both vitals and meds, but insurance is optional? We merge one table in at a time, so you can choose the overlap each time. Let's start by merging so that we end up only with the patients that have all three kinds of data. That's the default behavior of `merge`, so we don't have to add any extra fields to indicate that. ```{r} merged_1 <- merge(patient_insurance_info, patient_meds) merged_1 <- merge(merged_1, patient_vitals) ``` How many rows does merged_1 have? What does it look like? ```{r} nrow(merged_1) head(merged_1) ``` What if we wanted everyone, regardless of any missing data? I can set `all=TRUE` to say "hang on to all data, even if there's no overlap". ```{r} merged_2 <- merge(patient_insurance_info, patient_meds, all = TRUE) merged_2 <- merge(merged_2, patient_vitals, all=TRUE) ``` How many rows does merged_2 have? What does it look like? ```{r} nrow(merged_2) head(merged_2) ``` The "NA" here indicates missing data! Finally, what if we want to say that patients have to have meds, but not vitals or insurance, to be included? Here we'll set one dataset to be "x", another to be "y", and we can say that all.x=TRUE (keep all the x's, but only keep y's that also exist in x), or all.y=TRUE (similar idea). ```{r} merged_3 <- merge(x=patient_insurance_info, y=patient_meds, all.y = TRUE) merged_3 <- merge(merged_3, patient_vitals, all=TRUE) ``` How many rows does merged_3 have? What does it look like? ```{r} nrow(merged_3) head(merged_3) ``` ## Column binding Let's take one of our combined datasets, say `merged_1` (26 patients), and column bind another set of columns to it. Let's say we have a dataframe that includes a randomly assigned group, for a clinical trial, as well as some tracking numbers that will be assigned to each patient and included in barcodes on drug and biosamples. We don't have to line this up with any field in the `merged_1` dataset. ```{r} set.seed(42) randomization <- data.frame(group = sample(c("Group 1", "Group 2", "Group 3"), 26, replace = TRUE), drug_tracking_number = sample(c(2453:5213), 26, replace = FALSE), biosample_tracking_number = sample(c(34523412:52341297), 26, replace = FALSE)) head(randomization) ``` We can just line these up, side by side, like this: [A] [B] -> [AB] ```{r} col_bound <- cbind(merged_1, randomization) head(col_bound) ``` ## Row binding What if we're combining data, say, from different groups or contributors (like a multi-site study)? It's easy to just "stack" data! Let's take our `patient_vitals` data frame and row bind it with a `vitals_by_pt` data frame taken from somewhere else (maybe another researcher at another hospital?). Note that the column names have to match! ```{r} vitals_by_pt <- data.frame(mrn = sample(c(200000:3000000),25, replace=FALSE), systolic = sample(bp_systolic, 25, replace=TRUE), diastolic = NA_integer_, temp = sample(temp, 25, replace=TRUE)) vitals_by_pt$diastolic <- vitals_by_pt$systolic + sample(bp_diastolic_diff, 25, replace = TRUE) all_vitals <- rbind(patient_vitals, vitals_by_pt) ``` How many rows does all_vitals have? What does it look like? ```{r} nrow(all_vitals) head(all_vitals) tail(all_vitals) ``` ## Next steps When you're combining data, you're unlikely to have such clean-cut examples of exact column name matches. Data is notoriously messy and you might need to do additional cleaning and preparation prior to combining data and afterwards. This kind of work, too, falls under the "munging" umbrella, and we'll continue working on that in future posts!