Sheila Braun
4 min read

Tiny Munge

The Problem

We have to solve a problem for the principal investigator of a study: Each individual (a.k.a. “subject” a.k.a. “participant”) in a data file has three or more records, only one of which has a value in the variable randomization_arm. That value needs to populate randomization_arm in the other records for the same subject-slash-participant-slash-individual. How do we do it?

Create a Sample Data Set (in a Tibble, of Course)

Enter the example data into a data set. Normally, this would be the place for a get_data function that reaches out to REDCap and downloads the latest data set. See Joy Payton’s article about how to access and download REDCap data and how to include the download in your own code. For now, we’ll use a code chunk that simply duplicates the example we received from the study team. The code chunk also formats the variables—which is an important step, especially for the variables containing dates.

shhh(library(lubridate)) # `shhh` is a coy of the function
                         # `suppressPackageStartupMessages` that takes less
                         # time to type and is easier to remember.

# Create a duplicate of the sample data set the investigator's team sent us

df <- tibble(Record_id = c(1, 1, 1, 2, 2, 2),
             event = c(1, 2, 3, 1, 2, 3),
             randomization_arm = c(1, NA, NA, 2, NA, NA),
             lab_draw_date = c("4/29/18",
                               "10/6/18",
                               "11/4/18",
                               "5/6/18",
                               "11/15/18",
                               "12/29/18"))

# then make sure variables are recognized as what they are (format them)
df$Record_id <- factor(df$Record_id) # categorical variable
df$event <- factor(df$event) # categorical variable
df$randomization_arm <- factor(df$randomization_arm) # categorical variable
# do NOT run this next line twice. Doing so will make all the values go belly up.
df$lab_draw_date <- mdy(df$lab_draw_date)  # date

# have a look
df
## # A tibble: 6 x 4
##   Record_id event randomization_arm lab_draw_date
##   <fct>     <fct> <fct>             <date>       
## 1 1         1     1                 2018-04-29   
## 2 1         2     <NA>              2018-10-06   
## 3 1         3     <NA>              2018-11-04   
## 4 2         1     2                 2018-05-06   
## 5 2         2     <NA>              2018-11-15   
## 6 2         3     <NA>              2018-12-29

See those NAs? They should be the same as the value in the first line for each individual. The first two should be 1s and the second two should be 2s, and so on through the rest of the file. We don’t have the rest of the file, but whatever we do for these six lines needs to scale to a larger data file with the same format.

Options for Solving the Problem

Our goal is to populate randomization_arm correctly in all the rows where it isn’t populated correctly. Because we are working in R, there are of course a few (or more than a few) different ways to do anything.

The Widening Option

We could make the data wider by having a single subject with, say, three different variables for event (e.g., event_1, event_2, event_3), three different variables for lab_draw_date, and so on. PRO: One record per ID. CON: Uses spread() and gather() functions, which are amazingly opaque to any new user of R.

The Widening Option Followed by the Narrowing Option

Rather than stopping with the data in wide format, we could collapse it back down afterwards into its initial set of variables, but randomization_arm could keep the value it got from widening the data. I’m a little fuzzy about this, but it strikes me as possible. Same pros and cons as previous, with the additional pro that it results in exactly what is being asked for—always a good thing.

The Looping Option

We could run the data through a loop with logic such as, for instance, “If this record has the same Record_id as any other record we have ever seen, and if randomization_arm is NA, then please place the value of the previous record with the same Record_id’s randomization_arm into this particular randomization_arm. PRO: There isn’t one. It’s a crap option that uses a lot of calculation time and space. This is R, not some silly language in which the only way to get things done is to write loops. CON: See PRO.

The Rearranging Option

Make some other variable the “individual.” Right now the individual is Record_id, and perhaps that shouldn’t be the focus of the data set. PRO: Maybe this would force a more reality-matching approach to the data structure. CON: No matter what we do in this vein, some variable will have repeated values. One could go to a lot of trouble and end up with no real advantages.

The Split Data Option

We could create several smaller tables, for instance Events, Dates, Participant_info, and so on. The key variable, used in each table so we could match them back up again, would be Record_id. A statistician can merge tables before making any calculations so that the analysis table would contain only what is needed for the calculation. PRO: This is the best way to store the data, taking up the least amount of space because no duplicates are tolerated by the dataset design. CON: It doesn’t necessarily follow that this is the best way to work with the data. Many statisticians would be annoyed at having to recombine data every five minutes.

The Mutating Option

More efficient than looping, less of an existential crisis than widening or rearranging, and more presentable than splitting, executing this tidyverse option means we would 1) create subgroups for each Record_id, then 2) have each record in the subgroup have matching values for randomization_arm, then 3) ungroup the subgroups. I like this option best and here it is.

shhh(library(dplyr))
shhh(library(zoo))

newdf <- df %>%
      group_by(Record_id) %>%
      arrange(desc(randomization_arm)) %>%
      mutate_each(funs(na.locf), matches("randomization_arm")) %>%
      ungroup %>%
      arrange(Record_id) %>%
      as_tibble
## `mutate_each()` is deprecated.
## Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead.
## To map `funs` over a selection of variables, use `mutate_at()`
newdf
## # A tibble: 6 x 4
##   Record_id event randomization_arm lab_draw_date
##   <fct>     <fct> <fct>             <date>       
## 1 1         1     1                 2018-04-29   
## 2 1         2     1                 2018-10-06   
## 3 1         3     1                 2018-11-04   
## 4 2         1     2                 2018-05-06   
## 5 2         2     2                 2018-11-15   
## 6 2         3     2                 2018-12-29

Despite the fact that mutate_each() has been deprecated, it still works. We safely (for now) ignore the resulting warnings. na.locf is a generic function for replacing each NA with the most recent non-NA prior to it. So the logic for the workhorse line, the one that adds the values we need, is “Change each record by applying the na.locf function to the variable randomization_arm.” The function group_by() makes sure that this is done for each Record_id—which is what we grouped on—rather than for each record as we plow through the whole file as a single group—thus safeguarding us from the mistake of copying a previous value for randomization_arm that perhaps belonged to another Record_id.

No—I Sum Up

Know your priorities (space, time, or readability). Try everything. Pick what you like. Then either comment the heck out of it or, as here, put it in a Markdown document so people (including Future You) know what you did.