We have to solve a problem for the principal investigator of a study: Each individual (a.k.a. “subject” a.k.a. “participant”) in a data file has three or more records, only one of which has a value in the variable
randomization_arm. That value needs to populate
randomization_arm in the other records for the same subject-slash-participant-slash-individual. How do we do it?
Create a Sample Data Set (in a Tibble, of Course)
Enter the example data into a data set. Normally, this would be the place for a
get_data function that reaches out to REDCap and downloads the latest data set. See Joy Payton’s article about how to access and download REDCap data and how to include the download in your own code. For now, we’ll use a code chunk that simply duplicates the example we received from the study team. The code chunk also formats the variables—which is an important step, especially for the variables containing dates.
shhh(library(lubridate)) # `shhh` is a coy of the function # `suppressPackageStartupMessages` that takes less # time to type and is easier to remember. # Create a duplicate of the sample data set the investigator's team sent us df <- tibble(Record_id = c(1, 1, 1, 2, 2, 2), event = c(1, 2, 3, 1, 2, 3), randomization_arm = c(1, NA, NA, 2, NA, NA), lab_draw_date = c("4/29/18", "10/6/18", "11/4/18", "5/6/18", "11/15/18", "12/29/18")) # then make sure variables are recognized as what they are (format them) df$Record_id <- factor(df$Record_id) # categorical variable df$event <- factor(df$event) # categorical variable df$randomization_arm <- factor(df$randomization_arm) # categorical variable # do NOT run this next line twice. Doing so will make all the values go belly up. df$lab_draw_date <- mdy(df$lab_draw_date) # date # have a look df
## # A tibble: 6 x 4 ## Record_id event randomization_arm lab_draw_date ## <fct> <fct> <fct> <date> ## 1 1 1 1 2018-04-29 ## 2 1 2 <NA> 2018-10-06 ## 3 1 3 <NA> 2018-11-04 ## 4 2 1 2 2018-05-06 ## 5 2 2 <NA> 2018-11-15 ## 6 2 3 <NA> 2018-12-29
NAs? They should be the same as the value in the first line for each individual. The first two should be
1s and the second two should be
2s, and so on through the rest of the file. We don’t have the rest of the file, but whatever we do for these six lines needs to scale to a larger data file with the same format.
Options for Solving the Problem
Our goal is to populate
randomization_arm correctly in all the rows where it isn’t populated correctly. Because we are working in R, there are of course a few (or more than a few) different ways to do anything.
The Widening Option
We could make the data wider by having a single subject with, say, three different variables for event (e.g.,
event_3), three different variables for
lab_draw_date, and so on. PRO: One record per ID. CON: Uses
gather() functions, which are amazingly opaque to any new user of R.
The Widening Option Followed by the Narrowing Option
Rather than stopping with the data in wide format, we could collapse it back down afterwards into its initial set of variables, but
randomization_arm could keep the value it got from widening the data. I’m a little fuzzy about this, but it strikes me as possible. Same pros and cons as previous, with the additional pro that it results in exactly what is being asked for—always a good thing.
The Looping Option
We could run the data through a loop with logic such as, for instance, “If this record has the same
Record_id as any other record we have ever seen, and if
NA, then please place the value of the previous record with the same
randomization_arm into this particular
randomization_arm. PRO: There isn’t one. It’s a crap option that uses a lot of calculation time and space. This is R, not some silly language in which the only way to get things done is to write loops. CON: See PRO.
The Rearranging Option
Make some other variable the “individual.” Right now the individual is
Record_id, and perhaps that shouldn’t be the focus of the data set. PRO: Maybe this would force a more reality-matching approach to the data structure. CON: No matter what we do in this vein, some variable will have repeated values. One could go to a lot of trouble and end up with no real advantages.
The Split Data Option
We could create several smaller tables, for instance
Participant_info, and so on. The key variable, used in each table so we could match them back up again, would be
Record_id. A statistician can merge tables before making any calculations so that the analysis table would contain only what is needed for the calculation. PRO: This is the best way to store the data, taking up the least amount of space because no duplicates are tolerated by the dataset design. CON: It doesn’t necessarily follow that this is the best way to work with the data. Many statisticians would be annoyed at having to recombine data every five minutes.
The Mutating Option
More efficient than looping, less of an existential crisis than widening or rearranging, and more presentable than splitting, executing this
tidyverse option means we would 1) create subgroups for each
Record_id, then 2) have each record in the subgroup have matching values for
randomization_arm, then 3) ungroup the subgroups. I like this option best and here it is.
shhh(library(dplyr)) shhh(library(zoo)) newdf <- df %>% group_by(Record_id) %>% arrange(desc(randomization_arm)) %>% mutate_each(funs(na.locf), matches("randomization_arm")) %>% ungroup %>% arrange(Record_id) %>% as_tibble
## `mutate_each()` is deprecated. ## Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead. ## To map `funs` over a selection of variables, use `mutate_at()`
## # A tibble: 6 x 4 ## Record_id event randomization_arm lab_draw_date ## <fct> <fct> <fct> <date> ## 1 1 1 1 2018-04-29 ## 2 1 2 1 2018-10-06 ## 3 1 3 1 2018-11-04 ## 4 2 1 2 2018-05-06 ## 5 2 2 2 2018-11-15 ## 6 2 3 2 2018-12-29
Despite the fact that
mutate_each() has been deprecated, it still works. We safely (for now) ignore the resulting warnings.
na.locf is a generic function for replacing each
NA with the most recent non-
NA prior to it. So the logic for the workhorse line, the one that adds the values we need, is “Change each record by applying the
na.locf function to the variable
randomization_arm.” The function
group_by() makes sure that this is done for each
Record_id—which is what we grouped on—rather than for each record as we plow through the whole file as a single group—thus safeguarding us from the mistake of copying a previous value for
randomization_arm that perhaps belonged to another
No—I Sum Up
Know your priorities (space, time, or readability). Try everything. Pick what you like. Then either comment the heck out of it or, as here, put it in a Markdown document so people (including Future You) know what you did.