Let’s say you want to look at two measurements on a patient or research participant, to find some relationship with them (like a score on an anxiety measure and triglyceride levels). But some patients have multiple records for one or both instruments or measures. How can you find the right records to use in your research? If you don’t limit which repeated measures you use, you’ll end up with more than one row per patient, which you don’t want.
Maybe you want the two measurements to be as close as possible in time. Or maybe you want the earliest anxiety measure, and whatever triglyceride is nearest it (say, 40 days earlier), even if that interval isn’t the smallest (say the patient later had an anxiety workup and lipids lab on the same day). There are lots of possibilities here, and they all depend on your research purpose.
Let’s consider some fake data I created for this purpose: 500 records of fake math scores and fake language scores.
The ID (between 1-1000) and dates (between Jan 1 2005 and Jan 1 2016, inclusive, were randomly uniformly distributed, while scores follow a triangular distribution with minimum of 25 (indicates full guessing) and max of 100, and a mode of 80. Math and Language scores are rounded.
Takers may have multiple administrations of either or both tests.
Obtain the data from my GitHub:
What’s the duplicate situation in each table? We can use dplyr to group by subject ID, find the how big each group is, and display that in a table by group size.
. 1 2 3 4 329 76 5 1
Lots of duplication here – one subject took the math test four times. Still, most (329) only have one admin.
. 1 2 3 295 86 11
Similar here – lots of dupes. If we were to merge these data frames by ID without limiting by dates, we would have geometric growth. For example, consider subject 697. This subject has 3 administrations each of math and language, and merge will match each possible combination, for 3 X 3 or 9 rows. This usually causes problems with analysis, which assumes a single observation or combination of observations per subject. So, how can we limit the number of rows returned when we combine the math and language scores to just one row per subject?
We can imagine several ways to accomplish this:
“Give me the earliest language score, if any, and the math score closest in time to that language score. If no language score, omit.”
“I don’t care if it’s a first test or a subsequent readministration, just give me the smallest time interval between a language and a math for each subject that has both.”
“Give me the closest math score and the closest language score to this subject’s MRI date, which I’ll give you.”
“Give me the closest math date after language – if math precedes language, I don’t want to see it.”
We have to keep a few things in mind: some instruments (like some psychometric instruments) should only really be administered once, so we will probably want to disregard administrations 2-n, no matter when they occurred. Some instruments, like lab results, have roughly defined “expiration dates” – they reflect a moment in time, and must be selected carefully to make sure you’re measuring what you want, when you want. An expired measurement may be worse than no data at all in this case. Some instruments might need to be administered in order – for example, we might want an instrument that requires deception to be administered before other measures that tend to tip participants off to the purpose of the deceptive instrument. It is very important to understand our precise research goals as well as the specifications of the measurement when working with data that needs to have multiple administrations resolved.
Let’s take a look at some common tactics with date-based resolution!
We begin by making sure that dates are dates:
OPTION 1: Closest pairs
Let’s consider what we would do if we just wanted to get the closest date-matched set of math and language scores – maybe we are doing research into the correlation of scores between the two tests. If subjects have only one of the two, leave them out of the set.
First we merge. R’s merge command by default only takes the intersection (the overlap in the Venn diagram) of two datasets, as defined by their “by” field. Since our two data frames have the same column name for the date of administration, we’ll want to add suffixes to clarify which is which.
Check out the data frame:
subjectID testDate.math mathScore testDate.lang languageScore 1 12 2006-06-24 64 2009-10-30 86 2 20 2008-08-23 72 2015-12-17 62 3 20 2008-08-23 72 2008-10-14 56 4 35 2015-11-25 76 2005-04-04 82 5 45 2005-05-03 71 2010-08-06 53 6 48 2007-03-24 83 2007-01-02 63
pairs data frame consists of rows that each have a math and a lang score, but some subjects have multiple rows. Confirm this by asking dplyr to group, count, and then display in descending order the number of rows for each subject.
# A tibble: 161 x 2 subjectID size <int> <int> 1 697 9 2 119 4 3 216 4 4 329 4 5 726 4 6 878 4 7 985 4 8 262 3 9 485 3 10 623 3 # ... with 151 more rows
Yep, the dratted 697 is there. How will we select the closest pair for each subject?
One easy way is to use dplyr. We’ve already merged all the options, so now we just want to filter pairs such that within each ID, the only row we keep is the one with the smallest time interval. That’s fairly simple. We don’t care about order, so we just look for the smallest absolute value in date difference, or, to say it another way, the difference that is the same as the minimum difference for this ID.
Note that we have to think about ties – say, there’s a math test 200 days prior to and one 200 days after the language test. In our case, it doesn’t matter which one we pick, so we just take the first one, using dplyr’s
slice(). Finally, we’ll ungroup the resultant data frame.
How big is this data?
 161 5
closestPairs only has 161 rows, as compared to 250 for
pairs. Let’s confirm that there are no duplications:
All false. We have a data frame that consists of only the subjects that have both tests, and from those administrations, only the pair that has the smallest interval of time.
OPTION 2: Earliest math and nearest language (if any)
Here we want all subjects that have a math score. We’ll take their first one, then find the nearest language score (if there is one) to that.
Here, we’ll merge more than just the intersection – we’ll also include cases in which math is present but language is absent. In other words, we want ALL the math rows, regardless of whether there’s a corresponding language row. We can’t say the same about language.
Now we’ll do something similar to option 1 above, as far as selecting the smallest absolute value, but we’ll have to include cases where there is no date distance (it’s na), because there isn’t a language date to use to calculate from. We use the vertical pipe (
|) to symbolize OR: is.na(testDate.lang - testDate.math)`.
Again, let’s look at how large our data is.
 411 5
firstMathMaybeLanguage has 411 rows. Let’s confirm that there are no duplicates:
Great, we have data with no duplicated subjects.