Let’s say you think there’s a correlation between two measures that you’re researching: the ABCD measure, which is a biological measure, and the XYZ phenotype, which is a measured by a questionnaire filled out by a reporter (parents or teachers). You ask your local data guru to wrangle various data sources and give you a combined dataset which includes the ABCD and XYZ scores, along with IQ, sex, race, lead exposure data, and a few other factors that you want to use for matching or blocking or just reporting to NIH. You expect to get data like this. Maybe a couple of missing data points, sure, and some subjects missing too, perhaps, but nothing terrible:

Subject ID sex race IQ lead ABCD XYZ
001 M white     0.9 62
003 M black 105   0.75 80
004 F black 90 pos 1.12 40
005 M white 85   0.88 70
006 F   120   0.73 85

A nice, clean, simple data set. But what you get instead has more than one row per subject:

Subject ID sex race IQ lead ABCD XYZ
001 M white     0.9 62
001 M       0.9 62
003 M black 105   0.75 80
004 F black 90 pos 1.12 40
004 F black 85 pos 1.12 40
004 F black 90 pos 1.0 40
004 F black 85 pos 1.0 40
005 M white 85   0.88 70
006 F   120   0.70 85
006 F   120   0.73 85
006 F asian 120   0.70 85
006 F asian 120   0.73 85
006 F   120   0.70 80
006 F   120   0.73 80
006 F asian 120   0.70 80
006 F asian 120   0.73 80

How can you do research on this? What, if anything, did your data guru do wrong?

One of the challenges of combining data from various measures is the geometric growth that happens in the number of rows when multiple values are available for one or more measures. This results in a Cartesian result set, where there’s not a single canonical row that is “the” row for the subject.

Why Multiple Rows Happen

Consider subject 006. She has two values listed for race – in one record, she has no race recorded, and in the other she is listed as Asian.

As an aside, the OMB, which sets reporting requirements for race by federal funders, asks that subjects be able to check all races that subjects consider themselves, so you should probably never just have one single column for a clear cut race category, but that’s fodder for an additional article!

Which value is correct? You might say that it’s obvious – the missing data should be supressed and the subject should be marked as asian. But is that actually true? What are the sources of the data? Could your subject have declined to offer her race, but someone keyed in something incorrect in a second entry? Did you give the “supress empty data” criterion to the person doing your data integration? Should missing data always be supressed? For every column, or just for some?

Subject 006 also has two separate values for ABCD – 0.73, and 0.70, and two values for XYZ – 80 and 85. Which values are valid? Probably they all are, they were just measured at two (or three, or four) different time points. Which one do you pick? The first in time? Or do you ask for the pair of ABCD and XYZ that has the smallest elapsed time between them? Does it matter which was measured first? What if not every measure has a date associated with it? Or do you want to use something else as your criterion, like “closest of each measure to IQ date”? Did you ask for date of administration? If you can’t have that because it’s PHI, could you get subject age at administration, in something sufficiently precise like months?

Without answering these questions, it’s impossible to know which combination of variables is the “right” one to study, and each combination is equally valid. Each value of ABCD could be matched with each value of XYZ, and each of those combinations could be matched with each race value. Two race values times two ABCD measurements times two XYZ values gives eight possible combinations. This combination of three different sets (race, ABCD, and XYZ) for subject 006 gives eight combinations. This is the Cartesian product.

This is Common

The scenario above is not farfetched, especially when you’re combining data from multiple sources. For example, let’s imagine you’re combining data from the EHR (the ABCD measure is usually collected at well visits, let’s say, around age 12, but some providers do it earlier or later) and a research project you’re doing on the XYZ phenotype. Both the EHR and your project collect race and sex, so there might be some disagreement there. Maybe your research was interventional, so you did a pre- and post- administration of XYZ (are there practice effects?). Or your subjects might have participated in multiple studies, in a field in which the XYZ measure is considered a gold standard. It’s no surprise each study wanted to use it, and your subject might have 2 (or 3, or 5) measurements at different ages.

Resolving Multiple Rows

A lot of the resolution around how to reduce data to one row per subject depends on the kind of research you’re doing and your subject matter expertise around the variables you’re collecting. If you know that there are practice effects or intervention effects associated with the XYZ, you need to mention your critera to your data analyst or data integration expert. “If there are multiple XYZs, I want the earliest administered one”. If you have an “anchor date” (like IQ administration date or date of surgery or tenth birthday) and want the closest administration date of each measure to that anchor date, mention that, and if the order of administration matters (“I want the closest ABCD and XYZ to the IQ date, even if that means one or both of the ABCD and XYZ were measured before the IQ measurement”). The more information you can supply to reduce the otherwise legitimate combinations to give you the single combination of data that best meets your needs is something that requires your input – it’s not something a statistical programmer or data wonk can solve on her own.