Getting to One Row
Let’s say you’ve requested data from the Clinical Reporting Unit (CRU), or a research assistant in your lab, or a research collaborator. It may come in several tables, each of which has rich, interesting data. But you want the data in a flattened format, just one row per subject. You mention this, and there’s a bit of difficulty in getting you exactly what you want. Why is this the case?
Let’s recall, first, that human-centric data in particular is very messy. People have varying vital signs measured at different times, may have completed questionnaires more than once, take anywhere from one to dozens of medications, etc. It’s hard to come up with a perfect schema (set of columns / shape of dataset) that fits all cases. That’s why data professionals will often give you raw data in various datasets, instead of combining them into one. The choice of how to combine the data is really up to the subject matter expert (you!), and there’s no single right way to do it.
For example, let’s take the case of combining surgical patients and their antibiotic use (note, I’m not a clinician, forgive my terrible prescribing practices).
We have a (very simplified) surgical procedure table that starts like this:
And an antibiotic use table (again, very simplified) like this:
|1||2016-03-04||DOXYCYCLINE MONOHYDRATE 100 MG OR TABS|
|1||2016-03-05||DOXYCYCLINE MONOHYDRATE 100 MG OR TABS|
|1||2016-03-06||DOXYCYCLINE MONOHYDRATE 100 MG OR TABS|
|2||2017-08-18||AMPICILLIN-SULBACTAM 30 MG AMP/ML (NSS) INJECTION CUSTOM|
|2||2017-08-18||AZITHROMYCIN 250 MG OR TABS|
|4||2018-03-21||METRONIDAZOLE 250 MG OR TABS|
What do we notice? First of all, it seems that the surgical procedure table has one row per patient. But is that really true? Is it possible that patients appear twice because of different procedures on different dates? Or because of multiple procedures on the same date? Or because a procedure spanned midnight? We have to check our assumptions (which you can do easily in a statistical programming environment like R).
What else do we notice? The antibiotic use table has multiple rows for some patients, missing rows for others, and single row for one patient. One patient received a single antibiotic over several days, another patient had two different antibiotics on the same day, and a third patient had an antibiotic prescribed well past the date of surgery – is this relevant?
Generally, when we combine data from multiple tables, linking by some common field (like the PT column), we get a Cartesian product, such that each combination of tables grows the number of rows, geometrically (i.e. by multiplication). So since PT 1 has one row in the surgical procedure table and 3 rows in the antibiotics table, PT 1 would have 1 X 3, or 3, rows in the result table. This is what that looks like:
|1||2016-03-04||2016-03-04||DOXYCYCLINE MONOHYDRATE 100 MG OR TABS|
|1||2016-03-04||2016-03-05||DOXYCYCLINE MONOHYDRATE 100 MG OR TABS|
|1||2016-03-04||2016-03-06||DOXYCYCLINE MONOHYDRATE 100 MG OR TABS|
|2||2017-08-17||2017-08-18||AMPICILLIN-SULBACTAM 30 MG AMP/ML (NSS) INJECTION CUSTOM|
|2||2017-08-17||2017-08-18||AZITHROMYCIN 250 MG OR TABS|
|4||2018-01-30||2018-03-21||METRONIDAZOLE 250 MG OR TABS|
Often that combinatorial approach may be exactly what you want (say, if you were predicting how long after surgery various antibiotics tended to be prescribed). But in this case, let’s assume it’s not.
If you said to your data wrangler, “can you flatten this”, how could they cram all of that multi-row data related to antibiotics into a single row per patient?
One way would be to make each repetition in the antibiotics table its own set of columns:
|1||2016-03-04||2016-03-04||DOXYCYCLINE MONOHYDRATE 100 MG OR TABS||2016-03-05||DOXYCYCLINE MONOHYDRATE 100 MG OR TABS||2016-03-06||DOXYCYCLINE MONOHYDRATE 100 MG OR TABS|
|2||2017-08-17||2017-08-18||AMPICILLIN-SULBACTAM 30 MG AMP/ML (NSS) INJECTION CUSTOM||2017-08-18||AZITHROMYCIN 250 MG OR TABS|
|4||2018-01-30||2018-03-21||METRONIDAZOLE 250 MG OR TABS|
The evident problem with this is that it:
- causes huge growth in the number of columns (imagine if there were more details in the antibiotics table, and one patient had 17 separate antibiotics administrations!)
- may make statistical analysis difficult because of sparsity, aka missingness (most patients only have 1-2 administrations, but the most highly-medicated patient has 17, so that means everyone has to have 17 sets of antibiotics data, most of which are empty)
So what other options are there?
I always suggest to researchers that they hone their criteria and data definitions. So, instead of:
- “Can you give me a list of orthopedic surgical patients and the dates of their surgeries for the past three years?”
You might want to consider:
- “Can you give me a list of orthopedic surgical patients for the past three years who have had only one orthopedic surgery, and the date that surgery ended?” (Handles repeats and surgeries spanning 2 calendar days)
Instead of saying:
- “I want the antibiotics they were prescribed post-surgery.”
You might want to ask, instead:
- “Please also give me any antibiotics prescribed to the patient within 2 weeks of the date of the end of surgery.”
When you’re thinking about integrating the procedure table and the antibiotics table, don’t just say “I want antibiotics in columns, too”. Instead, you might say one or more of the following (to yourself, or to your data analyst):
- “I want to see the various possible classes of antibiotics as columns, with 0/1 coding for each patient, indicating whether they received this class in the two weeks post-surgery.” (benefit: sets the max number of columns to the number of antibiotic classes)
- “I want to see the total days of post-surgical antibiotic treatment as a column.” (benefit: will be one column only)
- “I want to see the number of distinct antibiotics used post-surgically as a column.” (benefit: will be one column only)
- “I want to see the number of days elapsed between surgery and first antibiotic rx as a column.” (benefit: will be one column only)
Think about what will help you in statistical modeling (or whatever your goal is) – what seems predictive? What data is needed to support or refute your hypothesis? If missing data will be a problem, how will you design your dataset so that you reduce or eliminate missing data?
Note that it’s much easier to do things like counts and classifications (we could call this kind of combination of multiple rows of data “aggregation”) on the raw data. This is why when data is extracted for you, it’s generally given to you without combining with other datasets, so that you can make the decisions on how to do that, based on your specific questions.