How do I collapse data from several columns into one?
Do you have similar data stored across several different columns, and you’d like to combine it into a single column?
For example, perhaps you’re interested in comparing patients who have a diagnosed immunological disorder to those who don’t. Chances are, your raw data doesn’t include a column indicating the presence of an immunological disorder — instead, you might have a set of columns each representing a different diagnosis code, some of which correspond to immunological disorders and some which don’t. How do you summarize this information to get the column you need?
Using rowSums
to collapse across columns
This is a case where an unusual application of the rowSums
function can be useful. When I think of sums, my first thought is applying it to number data — for example, if I want to know how the total cost of a patient’s visit, I might use rowSums
to add up the costs associated with each part of the visit to get the total — but you can also use rowSums
to count things that might not seem like numbers at first glance.
As an example, consider the following made-up data:
# this is the code to generate the fake data
n <- 100
data <- data.frame(MRN = sample(10000:99999, size = n),
D80 = sample(c(0,1), size = n, replace = TRUE, prob = c(.8, .2)),
D81 = sample(c(0,1), size = n, replace = TRUE, prob = c(.8, .2)),
D82 = sample(c(0,1), size = n, replace = TRUE, prob = c(.8, .2)),
D83 = sample(c(0,1), size = n, replace = TRUE, prob = c(.8, .2)),
D84 = sample(c(0,1), size = n, replace = TRUE, prob = c(.8, .2)),
D86 = sample(c(0,1), size = n, replace = TRUE, prob = c(.8, .2)),
D89 = sample(c(0,1), size = n, replace = TRUE, prob = c(.8, .2)))
And here are the first several rows of the data, so you can see what it looks like. Note that we have ICD10 codes as columns, and for each patient (one row), there’s either a 1 or a 0 for each ICD10 code, indicating whether or not they have that diagnosis (note that these data are all fake, so there’s no real health information here).
patient_id | D80 | D81 | D82 | D83 | D84 | D86 | D89 |
---|---|---|---|---|---|---|---|
32556 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
82056 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
68418 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
24072 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
61351 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
68999 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
Let’s create a new column that just indicates whether each patient has any of these ICD10 codes. We’ll use the select
function from the dplyr package, which is part of the tidyverse, a set of R packages for doing data science (read one of our previous posts for a good overview of dplyr). You’ll also see I use a %>%
, which is called a “pipe”. If you haven’t seen that before, take a look at the explanation of pipes in the free online book R for Data Science.
data$sum_dx <- data %>%
select(D80, D81, D82, D83, D84, D86, D89) %>%
rowSums(na.rm = TRUE)
Let’s look at that code line by line. We’ll be saving the output of the code to a column in data
called sum_dx
, as indicated by the first bit, data$sum_dx <-
. This code starts with the data
dataframe, and passes it with a pipe to the select
function. The select
function pulls out just the columns you ask for, in this case all of the ICD10 columns. Then we pass that output with another pipe to the rowSums
function, which gets the sum across each row. I set the argument na.rm = TRUE
in rowSums
to tell it to skip over any missing values. If you leave this at its default, then if a patient has a missing value for any of the ICD10 columns, you’ll end up with NA
for their sum, which is probably not what we want here.
Note that
select
gives you several options for how to specify columns. In this case, we could have usedselect(-patient_id)
instead, which would give us every column exceptpatient_id
. We could also have usedselect(starts_with("D8"))
to give us every column that begins with “D8”. Either of those approaches will yield the same data we end up with here in this case. The best approach for you may depend on what your data are like.
If we run that code, here are the first several rows of our updated data
object:
patient_id | D80 | D81 | D82 | D83 | D84 | D86 | D89 | sum_dx |
---|---|---|---|---|---|---|---|---|
32556 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
82056 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 |
68418 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 3 |
24072 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
61351 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 2 |
68999 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
This is almost what we need, but not quite. It would be better if we had a column that just cleanly indicated whether the patient had any of the diagnoses, rather than the number of diagnoses they have.
Using if_else
We can convert our sum column into a yes/no column indicating whether the patient has any diagnoses by using a logical comparison. We’ll write out a condition (“is sum_dx
greater than 0?”), and tell R to record “yes” if the condition is true and “no” if it’s false for each row. We’ll use the if_else function from the dplyr package. We’ll use mutate
to save the results as a new column.
data <- mutate(data, any_dx = if_else(condition = sum_dx > 0, true = "yes", false = "no"))
Note there are two very similar functions in R for doing this kind of thing:
if_else
(the one we used here), andifelse
. You can use either for most situations, but I recommendif_else
because it comes with a built-in option for handling missing data. You can get some unexpected results fromifelse
if you forget to take into account that your data might have missing values.
For each row in the data, R will check whether the value for sum_dx
is greater than 0, and if so, it will record “yes” for a new variable any_dx
, otherwise it will record “no” (and if there were missing values for sum_dx
, those will show up as missing still in any_dx
).
Here’s what the first several rows of our data look like now:
patient_id | D80 | D81 | D82 | D83 | D84 | D86 | D89 | sum_dx | any_dx |
---|---|---|---|---|---|---|---|---|---|
32556 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | no |
82056 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | yes |
68418 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 3 | yes |
24072 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | no |
61351 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 2 | yes |
68999 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | yes |
Great! This looks like what we need.
Further reading
We have several other posts about manipulating data in R:
For a curated list of general resources for learning R, see our R 101 guide.