 Rose Hartman

# How do I collapse data from several columns into one?

Do you have similar data stored across several different columns, and you’d like to combine it into a single column?

For example, perhaps you’re interested in comparing patients who have a diagnosed immunological disorder to those who don’t. Chances are, your raw data doesn’t include a column indicating the presence of an immunological disorder — instead, you might have a set of columns each representing a different diagnosis code, some of which correspond to immunological disorders and some which don’t. How do you summarize this information to get the column you need?

## Using rowSums to collapse across columns

This is a case where an unusual application of the rowSums function can be useful. When I think of sums, my first thought is applying it to number data — for example, if I want to know how the total cost of a patient’s visit, I might use rowSums to add up the costs associated with each part of the visit to get the total — but you can also use rowSums to count things that might not seem like numbers at first glance.

As an example, consider the following made-up data:

# this is the code to generate the fake data
n <- 100
data <- data.frame(MRN = sample(10000:99999, size = n),
D80 = sample(c(0,1), size = n, replace = TRUE, prob = c(.8, .2)),
D81 = sample(c(0,1), size = n, replace = TRUE, prob = c(.8, .2)),
D82 = sample(c(0,1), size = n, replace = TRUE, prob = c(.8, .2)),
D83 = sample(c(0,1), size = n, replace = TRUE, prob = c(.8, .2)),
D84 = sample(c(0,1), size = n, replace = TRUE, prob = c(.8, .2)),
D86 = sample(c(0,1), size = n, replace = TRUE, prob = c(.8, .2)),
D89 = sample(c(0,1), size = n, replace = TRUE, prob = c(.8, .2)))


And here are the first several rows of the data, so you can see what it looks like. Note that we have ICD10 codes as columns, and for each patient (one row), there’s either a 1 or a 0 for each ICD10 code, indicating whether or not they have that diagnosis (note that these data are all fake, so there’s no real health information here).

patient_id D80 D81 D82 D83 D84 D86 D89
32556 0 0 0 0 0 0 0
82056 1 0 0 0 0 0 1
68418 1 1 0 0 0 0 1
24072 0 0 0 0 0 0 0
61351 0 0 1 0 1 0 0
68999 0 0 1 0 0 0 0

Let’s create a new column that just indicates whether each patient has any of these ICD10 codes. We’ll use the select function from the dplyr package, which is part of the tidyverse, a set of R packages for doing data science (read one of our previous posts for a good overview of dplyr). You’ll also see I use a %>%, which is called a “pipe”. If you haven’t seen that before, take a look at the explanation of pipes in the free online book R for Data Science.

data$sum_dx <- data %>% select(D80, D81, D82, D83, D84, D86, D89) %>% rowSums(na.rm = TRUE)  Let’s look at that code line by line. We’ll be saving the output of the code to a column in data called sum_dx, as indicated by the first bit, data$sum_dx <- . This code starts with the data dataframe, and passes it with a pipe to the select function. The select function pulls out just the columns you ask for, in this case all of the ICD10 columns. Then we pass that output with another pipe to the rowSums function, which gets the sum across each row. I set the argument na.rm = TRUE in rowSums to tell it to skip over any missing values. If you leave this at its default, then if a patient has a missing value for any of the ICD10 columns, you’ll end up with NA for their sum, which is probably not what we want here.

Note that select gives you several options for how to specify columns. In this case, we could have used select(-patient_id) instead, which would give us every column except patient_id. We could also have used select(starts_with("D8")) to give us every column that begins with “D8”. Either of those approaches will yield the same data we end up with here in this case. The best approach for you may depend on what your data are like.

If we run that code, here are the first several rows of our updated data object:

patient_id D80 D81 D82 D83 D84 D86 D89 sum_dx
32556 0 0 0 0 0 0 0 0
82056 1 0 0 0 0 0 1 2
68418 1 1 0 0 0 0 1 3
24072 0 0 0 0 0 0 0 0
61351 0 0 1 0 1 0 0 2
68999 0 0 1 0 0 0 0 1

This is almost what we need, but not quite. It would be better if we had a column that just cleanly indicated whether the patient had any of the diagnoses, rather than the number of diagnoses they have.

## Using if_else

We can convert our sum column into a yes/no column indicating whether the patient has any diagnoses by using a logical comparison. We’ll write out a condition (“is sum_dx greater than 0?”), and tell R to record “yes” if the condition is true and “no” if it’s false for each row. We’ll use the if_else function from the dplyr package. We’ll use mutate to save the results as a new column.

data <- mutate(data, any_dx = if_else(condition = sum_dx > 0, true = "yes", false = "no"))


Note there are two very similar functions in R for doing this kind of thing: if_else (the one we used here), and ifelse. You can use either for most situations, but I recommend if_else because it comes with a built-in option for handling missing data. You can get some unexpected results from ifelse if you forget to take into account that your data might have missing values.

For each row in the data, R will check whether the value for sum_dx is greater than 0, and if so, it will record “yes” for a new variable any_dx, otherwise it will record “no” (and if there were missing values for sum_dx, those will show up as missing still in any_dx).

Here’s what the first several rows of our data look like now:

patient_id D80 D81 D82 D83 D84 D86 D89 sum_dx any_dx
32556 0 0 0 0 0 0 0 0 no
82056 1 0 0 0 0 0 1 2 yes
68418 1 1 0 0 0 0 1 3 yes
24072 0 0 0 0 0 0 0 0 no
61351 0 0 1 0 1 0 0 2 yes
68999 0 0 1 0 0 0 0 1 yes

Great! This looks like what we need.