How do I collapse data from several columns into one?
Do you have similar data stored across several different columns, and you’d like to combine it into a single column?
For example, perhaps you’re interested in comparing patients who have a diagnosed immunological disorder to those who don’t. Chances are, your raw data doesn’t include a column indicating the presence of an immunological disorder — instead, you might have a set of columns each representing a different diagnosis code, some of which correspond to immunological disorders and some which don’t. How do you summarize this information to get the column you need?
rowSums to collapse across columns
This is a case where an unusual application of the
rowSums function can be useful. When I think of sums, my first thought is applying it to number data — for example, if I want to know how the total cost of a patient’s visit, I might use
rowSums to add up the costs associated with each part of the visit to get the total — but you can also use
rowSums to count things that might not seem like numbers at first glance.
As an example, consider the following made-up data:
# this is the code to generate the fake data n <- 100 data <- data.frame(MRN = sample(10000:99999, size = n), D80 = sample(c(0,1), size = n, replace = TRUE, prob = c(.8, .2)), D81 = sample(c(0,1), size = n, replace = TRUE, prob = c(.8, .2)), D82 = sample(c(0,1), size = n, replace = TRUE, prob = c(.8, .2)), D83 = sample(c(0,1), size = n, replace = TRUE, prob = c(.8, .2)), D84 = sample(c(0,1), size = n, replace = TRUE, prob = c(.8, .2)), D86 = sample(c(0,1), size = n, replace = TRUE, prob = c(.8, .2)), D89 = sample(c(0,1), size = n, replace = TRUE, prob = c(.8, .2)))
And here are the first several rows of the data, so you can see what it looks like. Note that we have ICD10 codes as columns, and for each patient (one row), there’s either a 1 or a 0 for each ICD10 code, indicating whether or not they have that diagnosis (note that these data are all fake, so there’s no real health information here).
Let’s create a new column that just indicates whether each patient has any of these ICD10 codes. We’ll use the
select function from the dplyr package, which is part of the tidyverse, a set of R packages for doing data science (read one of our previous posts for a good overview of dplyr). You’ll also see I use a
%>%, which is called a “pipe”. If you haven’t seen that before, take a look at the explanation of pipes in the free online book R for Data Science.
data$sum_dx <- data %>% select(D80, D81, D82, D83, D84, D86, D89) %>% rowSums(na.rm = TRUE)
Let’s look at that code line by line. We’ll be saving the output of the code to a column in
sum_dx, as indicated by the first bit,
data$sum_dx <- . This code starts with the
data dataframe, and passes it with a pipe to the
select function. The
select function pulls out just the columns you ask for, in this case all of the ICD10 columns. Then we pass that output with another pipe to the
rowSums function, which gets the sum across each row. I set the argument
na.rm = TRUE in
rowSums to tell it to skip over any missing values. If you leave this at its default, then if a patient has a missing value for any of the ICD10 columns, you’ll end up with
NA for their sum, which is probably not what we want here.
selectgives you several options for how to specify columns. In this case, we could have used
select(-patient_id)instead, which would give us every column except
patient_id. We could also have used
select(starts_with("D8"))to give us every column that begins with “D8”. Either of those approaches will yield the same data we end up with here in this case. The best approach for you may depend on what your data are like.
If we run that code, here are the first several rows of our updated
This is almost what we need, but not quite. It would be better if we had a column that just cleanly indicated whether the patient had any of the diagnoses, rather than the number of diagnoses they have.
We can convert our sum column into a yes/no column indicating whether the patient has any diagnoses by using a logical comparison. We’ll write out a condition (“is
sum_dx greater than 0?”), and tell R to record “yes” if the condition is true and “no” if it’s false for each row. We’ll use the if_else function from the dplyr package. We’ll use
mutate to save the results as a new column.
data <- mutate(data, any_dx = if_else(condition = sum_dx > 0, true = "yes", false = "no"))
Note there are two very similar functions in R for doing this kind of thing:
if_else(the one we used here), and
ifelse. You can use either for most situations, but I recommend
if_elsebecause it comes with a built-in option for handling missing data. You can get some unexpected results from
ifelseif you forget to take into account that your data might have missing values.
For each row in the data, R will check whether the value for
sum_dx is greater than 0, and if so, it will record “yes” for a new variable
any_dx, otherwise it will record “no” (and if there were missing values for
sum_dx, those will show up as missing still in
Here’s what the first several rows of our data look like now:
Great! This looks like what we need.
We have several other posts about manipulating data in R:
For a curated list of general resources for learning R, see our R 101 guide.