Welcome to the Tidyverse!
You may have heard of the so-called “Hadleyverse”: the ecosystem of R packages authored by Hadley Wickham with a specific approach to data organization (a philosophy I also subscribe to). I’ve often sung the praises of dplyr, the magrittr “pipe” (%>%), and ggplot2. These, along with other packages, form what is now termed the “tidyverse”. And now the tidyverse (Hadley told us he was grateful to shed the eponymous title) is more than a concept or loose affiliation: it’s a real thing.
If you use R, instead of installing and loading tidyr, dplyr, purrr, magrittr, ggplot2, etc. one at a time, you can simply do the following once:
```r
install.packages("tidyverse")
```
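And then this, at the top of every analysis script you have. It’ll load the core, most frequently used packages into your working environment; the rest of the tidyverse is installed but still has to be loaded package by package when you need it.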
```r
library(tidyverse)
```
You’ve probably heard talk before about what makes tidy data so important. If you need a refresher, there really isn’t a better article than Hadley’s Classic.
For those who don’t have time even for that, let’s briefly do a case study.
Let’s consider data that looks something like this, in a spreadsheet or journal article:
| | Mean QPT - pretreatment | | SD QPT - pretreatment | | Mean QPT - post-treatment | | SD QPT - post-treatment | |
|---|---|---|---|---|---|---|---|---|
| | m | f | m | f | m | f | m | f |
| Depression alone (n=9m, 7f) | 122 | 137 | 28.1 | 27.0 | 109 | 140 | 26.0 | 39.5 |
| Depression with Anxiety (n=12m,8f) | 130 | 145 | 25.0 | 19.8 | 103 | 142 | 24.9 | 40.1 |
| Controls / Neither Depression nor Anxiety (n=10m, 10f) | 107 | 110 | 15.8 | 13.9 | 88 | 95 | 21.8 | 20.6 |
| Anxiety Alone (n=13m, 15f) | 124 | 119 | 20.7 | 18.3 | 100 | 110 | 24.2 | 20.5 |
What distinct things are we measuring? Give it a good think before scrolling down.
I’d argue that we have:
- Depression / No Depression dx
- Anxiety / No Anxiety dx
- Sex
- Timepoint (pre- vs post- treatment)
- Mean QPT value
- SD QPT value
- n
This data is clearly not tidy:
- There is a lot of repetition in column and row names (we see m for male a total of 8 times and “post-treatment” twice)
- We are combining related variables (Depression and Anxiety diagnoses together, timepoint and measure together) and less well-related variables (like combined diagnoses and n)
While this may be concise enough to understand as a human, it’s a bit hard to use in programming. We’d have to pick apart a data frame like the one above in a different way for each of these questions (which complicates the code considerably):
- How does our n impact the SD of QPT?
- Do males or females have a larger pre- to post- treatment change, as reflected in mean?
- How many females with anxiety did we see? How many males with depression?
- Pre-treatment, is there greater score variability in people with or without depression? Is this different between those who have anxiety and those who don’t?
A “tidy” approach to data means, principally, putting discrete single observations in rows (an observation could be a single subject on a single day, or a homogeneous group in a single treatment phase, for example), and variables in columns.
A “tidier” approach is to structure the data with one column for each of these variables:
- Depression +/- dx
- Anxiety +/- dx
- Sex
- Timepoint (pre- vs post- treatment)
- Mean QPT value
- SD QPT value
- n
Depression status | Anxiety status | Sex | Count | Timepoint | Mean QPT | SD QPT |
---|---|---|---|---|---|---|
pos | pos | m | 12 | pre | 130 | 25 |
pos | pos | f | 8 | pre | 145 | 19.8 |
pos | pos | m | 12 | post | 103 | 24.9 |
pos | pos | f | 8 | post | 142 | 40.1 |
pos | neg | m | 9 | pre | 122 | 28.1 |
pos | neg | f | 7 | pre | 137 | 27.0 |
pos | neg | m | 9 | post | 109 | 26.0 |
pos | neg | f | 7 | post | 140 | 39.5 |
neg | pos | m | 13 | pre | 124 | 20.7 |
neg | pos | f | 15 | pre | 119 | 18.3 |
neg | pos | m | 13 | post | 100 | 24.2 |
neg | pos | f | 15 | post | 110 | 20.5 |
neg | neg | m | 10 | pre | 107 | 15.8 |
neg | neg | f | 10 | pre | 110 | 13.9 |
neg | neg | m | 10 | post | 88 | 21.8 |
neg | neg | f | 10 | post | 95 | 20.6 |
In this case, each column is measuring one thing and one thing only, and each row represents a single observation (with the unit of observation being a group homogeneous on sex and 2 diagnostic fields at a single time). In this table, computation will be much easier. Columns can be easily compared on the entire group or only on selected observations (for example, we could select only the rows with males, or only the pretest rows, or only the rows with negative anxiety). Getting data into a tidy format is always helpful, and sometimes necessary, before additional discovery and analysis can take place.
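As a rough sketch of what that looks like in practice, the tidy table can be typed into a tibble and queried with dplyr. The object and column names here (`qpt`, `mean_qpt`, `sd_qpt`, and so on) are my own choices for illustration, and the size-weighted change at the end is just one simple way to summarise group means, not a proper statistical test.

```r
library(dplyr)
library(tidyr)

# The tidy table from above, typed in by hand (column names are illustrative)
qpt <- tibble::tribble(
  ~depression, ~anxiety, ~sex, ~n, ~timepoint, ~mean_qpt, ~sd_qpt,
  "pos", "pos", "m", 12, "pre",  130, 25.0,
  "pos", "pos", "f",  8, "pre",  145, 19.8,
  "pos", "pos", "m", 12, "post", 103, 24.9,
  "pos", "pos", "f",  8, "post", 142, 40.1,
  "pos", "neg", "m",  9, "pre",  122, 28.1,
  "pos", "neg", "f",  7, "pre",  137, 27.0,
  "pos", "neg", "m",  9, "post", 109, 26.0,
  "pos", "neg", "f",  7, "post", 140, 39.5,
  "neg", "pos", "m", 13, "pre",  124, 20.7,
  "neg", "pos", "f", 15, "pre",  119, 18.3,
  "neg", "pos", "m", 13, "post", 100, 24.2,
  "neg", "pos", "f", 15, "post", 110, 20.5,
  "neg", "neg", "m", 10, "pre",  107, 15.8,
  "neg", "neg", "f", 10, "pre",  110, 13.9,
  "neg", "neg", "m", 10, "post",  88, 21.8,
  "neg", "neg", "f", 10, "post",  95, 20.6
)

# Select only the pre-treatment rows for males
males_pre <- qpt %>%
  filter(sex == "m", timepoint == "pre")

# How many females with anxiety did we see?
# Each group appears at two timepoints, so count it at one timepoint only.
females_anxiety <- qpt %>%
  filter(sex == "f", anxiety == "pos", timepoint == "pre") %>%
  summarise(total = sum(n)) %>%
  pull(total)

# Pre- to post-treatment change in mean QPT per sex, weighted by group size
change_by_sex <- qpt %>%
  select(depression, anxiety, sex, n, timepoint, mean_qpt) %>%
  pivot_wider(names_from = timepoint, values_from = mean_qpt) %>%
  mutate(change = post - pre) %>%
  group_by(sex) %>%
  summarise(weighted_change = weighted.mean(change, n))
```

Each question from the earlier list becomes a short filter-and-summarise chain on the same table, with no special handling of headers or row labels.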
Non-tidy data has a place – namely, outside of computation. It makes sense that you might want to display complex data in a compact way for human consumption. For use within R, tidy is the way to go.
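If the compact layout is what you are handed, tidyr can do most of the reshaping. The sketch below assumes the spreadsheet has already been read in with one column per measure, timepoint, and sex, named in a `measure_timepoint_sex` pattern. That naming scheme, the shortened group labels, and the object names are all assumptions of mine for the example. The counts are left out because they live inside the row labels as free text and would need separate parsing.

```r
library(dplyr)
library(tidyr)

# The compact table, with hypothetical column names of the form measure_timepoint_sex
# and shortened group labels
wide <- tibble::tribble(
  ~group, ~mean_pre_m, ~mean_pre_f, ~sd_pre_m, ~sd_pre_f,
  ~mean_post_m, ~mean_post_f, ~sd_post_m, ~sd_post_f,
  "Depression alone",        122, 137, 28.1, 27.0, 109, 140, 26.0, 39.5,
  "Depression with Anxiety", 130, 145, 25.0, 19.8, 103, 142, 24.9, 40.1,
  "Controls",                107, 110, 15.8, 13.9,  88,  95, 21.8, 20.6,
  "Anxiety alone",           124, 119, 20.7, 18.3, 100, 110, 24.2, 20.5
)

tidy <- wide %>%
  # Split each column name into three parts; ".value" keeps mean and sd as columns
  pivot_longer(
    cols = -group,
    names_to = c(".value", "timepoint", "sex"),
    names_sep = "_"
  ) %>%
  # Separate the combined diagnosis label into two variables
  mutate(
    depression = if_else(grepl("Depression", group), "pos", "neg"),
    anxiety    = if_else(grepl("Anxiety", group), "pos", "neg")
  ) %>%
  select(depression, anxiety, sex, timepoint, mean_qpt = mean, sd_qpt = sd)
```

The result has 16 rows, one per group, sex, and timepoint, matching the tidy table above apart from the missing count column.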
Once you have tidy data, it’s much easier to use the “tidyverse” suite of tools to do data reshaping, combining, etc. Here’s a breakdown of some of the tools that form this ecosystem:
- broom, a package which takes the messy output of frequently used but un-tidy functions in R, such as lm and t.test, and turns them into tidy data frames
- DBI, a database abstraction layer that handles the underlying differences between database types
- dplyr, a library for working with data frames more effectively, especially selecting rows and columns for analysis
- forcats, an anagram of ‘factors’, and a library for working with factor (categorical) variables
- ggplot2, a graphical display library for making complex, attractive data visualizations
- haven, which lets you import and export SPSS, Stata, and SAS files
- httr, which allows you to work more easily with web APIs and other data requested over HTTP
- hms, which stands for “hours, minutes, seconds” and handles time difference data
- jsonlite, a library for handling JSON data
- lubridate, which helps you work with dates and date-times
- magrittr, the “pipe” (%>%)
- modelr, which provides helpful tools for modeling (overlaps somewhat with caret)
- purrr, which supports a functional programming style (and offers consistent alternatives to the *apply group of functions)
- readr, for importing tabular data
- readxl, for working with Excel files
- stringr, a library for working with strings easily
- tibble, which corrects some problems with the data frame structure (good for complex, nested data or data with very similar column names)
- rvest, a web-scraping tool for parsing website data
- tidyr, a library that helps reshape data into a tidy format
- xml2, a library for dealing with XML data (usually from the web)
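To give a feel for how a couple of these pieces fit together, here is a small sketch comparing base R's `sapply` with purrr's `map_dbl`, with the magrittr pipe passing the data along. The list `values` is made up for the example.

```r
library(purrr)
library(magrittr)

# A made-up list of numeric vectors
values <- list(a = 1:3, b = 4:6)

# Base R: sapply() picks the return type based on what the function returns
base_means <- sapply(values, mean)

# purrr: map_dbl() always returns a double vector, or fails if it cannot
purrr_means <- values %>% map_dbl(mean)
```

Both calls give the same named numeric vector here. The difference shows up when a function returns something unexpected: `map_dbl` stops with an error, while `sapply` quietly changes the shape or type of its result.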
You may notice that a lot of these tools overlap with base R tools that you already use (or with packages from other authors). You can certainly mix and match, but the tidyverse has two advantages: the packages share the same assumptions and conventions (less syntax to learn), and they try to be explicit rather than guess at what you meant, which prevents some silent mistakes with your data. As an example, a tibble (which overlaps a lot with data frames generally) won’t make “best guess column name” assumptions: with a plain data frame, typing mydata$a returns mydata$abc when you don’t have a column ‘a’ but do have a column ‘abc’, whereas a tibble returns NULL and warns you. If you’re new to R, it’s a good idea to start with these packages. If you already have a bunch of working scripts, there’s not necessarily a reason to go back and rewrite them using these functions. The tidyverse is one possible toolset among a nearly endless number of perfectly good package combinations in the R ecosystem.
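The column-name guessing is easy to see for yourself. This is a minimal sketch with a made-up one-column table; `df` and `tb` are just illustrative names.

```r
library(tibble)

# The same data as a base data frame and as a tibble
df <- data.frame(abc = 1:3)
tb <- tibble(abc = 1:3)

# Data frame: `$` partially matches "a" to the column "abc"
from_df <- df$a

# Tibble: no partial matching, so this is NULL and a warning is raised
from_tb <- tb$a
```

Here `from_df` holds the values of `abc`, while `from_tb` is `NULL`, which makes the typo visible instead of letting it pass silently.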