R 4 Beginners Chapter 5 - Data Transformation
R 4 Beginners
An exploration of data science as taught in R for Data Science by Hadley Wickham and Garrett Grolemund. This blog is meant to be a helpful addition to your own journey through the book. No prior programming experience necessary!
Chapter 5: Data Transformation
In a perfect world, any data we need would come to us in exactly the way we need it: tidy, complete with all of the variables we want to analyze, and everything descriptively named. Alas, in this world such a thing rarely happens, and so we usually have to do some work to our data before we can analyze it. That’s where data transformation comes in! The package that handles most data transformation in the
dplyr. There are a variety of “verbs” in
dpylr that allow you to manipulate data in various ways.
filter() function does basically what you’d expect it to do given the name– it filters the data based on criteria that you specify, returning a new data frame with only the rows that meet those criteria. The simplest form of a filtering operation is this:
new_data <- filter(DATA, COMPARISON OPERATION)
For a “real” example, we’ll use the
flights data frame from the
nycflights13 package (so this time you’ll need to load both the
nycflights13 packages before getting started). Take a look at the
flights data frame first to see what data it contains. Once you’ve done that, you can decide what subset of data you’d like to look at. For example, let’s say we wanted to just look at flights that left New York City in July:
july_flights <- filter(flights, month == 7)
If you run this code, you’ll see that the new data frame
july_flights contains many fewer rows than
flights, and if you look at the
month column, it is always “7” (which corresponds to July). You aren’t limited to only one criterion either! Try running this code and see what happens:
nye_flights <- filter(flights, month == 12, day == 31)
This is useful because if you’re only interested in flights that left NYC on New Year’s Eve, this is an easy way to isolate those data. This is simple in theory, but it can get tricky in practice because of the strangeness that is involved in using comparison operators and logical operators.
Comparison and logical operators
In the code above, the argument where we set our filtering criteria (
month == 7) is a comparison operator, or relational operator. Comparison operators are basically asking questions like “are these two things the same?”, or “is x greater than y?”. There are several comparison operators that are important to know:
==: equal to (be sure to use the
=, which does a different thing in R and isn’t used to compare values– you’ll get an error if you try to use
!=: not equal to
<: less than
>: greater than
<=: less than or equal to
>=: greater than or equal to
There is another slightly unusual comparison operator
near(), which works like
== except that it’s a “safer” option when comparing floating point numbers (decimals) in which the computer’s non-infinite precision makes a comparison that should work not work. In R4DS they use the example of
sqrt(2)^2 == 2, which should obviously evaluate to
TRUE but instead evaluates to
FALSE, and so
near() is required to get
TRUE. Read the
near() documentation for more information.
More complex operations can be constructed using logical operators, or boolean operators:
&: AND (as in
x&y, x AND y)
|: OR (as in
x|y, x OR y)
!: NOT (as in
!x, NOT x)
nye_flights example above, the
& is already built into the function, but you could also write it like this:
nye_flights <- filter(flights, month == 12 & day == 31)
Try another operation, like
month == 4 | month == 5 and see what happens! Just a note, if you’re using anything that isn’t
&, you’ll need to be explicit. And read up more on logical operators, they can be tricky (it’s also a good idea to look at your data when you do operations like this to make sure you’re getting what you think you’re getting).
One more thing to note is that almost any operation you try to do in which missing values (
NA in R parlance) are involved will result in an unknown value. You can use
is.na() to look for missing values (for example, if
x <- NA, the operation
x == NA will evaluate to
is.na(x) evaluates to
arrange() function reorders the rows of a data frame using columns that you choose (the first column you specify will get priority, then the second, and so on). For example, the
flights data frame is already ordered in ascending order of
month, followed by
day, but say you wanted to look at the most delayed flights (in terms of arrival time) for each month:
arrange(flights, month, desc(arr_delay))
Since each column that you specify after the first will break ties in the first column (in this case
month), you end up with the flights in each month reordered in descending order from the most delayed arrival to the least. If instead you wanted the flights that had the smallest arrival delay first, you’d omit the
desc() function for
arr_delay (the default is to sort in ascending order, as
month is in this example).
select() function is a way to narrow down your data frame to just the columns that you want, which is especially useful if your data set has dozens or even hundreds of columns but you’re only interested in a few. The first argument must be the data frame you’re taking the columns from, followed by the columns you want from that data frame (they will appear in the new data frame in the order you list them). You can list all of the columns explicitly (e.g.
select(flights, year, month, day, dep_time)), use subsetting notation (
select(flights, year:dep_time)), or use the minus sign to give you all of the columns except the ones you specify (so
select(flights, -(carrier:dest) would give you every column except
dest; you can think of it as a way to drop columns).
And just as a reminder, it’s good practice to be explicit whenever possible. Subsetting notation can save a lot of time (especially if you want a lot of columns) but it’s possible that “future” you will come back to your code in 6 months and not know exactly what “past” you was trying to accomplish! Worse, if you’re using data from a source that isn’t yours, the structure of the data set could change without you realizing it, and your subsetting will no longer be doing what you think it’s doing. If you need to use subsetting, do yourself a favor and 1) make good notes about exactly what you’re trying to do (I’ll talk more about how to do that in the next chapter), and 2) take a peek at the data after your manipulations, so you know what you have.
select() is often used with a variety of matching functions:
starts_with(): Lets you match a string to the beginning of the column names. For example,
select(flights, starts_with("dep"))would return a data frame with only the columns
dep_delay. Not super useful, but you get the idea.
ends_with(): Same as
starts_with(), except it matches the end instead of the beginning.
contains(): Looks for the string anywhere in the columns names. For example,
select(flights, contains("time"))returns a data frame with the columns
num_range(): Looks for column names with a range of numbers. The
flightsdata set doesn’t have any columns like this, but if for example you had columns names like
layover_3you could get these columns with
select(flights, num_range("layover_", 1:3))
matches(): Looks for column names that match a regular expression. Don’t worry about this for now, we’ll come back to regular expressions later (they’re powerful, but tricky).
select()function can also be used to reorder columns, and
everything()is useful in this case (since
select()only keeps the columns you specify). If you wanted all of the columns in
flightsbut instead wanted the flight number, the origin, and the destination at the beginning, you would run
select(flights, flight, origin, dest, everything()).
mutate() is the function used to add new columns to your data frame. This is most useful when you want to calculate new values based on the data in the data frame, like if your data set gives you columns for distance and time for a cohort of marathon runners, but you need speed for your analysis. You can use
mutate() to calculate a
speed column from the data in the
time columns. Going back to our
flights data set, one thing that might make the data difficult to work with are the time columns (
sched_arr_time), where the time is in a 24 hour system. You can’t perform mathematical operations on time values in this form generally, so if we want to do calculations with them we’ll need to change them to a continuous format, such as “minutes since midnight”. Since the times are in the format “hhmm” (meaning, 10:00 am would be shown as 1000, and 2:30 pm would be 1430), you can divide by 100 to get the number of hours since midnight, then multiply by 60 to get those hours in minutes, then add the remainder (the minutes) from the first operation to the result from the second. For example:
- 4:45 pm would be represented as 1645 in the
- 1645/100 is 16 hours with a remainder of 45 minutes.
- There are 960 minutes in 16 hours (16*60).
- Therefore, the total number of minutes since midnight would be 960+45=1005.
The great thing about
mutate() is that once we know the operations we need, we can compute an entirely new column using the values of an existing column (or another data source, if there is a way to link the two data frames, but that’s a problem for another day). So let’s go back and create four new columns, changing the times I listed above to minutes since midnight. Since
mutate() (like all of the transformation functions we’ve talked about) returns a new data frame, we can assign it to a new variable. A couple of arithmetic operators that are particularly useful here are
%/%, which returns the integer result of a division operation, and
%% which returns the remainder (this is called the modulo operator).
The last thing we need to take into account is midnight. In our new columns, since we want the time to be measured as minutes from midnight, midnight itself should clearly be
0. In fact, in a 24 hour clock midnight is represented as 2400, which according to our method above would calculate midnight to be 1440 minutes from midnight. We can correct for this by using the operation
x %% 1440, where
x represents our previous calculation to get the hours and minutes. This is because when both
y are positive integers and
x < y, then
x %% y = x. If we say that
x is time in minutes from midnight, then by definition all values for
x will be less than 1440. Therefore,
x %% 1440 = x in every case except at midnight. At midnight x = 1440, and
1440 %% 1440 = 0.
So, given all of those considerations, we can get our four new columns like this:
flight_mins <- mutate(flights, dep_mins = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440, sched_dep_mins = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440, arr_mins = (arr_time %/% 100 * 60 + arr_time %% 100) %% 1440, sched_arr_mins = (sched_arr_time %/% 100 * 60 + sched_arr_time %% 100) %% 1440)
I know that was a lot! I include it because it was tricky for me to figure out (the above example is actually exercise 1 in section 5.5.2), but it also includes a lot of good info about using
mutate(). Play around with some other arithmetic operators, see what you come up with. There are the usual suspects like addition, subtraction, multiplication, etc. but there are also logs (useful if your data span many orders of magnitude, and are therefore hard to work with or visualize), ranking (such as
min_rank()), and offsets (such as
lag()). Use the “help” tab in RStudio to see what all of these operators do.
And don’t forget about the order of operations when you’re doing your arithmetic!
summarise(), as the name suggests, give you a summary of the data that you pass into it, using a calculation such as
median(), standard deviation (
n()), etc. For example, try running this in your console:
summarise(flights, ave_arr_delay = mean(arr_delay, na.rm = TRUE))
You should see a table with a single value, the mean arrival delay (the
na.rm argument is telling the function to ignore missing values– it is
FALSE by default, but trying to calculate the mean with missing values will return an unknown value).
This is not super useful on its own, but combined with
group_by() it becomes quite powerful, because it allows you to do group-wise analysis. Say we want to compare the average arrival delay by carrier. Try running this in your console:
by_carrier <- group_by(flights, carrier) summarise(by_carrier, delay = mean(arr_delay, na.rm = TRUE))
You can add more summary functions as additional arguments if you want; they’ll be added as additional columns in the summary data frame. Some of the most useful summary functions are listed in the help for
When performing summary statistics on a data set, it is important to be aware that the number of observations can affect how informative your conclusions are. This is because small data sets have more variation, and so with larger sample sizes, the variation in the summary calculation (like mean) will decrease. Actually, there are several factors that go into choosing an appropriate sample size for a study, but for initial data transformations it’s just important to look at the sample size before drawing conclusions based on a summary statistic.
There are a few functions that return the number of observations (or count) that work in slightly different ways:
n(): The simplest function– it takes no argument, and just returns the group size (this is often used within a
sum(!is.na()): Gives the group count, excluding missing values.
n_distinct(): Counts distinct (unique) values.
count(): Basically a way to combine
n()in one step, if all you want is a group-wise count.
Putting It All Together
As you might imagine, if you have a large data set you might need to perform many transformations to get your data into the form that you need. The way we’ve been writing code above, you would need to assign each new data frame to an object at each step, even though you won’t need those objects ever again! Seems like a waste, right? Luckily, there is a solution for this! It’s called the pipe.
The pipe comes from a package called
magrittr, which is loaded with the
tidyverse. The pipe is great because 1) you don’t have to come up with a bunch of names for stuff you won’t need, and 2) it makes your code more readable. Let’s try an example. Let’s say I want to look at the airline carriers that are “best” at making up time in flight after a delayed departure:
time_gain_by_carrier <- mutate(flights, time_gain = dep_delay - arr_delay) %>% group_by(carrier) %>% summarise(mean(time_gain, na.rm = TRUE))
That’s two names you don’t have to think of! The output of each step in the pipe goes straight into the next, so that each intermediate step doesn’t need to be stored in its own object. The pipe is very powerful and if you continue to work with data in R you will use it a lot.
That’s a lot of info! Data transformation is a big topic. If you need your data in a certain form and you’re not sure how to make it happen, start with your search engine of choice– if you have a question, it’s likely others have had the same one. If you can’t find answers there, ask your community! Send us an email here at Arcus Education, or join your local R User Group (if you’re part of the CHOP/UPenn community, go here to sign up for the CHOP R User Group). The answers are out there!
In the next chapter, we’ll take a bit of a detour and talk about R scripts, R Markdown, and reproducible programming!