When R Gets Too Helpful
Keep in mind that R is decades old (by way of S), and it has matured over that time as users have learned its pitfalls. To maintain backwards compatibility, however, many suboptimal behaviors still persist. A major example is when R (and base R especially) tries to help too much and guesses at what you want.
Let’s look at some examples of when R helps too much.
Guessing Column Names
Base R will try to help you by guessing at what you meant at certain times. This can be helpful, but can also be misleading and error-prone. Let’s say we have two data frames for our subjects, one of which shows their access to affordable healthcare and another that has scores including a flexibility assessment and IQ:
ins <- data.frame(subjID = c("Sub-12345", "Sub-ABCDE"),
                  insured = c(TRUE, TRUE),
                  flexSpendingCap = c(1000, 2500))
scores <- data.frame(subjID = c("Sub-12345", "Sub-ABCDE"),
                     flex = c(100, 80),
                     iq = c(110, 95))
What if I want to get the average of our subjects’ flexibility scores, but I copy-paste wrong and use the wrong data frame?
> mean(ins$flex)
[1] 1750
R guesses that “you must have meant flexSpendingCap”, and the command goes through with no errors or warnings. This is clearly not what you wanted!
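What's happening is partial matching: with $, a name that matches the start of exactly one column silently returns that column. The double bracket [[ requires an exact match by default, and base R can also warn you about partial $ matches if you opt in. A quick sketch, using the ins data frame from above:

```r
ins$flex                  # partial match: silently returns flexSpendingCap
ins[["flex"]]             # exact match required: returns NULL instead

# base R can warn on partial $ matches if you ask:
options(warnPartialMatchDollar = TRUE)
mean(ins$flex)            # same wrong answer, but now with a warning attached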
The easiest way to fix this is to use the tidyverse “tibble”, which is a data frame with some added protections built in. One of them is to avoid “over-helping” by returning the best match when a subset doesn’t actually match anything. To create tibbles directly, use tibble() (you may also see the older data_frame(), which is now deprecated):

library(tibble)
ins <- tibble(subjID = c("Sub-12345", "Sub-ABCDE"),
              insured = c(TRUE, TRUE),
              flexSpendingCap = c(1000, 2500))
scores <- tibble(subjID = c("Sub-12345", "Sub-ABCDE"),
                 flex = c(100, 80),
                 iq = c(110, 95))
Now, if we try the same command, we’ll get a helpful warning instead of a silent wrong answer:
> mean(ins$flex)
[1] NA
Warning messages:
1: Unknown column 'flex'
2: In mean.default(ins$flex) : argument is not numeric or logical: returning NA
Guessing Factor Variables
Another problem in the “R guessing what you mean” world is R’s tendency to turn strings into factor (categorical) variables. Sometimes that’s what you want (like when you have true categorical data like “good”, “fair”, “poor”), but often it’s not (and you might not realize until way down in your data processing that you have a problem!).
transcript <- data.frame(subject = c("Sub-12345", "Sub-23456"),
                         phrase = c("I disagree, you're totally wrong!",
                                    "Do you think I should ask her?"))
Let’s say we wanted to figure out the length of all the phrases, in characters, to select some extra-long phrases for additional analysis.
> nchar(transcript$phrase)
Error in nchar(transcript$phrase) : 'nchar()' requires a character vector
But we put in characters! We know they’re characters. The issue is that, before R 4.0.0, data.frame() assumed (unless told otherwise) that all character strings should become factors. There are a few solutions to this:
- Use a tibble instead of a data frame – tibbles do not assume you want factors when you pass in character strings.
- Use “stringsAsFactors = FALSE” when you create or import a data frame.
- If you’ve already imported something that’s become a factor, and you want it to be a character, replace the column with as.character(df$columnName) in your data frame.
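Here’s a sketch of the last two fixes (note that as of R 4.0.0, stringsAsFactors defaults to FALSE, so recent R versions won’t bite you here):

```r
# Fix at creation time: ask for plain character strings
transcript <- data.frame(
  subject = c("Sub-12345", "Sub-23456"),
  phrase  = c("I disagree, you're totally wrong!",
              "Do you think I should ask her?"),
  stringsAsFactors = FALSE
)
nchar(transcript$phrase)   # now works: a count of characters per phrase

# Fix after the fact: convert a factor column back to character
transcript$phrase <- as.character(transcript$phrase)
```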
Guessing the Simplest Data Type
Another way R over-helps is type instability. The examples here are adapted from ones given by Hadley Wickham.
Generally, in any script or software, we want to know the type of data we’ll get back from a command. That data type should be stable – otherwise it’s really hard to keep going in your script. If sometimes you get a vector, and sometimes you get a data frame, depending on circumstances of the data, it’s really tricky to have a single way to continue your code. The problem with base R is that sometimes, it’ll try to give you the simplest data type possible, not a single consistent data type. This is meant to make things easier (after all, a vector is easier to work with than a data frame), but it makes it hard to make a reproducible data flow that will always work the same way.
Here’s an example. Let’s say I have a data frame that has an integer column, a decimal column, a time stamp column, and a factor (categorical) column. I’ll create this, with just one row:
df <- data.frame(intVar = 1L,
                 decVar = 1.5,
                 timeVar = Sys.time(),
                 factorVar = ordered(1))
Let’s use sapply to apply the class() command to each column.
If you’re not familiar with sapply, it’s a method of applying an R command to more than one element of a complex data structure like a list or data frame. It’s closely related to lapply (list apply), but simplifies the output if possible. That “if possible” is its strength and also its downfall, as it’s unpredictable. Sometimes sapply won’t be able to get the results simpler than a list (arguably, the hardest data structure to work with, as lists can be nested and can be hard to index into). Other times, sapply will be able to return a matrix or a vector. It’s hard to tell ahead of time what sapply will return!
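If you’re curious how the same sapply call can change shape, here’s a minimal sketch (the list x is just for illustration):

```r
x <- list(a = 1:3, b = 4:6)
lapply(x, sum)    # always a list, no matter what
sapply(x, sum)    # simplified to a named numeric vector here
sapply(x, range)  # simplified to a 2x2 matrix -- same input type, different shape
```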
What does applying class() to our data frame return? What if I take just a couple of columns and apply class() only to those? Let’s see! (The %>% pipe used below comes from the magrittr package.)
> df[1:2] %>% sapply(class)
   intVar    decVar
"integer" "numeric"
Our output here is a character vector.
> df[3:4] %>% sapply(class)
     timeVar   factorVar
[1,] "POSIXct" "ordered"
[2,] "POSIXt"  "factor"
Our output here is a character matrix.
> df %>% sapply(class)
$intVar
[1] "integer"

$decVar
[1] "numeric"

$timeVar
[1] "POSIXct" "POSIXt"

$factorVar
[1] "ordered" "factor"
Our output here is a list.
Three almost identical commands, three different kinds of outputs.
So, sapply is not “type stable”. There are a few solutions we could consider to the type instability of sapply. All of them have drawbacks.
- Have very predictable data, where you know ahead of time what sapply will return (not necessarily useful if you’re doing data exploration!).
- Use lapply() so that you always get a list (annoying if you hate working with lists).
- Use the purrr package (part of the tidyverse) to get predictable outputs (but you’ll get an error if the data can’t fit into the predictable output format you chose).
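As a sketch of the purrr option: map_chr() promises a character vector or an error. Because class() can return more than one value (as it does for the time stamp column), we have to pick one explicitly:

```r
library(purrr)

# map_chr() guarantees exactly one character value per column -- or it errors.
# class() can return several values (e.g. c("POSIXct", "POSIXt")),
# so take just the first one:
map_chr(df, ~ class(.x)[1])
```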
Here’s another type-unstable R command: the single bracket. We all use the [ operator to subset data, but [ can return unpredictable types! In the following example, [ might return a data frame, or it might return a vector.
Let’s say we have a data frame like the following:
df <- data.frame(meas1 = c(88.5, 99, 95.2, 87),
                 meas2 = c(98.8, 89.7, 100, 90.3),
                 qual1 = c("good", "fair", "poor", "good"))
And we want to return a subset of the data based on the column name. We’ll use a logical grep here to match on column names.
> df[ , grepl("meas", names(df))]
  meas1 meas2
1  88.5  98.8
2  99.0  89.7
3  95.2 100.0
4  87.0  90.3
This returns a data frame.
> df[ , grepl("2", names(df))]
[1]  98.8  89.7 100.0  90.3
This returns a vector.
> df[ , grepl("3", names(df))]
data frame with 0 columns and 4 rows
This returns a data frame with 0 columns (!)
There are a few easy ways to get a predictable output from [, so that you always get a data frame, even if it’s just one column.
- Use keep() from the purrr package as a subset method instead of [, like this: keep(df, is.numeric). Because keep() subsets a data frame the way you’d subset a list, it always returns a data frame.
- Use a tibble, not a data frame, to store your data. Subsetting a tibble will always return a tibble.
- Use drop=FALSE to indicate that you don’t want to drop dimensions:
> df[ , grepl("meas", names(df)), drop = FALSE]
  meas1 meas2
1  88.5  98.8
2  99.0  89.7
3  95.2 100.0
4  87.0  90.3
> df[ , grepl("2", names(df)), drop = FALSE]
  meas2
1  98.8
2  89.7
3 100.0
4  90.3
> df[ , grepl("3", names(df)), drop = FALSE]
data frame with 0 columns and 4 rows
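And as a sketch of the tibble option: convert once with as_tibble(), and single-bracket subsetting stays type stable from then on:

```r
library(tibble)

tbl <- as_tibble(df)
tbl[, grepl("2", names(tbl))]   # still a tibble, even with one column
tbl[, grepl("3", names(tbl))]   # a tibble with 0 columns and 4 rows
```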