Writing Functions in R
Why write functions?
If you’re reading this, it’s probably because you’re not a coder by trade – you are a scientist, clinician, or analyst who works with data to find answers to questions about health, biology, behavior, etc. Writing code may be one more tool in your arsenal, but you wouldn’t define yourself as a “programmer”. Why, then, should you work to make functions? Why not just do analysis from start to finish, in a specialized script for each dataset? That’s already a huge gain in reproducibility and transparency, compared, say, to doing analysis in Excel and doing a lot of copy-paste. And what even is a function?
Let’s start with what a function is. In math, we say that a function is a mathematical relationship such that for any given inputs (often in terms of x), we will reliably get the same output (usually y) every time the same inputs go in. So, y=sin(x) is a function, because if you plug in the same x, you’ll get the same y, every time. In economics or politics we might say “thing A can be thought of as a function of thing B”. Thing B will reliably predict thing A (less well than in pure math, but you get the idea). The same reasoning can be applied to code. A function in code is a set of instructions, typically named something short and descriptive for ease of use, that will always return the same output for any given input. In R, a function looks like this:
addWoo <- function(origText) {
  newText <- paste0(origText, "Woo")
  newText
}
The first line has the name of the function (addWoo), the keyword “function”, and, inside the parentheses, the names of the parameters the function needs (here, just origText).
The curly braces mark where the function code starts and ends. They aren’t necessary if your function is just one line of code (unlikely). The value of the last expression in the function (here, newText) is what the function returns.
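As a small illustration of the point about braces, here is a sketch of a one-line function written without them. The name addWooShort is hypothetical, chosen only for this example; it is meant to behave the same way as addWoo.

```r
# Hypothetical one-line variant of addWoo: when the body is a single
# expression, the curly braces can be left out.
addWooShort <- function(origText) paste0(origText, "Woo")

addWooShort("ABC123")
# [1] "ABC123Woo"
```

Both versions do the same thing. The version with braces, addWoo, is the one the rest of this section refers to.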
What does this function do?
Try running the code above. Nothing visible will happen, because all you’ve done is define a function. But if you then type
addWoo("ABC123")
you should see the output of your function, the text ABC123Woo. This is a trivial example, to be sure, but it shows you how to write a function, if you’ve never done so before.
Why would you write a function? Mainly to segment your code (if you have hundreds of lines of data cleaning, reshaping, and analysis, it might help to make your code more modular) and to prevent repetition. In the programming world, a basic principle is “DRY”: don’t repeat yourself. If you are repeating largely similar code again and again, it probably makes sense to use a function.
Again, let’s consider an example. Let’s say you have a number of data frames, each of which contains data from a single site of a multi-site study. For each data frame, you want to do the following:
- append the site name to the ID field, to disambiguate IDs, since multiple sites could use the same internal ID for different people
- recode the values in the “Sex” column from 0 and 1 to Male and Female
- remove rows that are missing Sex or ID
Here’s some code that will do that. First, let’s set up a fake data frame:
site1 <- data.frame(ID = c(1, 2, 3, NA),
                    Sex = c(1, NA, 0, 0),
                    Score = c(120, 100, 95, 118))
And now, let’s do my data preparation for Site 1:
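# Start from a copy of the raw data
site1_cleaned <- site1
# Keep only rows that have both an ID and a Sex value
site1_cleaned <- site1_cleaned[which(!is.na(site1_cleaned$ID) &
                                     !is.na(site1_cleaned$Sex)), ]
# Append the site name to each ID
site1_cleaned$ID <- paste0(site1_cleaned$ID, "_site1")
# Recode Sex: 0 becomes "Male", 1 becomes "Female"
site1_cleaned$Sex[which(site1_cleaned$Sex == 0)] <- "Male"
site1_cleaned$Sex[which(site1_cleaned$Sex == 1)] <- "Female"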
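To see what those steps produced, it can help to check the result rather than trust it. The sketch below repeats the Site 1 setup and cleaning so that it runs on its own, then checks the cleaned data frame. The stopifnot() checks are an addition for illustration, not part of the original workflow.

```r
# Fake Site 1 data, as defined above
site1 <- data.frame(ID = c(1, 2, 3, NA),
                    Sex = c(1, NA, 0, 0),
                    Score = c(120, 100, 95, 118))

# Same cleaning steps as above
site1_cleaned <- site1
site1_cleaned <- site1_cleaned[which(!is.na(site1_cleaned$ID) &
                                     !is.na(site1_cleaned$Sex)), ]
site1_cleaned$ID <- paste0(site1_cleaned$ID, "_site1")
site1_cleaned$Sex[which(site1_cleaned$Sex == 0)] <- "Male"
site1_cleaned$Sex[which(site1_cleaned$Sex == 1)] <- "Female"

# Only rows with both an ID and a Sex value should survive (IDs 1 and 3)
stopifnot(nrow(site1_cleaned) == 2)
stopifnot(identical(site1_cleaned$ID, c("1_site1", "3_site1")))
stopifnot(identical(site1_cleaned$Sex, c("Female", "Male")))

# Print the cleaned data frame to look it over
site1_cleaned
```

Note that Sex quietly changes from a numeric column to a character column on the first replacement. The second comparison still works because R converts the number 1 to the text "1" before comparing.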
Not too hard, but what if I have to do that for three sites? I can copy and paste, manually changing “site1” to “site2” or “site3” every time. Hopefully I get all thirteen mentions, and don’t accidentally miss one! This is what that would look like: long, eye-glazingly boring, and prone to mistakes. In fact, there’s one easy-to-miss mistake in there:
site1_cleaned <- site1
site1_cleaned <- site1_cleaned[which(!is.na(site1_cleaned$ID) &
!is.na(site1_cleaned$Sex)),]
site1_cleaned$ID <- paste0(site1_cleaned$ID, "_site1")
site1_cleaned$Sex[which(site1_cleaned$Sex == 0)] <- "Male"
site1_cleaned$Sex[which(site1_cleaned$Sex == 1)] <- "Female"
site2_cleaned <- site2
site2_cleaned <- site2_cleaned[which(!is.na(site2_cleaned$ID) &
!is.na(site1_cleaned$Sex)),]
site2_cleaned$ID <- paste0(site2_cleaned$ID, "_site2")
site2_cleaned$Sex[which(site2_cleaned$Sex == 0)] <- "Male"
site2_cleaned$Sex[which(site2_cleaned$Sex == 1)] <- "Female"
site3_cleaned <- site3
site3_cleaned <- site3_cleaned[which(!is.na(site3_cleaned$ID) &
!is.na(site3_cleaned$Sex)),]
site3_cleaned$ID <- paste0(site3_cleaned$ID, "_site3")
site3_cleaned$Sex[which(site3_cleaned$Sex == 0)] <- "Male"
site3_cleaned$Sex[which(site3_cleaned$Sex == 1)] <- "Female"
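For readers who want to check their answer: the mistake is in the Site 2 block, where the second is.na() test still refers to site1_cleaned$Sex. The sketch below shows one way this can go wrong silently. The site2 data frame here is made up for illustration (the original never defines it), and the Site 1 cleaning is condensed to the one step that matters.

```r
# Site 1 data, with incomplete rows removed (condensed from the steps above)
site1 <- data.frame(ID = c(1, 2, 3, NA),
                    Sex = c(1, NA, 0, 0),
                    Score = c(120, 100, 95, 118))
site1_cleaned <- site1[which(!is.na(site1$ID) & !is.na(site1$Sex)), ]

# Hypothetical Site 2 data (an assumption for this example)
site2 <- data.frame(ID = c(10, 11, 12, 13),
                    Sex = c(0, NA, 1, 1),
                    Score = c(101, 99, 110, 87))

# The copy-paste error: the Sex check looks at site1_cleaned, not site2_cleaned
site2_cleaned <- site2
site2_cleaned <- site2_cleaned[which(!is.na(site2_cleaned$ID) &
                                     !is.na(site1_cleaned$Sex)), ]

# The row with a missing Sex value was not removed
sum(is.na(site2_cleaned$Sex))
# [1] 1
```

Because the two vectors being combined with & have different lengths, R recycles the shorter one. Here it does so without any warning, since 4 is a multiple of 2, so the code runs and the bad row stays in.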
What if I then realize that the code needs to be amended? Maybe a new step has to be added, or I got the Male / Female coding wrong. I have to make the same change in each of the three code blocks, adapted to each site’s variable names. If I’m lucky, I get it right in all three, with no copy-paste errors and no lines inserted in the wrong place.
It’s easy to see, even with this trivial code, where errors could occur. For example, while converting the Site 1 block for Site 2, I could get overeager with find-and-replace, change every 1 to a 2, and realize only way down the line that I have no females, because I also changed the coding for Female from 1 to 2. If you’re doing something 3 or more times (your tolerance for repeated code may be different than mine), it makes sense to convert that piece of code to a function.
A good function will:
- have a name that is self-documenting. So, “cleanData” instead of “function1” or “myFunction”
- be “pure”: it doesn’t affect anything outside of itself, it only produces output without messing with other objects in the R environment
- do just one thing (with your understanding of “one thing” being flexible).
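The “pure” point can be easier to see side by side. In this sketch, the function names, the nCleaned counter, and toyData are all hypothetical. The first function changes an object outside itself using R’s <<- operator; the second only returns a value and leaves it to the caller to decide what to do with it.

```r
# A counter that lives outside the functions
nCleaned <- 0

# Not pure: reaches outside the function and changes nCleaned as a side effect
countRowsImpure <- function(siteData) {
  nCleaned <<- nCleaned + nrow(siteData)
  invisible(NULL)
}

# Pure: depends only on its input and returns its result
countRows <- function(siteData) {
  nrow(siteData)
}

toyData <- data.frame(ID = c(1, 2, 3))

countRowsImpure(toyData)
countRowsImpure(toyData)
nCleaned            # 6: the value depends on how many times the function ran

countRows(toyData)  # 3, however many times it is called
```

With the impure version, running the same line twice changes the state of your session, which is the kind of surprise that makes an analysis hard to reproduce.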
So let’s consider the “one thing” (for now) to be cleaning our data frame, and combine all three tasks into one function. Our function and the subsequent calls to our function could look like this:
cleanSiteData <- function(siteData, sitename) {
  # Remove rows that are missing ID or Sex
  siteData <- siteData[which(!is.na(siteData$ID) &
                             !is.na(siteData$Sex)), ]
  # Append the site name (passed with its leading underscore) to each ID
  siteData$ID <- paste0(siteData$ID, sitename)
  # Recode Sex: 0 becomes "Male", 1 becomes "Female"
  siteData$Sex[which(siteData$Sex == 0)] <- "Male"
  siteData$Sex[which(siteData$Sex == 1)] <- "Female"
  # Return the cleaned data frame
  siteData
}
site1_cleaned <- cleanSiteData(site1, "_site1")
site2_cleaned <- cleanSiteData(site2, "_site2")
site3_cleaned <- cleanSiteData(site3, "_site3")
There’s still cut and paste involved, but it’s not as terrible. And it becomes a lot easier to make changes to the function than it is to code that’s located in three different places!
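If even those three near-identical lines feel like too much copying, one option is to put the site data frames in a named list and apply the function to each one. This is a sketch, not the only way to do it: site2 and site3 are made-up data (the original never defines them), the list names are my own choice, and Map() is a base R function.

```r
cleanSiteData <- function(siteData, sitename) {
  siteData <- siteData[which(!is.na(siteData$ID) &
                             !is.na(siteData$Sex)), ]
  siteData$ID <- paste0(siteData$ID, sitename)
  siteData$Sex[which(siteData$Sex == 0)] <- "Male"
  siteData$Sex[which(siteData$Sex == 1)] <- "Female"
  siteData
}

site1 <- data.frame(ID = c(1, 2, 3, NA), Sex = c(1, NA, 0, 0),
                    Score = c(120, 100, 95, 118))
# Hypothetical data for the other two sites (assumptions for this example)
site2 <- data.frame(ID = c(1, 2, 3), Sex = c(0, 1, NA),
                    Score = c(101, 99, 110))
site3 <- data.frame(ID = c(1, NA, 3), Sex = c(1, 1, 0),
                    Score = c(87, 93, 105))

# A named list: the names double as the site labels
sites <- list(site1 = site1, site2 = site2, site3 = site3)

# Call cleanSiteData once per site, pairing each data frame with its suffix
cleaned <- Map(cleanSiteData, sites, paste0("_", names(sites)))

cleaned$site2$ID
# [1] "1_site2" "2_site2"
```

The result is a list with one cleaned data frame per site, so adding a fourth site means adding one entry to the list rather than another copied line.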
If my rules were more complex, with more rules for removing invalid data or for recoding data (this was, after all, a very simplistic example), I could split out my function even more, and even call one function from another. My definition of “just one thing” has changed, so I’ll have three functions, the last of which calls the other two, like this:
removeInvalidData <- function(siteData) {
  siteData <- siteData[which(!is.na(siteData$ID) &
                             !is.na(siteData$Sex)), ]
  # ...
  # Other validity checks here ...
  # ...
  siteData
}
recodeData <- function(siteData) {
  siteData$Sex[which(siteData$Sex == 0)] <- "Male"
  siteData$Sex[which(siteData$Sex == 1)] <- "Female"
  # ...
  # Other recoding here ...
  # ...
  siteData
}
cleanSiteData <- function(siteData, sitename) {
  siteData <- removeInvalidData(siteData)
  siteData$ID <- paste0(siteData$ID, sitename)
  siteData <- recodeData(siteData)
  siteData
}
cleanSiteData(site1, "_site1")
When your code is arranged this way, with well-named functions carrying out the steps that need to be executed multiple times, you have code that you can understand more easily, even months later, reuse for new analyses, and change or correct in one place instead of several.