Literate Statistical Programming

Literate statistical programming is just that. It’s:

  • literate – It uses natural language (like English) in a way that’s appropriate to the context.
  • statistical – It’s written in a statistical language, or in a general-purpose language with strong statistical and analytical tools (like R or Python).
  • programming – It consists of code that runs on a computer to do things to data and come up with some results.

The philosophy behind literate statistical programming is that explanations, descriptions, and scientific writing deserve to be at the center of any statistical code. That’s why literate is the first, most important term. You can write code that does a lot of fancy calculations, but if there’s no scientific rationale or explanation accompanying the code, it will age badly and won’t be very reproducible. Or you could write code and put in comments, but that makes the code seem like the most important thing, with the scientific reasoning optional and secondary. Literate statistical programming makes the words we use to describe code the central focus of a script, with the code itself playing a secondary, supporting role.

(We won’t even consider an additional possibility: scientific analysis that isn’t made into code at all, but instead consists of a series of operations done, say, in Excel, with a how-to list that may or may not be sufficiently descriptive or up to date, and that may not even travel with the data itself.)

This can seem abstract. Let’s consider three ways to analyze some data using code (here, I’m using R). First, with no “literate” components at all:

data("Theoph")
head(Theoph)
theophModel <- lm(conc ~ .-Subject, Theoph)
library(broom)
tidy(theophModel)

You probably won’t understand what we’re doing here unless you’ve recently done something quite similar yourself. This is valid code, and you could run it and glean what we’re doing, but the code as it appears here doesn’t give you any outputs or any hints, and although it’s only five lines long, it feels pretty dense.
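By the way, if you want to run these examples yourself, note that broom isn’t part of base R, so (assuming you don’t already have it installed) you’ll need a one-time install first:

install.packages("broom")

After that, library(broom) works as shown in the examples.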

Now let’s consider improving this with a traditional, old-school commenting system.

# Get data about the pharmacokinetics of theophylline
data("Theoph")

# take a peek at the data
head(Theoph)

# Create a linear model where concentration is considered a 
# function of the other variables except subject identifier
theophModel <- lm(conc ~ .-Subject, Theoph)

# load library we need
library(broom)

# show linear model output in a data frame
tidy(theophModel)

This is a vast improvement! It gives context, explains things line by line, and makes it much easier to grasp what’s happening. However, adding comments like this can be annoying – you have to handle line breaks yourself, so adding text in the middle is a pain. Comments can be hard to read, and we can’t add any special formatting, like headers, bulleted lists, links, and so forth. Additionally, we still don’t have outputs that help the user of the code understand what’s happening.

What about a so-called “literate” approach like using R Markdown? R Markdown allows you to put your natural language front and center, without having to use special commenting symbols, and it also allows you to put in special formatting, using standard markdown. Code is what’s set apart by special symbols, which allows you to write free-flowing text without worrying about line breaks, etc.

The best part about R Markdown is that it allows you to create a document that includes not just the code and your scientific writing, but the output as well! First, let’s look at the R Markdown script, and then we’ll look at the document it generates.

Here’s what the R Markdown looks like. Yep, it’s sort of an ugly duckling, with lots of hashes and backticks and special options at the top. But once you get the hang of it, it’s not hard to do.

---
title: "Sample Literate Statistical Programming"
author: "Ima Data-Guru"
date: "1/24/2018"
output: html_document
---

## Load a library we need
```{r message=FALSE, warning=FALSE}
library(broom)
```

## Get Data 

This is a dataset that ships with R (as part of R's datasets package) and contains information about the pharmacokinetics of theophylline. Let's load it and look at the top few rows of data:

```{r}
data("Theoph")
head(Theoph)
```

## Create a linear model 

We want to determine what factors influence drug concentration, but we will leave the subject identifier out of our model. The `lm()` function gives us ugly output that's hard to parse computationally, but broom's `tidy()` function lets us see the results in a data frame.

```{r}
theophModel <- lm(conc ~ .-Subject, Theoph)
tidy(theophModel)
```

This is the result of “knitting” (turning the R Markdown into a rendered document). I think you’ll find it much more intuitive and much easier to share with others with a minimum of hand-holding:

Load a library we need

library(broom)

Get Data

This is a dataset that ships with R (as part of R’s datasets package) and contains information about the pharmacokinetics of theophylline. Let’s load it and look at the top few rows of data:

data("Theoph")
head(Theoph)
## Grouped Data: conc ~ Time | Subject
##   Subject   Wt Dose Time  conc
## 1       1 79.6 4.02 0.00  0.74
## 2       1 79.6 4.02 0.25  2.84
## 3       1 79.6 4.02 0.57  6.57
## 4       1 79.6 4.02 1.12 10.50
## 5       1 79.6 4.02 2.02  9.66
## 6       1 79.6 4.02 3.82  8.58

Create a linear model

We want to determine what factors influence drug concentration, but we will leave the subject identifier out of our model. The lm() function gives us ugly output that’s hard to parse computationally, but broom’s tidy() function lets us see the results in a data frame.

theophModel <- lm(conc ~ .-Subject, Theoph)
tidy(theophModel)
##          term    estimate   std.error  statistic      p.value
## 1 (Intercept) -4.45137327 23.47940955 -0.1895863 0.8499336699
## 2          Wt  0.06831789  0.18329388  0.3727233 0.7099708742
## 3        Dose  1.16715907  2.33131324  0.5006445 0.6174811240
## 4        Time -0.12571501  0.03473264 -3.6195064 0.0004237019
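Curious how the knitting itself happens? In RStudio you can simply click the “Knit” button, or you can render from the R console. Here’s a minimal sketch, assuming you’ve saved the R Markdown above in a file called theoph-analysis.Rmd (a hypothetical filename):

# Render the R Markdown file to the output format named in its YAML
# header (here, html_document). Assumes the rmarkdown package is
# installed and that the file is in the working directory.
library(rmarkdown)
render("theoph-analysis.Rmd")

This writes theoph-analysis.html next to the .Rmd file, ready to share.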

Want more?

If you want to read about literate programming (not just literate statistical programming), the seminal work was by Donald Knuth. Here’s the (very dense and dated) paper where he coined the phrase “literate programming”: Literate Programming (1983).

Roger Peng has a great intro video to literate statistical programming in R (including how to create R Markdown and use knitr).

Want to use literate statistical programming to write your entire paper in one place, complete with all the formatting required by the journal, embedded code and graphics, and so on? What about writing a textbook, a book chapter, or guides for your co-investigators or RAs? Want to make presentations for conferences using just one program (instead of cutting and pasting into PowerPoint)? Check out a gallery of R Markdown uses!
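For instance, turning an R Markdown analysis into a slide deck can be as simple as changing the output format in the YAML header. Here’s a minimal sketch (the title and author are just the placeholders from our earlier example):

---
title: "Sample Literate Statistical Programming"
author: "Ima Data-Guru"
output: ioslides_presentation
---

With output set to ioslides_presentation (or beamer_presentation for PDF slides), each header starts a new slide, and your code chunks and their outputs are embedded just as they are in an HTML report.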

At least one attempt has been made to bring literate statistical programming to SAS: SASweave. It’s hard to tell whether any work has been done over the past decade to allow SAS code to use a literate paradigm.