Statistical Intervals and Visualizations: Difference Between Means

Geoff Cumming’s 2014 article, “The New Statistics, Why and How,” in Psychological Science, 2014, Vol. 25(1) 7–29, outlines an eight-step statistical strategy for improving the reproducibility of research studies. Chief among Cumming’s points is that point estimates are not sufficiently rigorous or informative and that null-hypothesis testing is prone to failure to reproduce. He includes several data visualizations that include intervals such as error bars, and in a series of articles, I show you how to reproduce these visualizations. This article takes on the Differences Between Means visualization, Cumming’s “Figure 1”.

This graphing code is very complex. If you want to see it built up bit by bit, starting with point estimates, and gradually adding more elements, that will definitely help you with ggplot2. Go check it out at my Rpubs page for this article.

If you’d rather just download a package that handles this, I wrote one. Skip ahead and find out more about that!

Feeling gutsy? Run the whole thing below!

Let’s take a look at the first of Cumming’s visualizations:

Cumming Figure 1

Cumming’s first figure is a demonstration of the statistical principles underlying what confidence intervals are: most intervals shown contain the actual mean, but a couple do not. While this particular plot does not apply to research data (in which the actual population difference is unknown and the inference of the difference is the whole point), the visualization has lots of elements that can be very helpful. Here’s how to create a plot like this in R, using ggplot2. We’ll start with creating two sets of fake data – one which imitates the mean differences between two populations (the top graph) and one which mimics a perfect normal distribution (the bottom graph). Once we have our fake data generated, we can plot it!

Generate Fake Data (Top Plot)

Let’s generate some fake data and draw some samples, shall we?

set.seed(42)
myControls <- data.frame(value=rnorm(2000, mean = 30, sd=20))
myData <- data.frame(value=rnorm(2000, mean = 40, sd=20))
cases <- NULL
controls <- NULL
for (i in 1:25) {
  cases[[i]] <- sample(myData$value, size=32)
  controls[[i]] <- sample(myControls$value, size=32)
}

Now we have 25 samples of n=32 from cases and 25 samples of n=32 from controls, each drawn out of normally distributed datasets with mean difference between them of 10 and standard deviation of 20.

Point and Interval Estimates (Top Plot)

How can we plot our sample point estimates (i.e. mean) and 95% CI? Well, first, let’s calculate them!

I’m assuming a T distribution for calculating our standard error, even though we happen to know the population standard deviation. This is so odd and unlike typical research that I’d rather just do a T test, since that’s what we do in real life.

Still, you could do the Z if you wanted to.

I’ll start by making an estimates data frame. We have 25 samples of cases and 25 samples of controls, so I’ll do pairwise comparisons – we’ll have 25 rows in estimates.

  • I’ll set “upper” and “lower” – which will be the edges of my 95% confidence interval – to just have NA values (for now)
  • “meanDiff” will be the sample mean differences from corresponding rows of cases and controls (so mean of cases minus mean of controls for each row)
  • “caseSSE” will be the sum of squared errors from the cases in each sample
  • “controlSSE” will be the sum of squared errors from the controls
  • “sd”, blank for now, will be the pooled sample standard deviation
estimates <- data.frame(lower=NA, 
               meanDiff=sapply(cases, mean)-sapply(controls, mean),
               caseSSE= sapply(cases, function(x) sum((x-mean(x))^2)),
               controlSSE = sapply(controls, function(x) sum((x-mean(x))^2)),
               sd=NA,
               upper=NA)

Now that I’ve got my SSE in place, I can come up with the pooled sd quite easily:

estimates$sd <- sqrt((estimates$caseSSE+estimates$controlSSE)/64)

Now I have to figure out the standard error for each sample.

se <- estimates$sd/sqrt(32)  # basic stats here

Here I choose my significance level to figure out how many se’s I want to go out in either direction.

# we have two tails, so we want the 97.5%, not the 
# 95%, quantile (that gives us 2.5% at the end) to get
# the number of standard deviations away (T-statistic).

tBound <- qt(0.975, df=31)

# if you wanted the Z-statistic, you could:

zBound <-qnorm(0.975)

estimates$lower <- estimates$meanDiff - se*tBound # or zBound
estimates$upper <- estimates$meanDiff + se*tBound # or zBound

I’ll also set a “problem” variable in my dataset which will be TRUE if the upper CI limit is below the true mean or the lower CI limit is above the true mean. Otherwise, it’ll be false.

estimates$problem = estimates$lower >10 | estimates$upper < 10

Significance (Top Plot)

Let’s also consider what our p-values are. There are a couple of ways to do this. Lets use t.test() and extract the p-values from the list (a bit messy, but worth it, because it handles the two-tailed thing easily, unlike some other R commands!). Keep in mind that our null hypothesis is that the mean is 0. Since our actual population mean is 10 (not that we know this in real life), we expect that we’ll find some evidence to discard our null hypothesis on at least some samples. Let’s see if that’s the case.

tTest <- mapply(t.test, x=controls, y=cases)
# flip it and make it a data frame
tTest <- as.data.frame(t(tTest))
estimates$p <- unlist(tTest$p.value)
estimates$p <- round(estimates$p, 4)

Let’s also add significance indicators:

estimates$significance <- ""
estimates$significance[estimates$p<.05] <- "*"
estimates$significance[estimates$p<.01] <- "**"
estimates$significance[estimates$p<.001] <- "***"

To plot each sample on its own line, we’ll note which sample is which by adding a sampleNum column in our data frame.

estimates$sampleNum <- as.numeric(row.names(estimates))

Generate Fake Data (Bottom Plot)

Let’s plot a “perfect” sampling distribution of what we’d expect if we sampled not 25, but infinitely many samples of size 32 out of an infinite pair of populations, each with a sd of 20 and a difference of population means of 10.

In actuality, we’ll do a random collection of one million. That’s pretty darn close!

popDifferenceSE <- sqrt(20^2+20^2)/sqrt(32)
fakeData<-data.frame(value=rnorm(1000000, mean=10, sd=popDifferenceSE))

Graph it!

# get libraries we need for plotting and stacking the plots

library(ggplot2)
library(grid)
library(gridExtra)

# set colors
problemColors <- c("TRUE"="red", "FALSE"="darkgrey")
colorScale <- scale_colour_manual(name="problem", values=problemColors)

# set the data and the most used elements
finalTop <- ggplot(data=estimates, aes(x=meanDiff, y=sampleNum)) +
  
  # add error bars, parameterized by other columns of 'estimates'
  geom_errorbarh(aes(xmin=lower,xmax=upper, color=problem)) +
  
  # add point estimate, colored according to the problem column of 'estimate'  
  geom_point(aes(color=problem))  + 
  
  # draw a vertical line at the true difference in mean.
  geom_vline(xintercept = 10, color="blue") +
  
  # flip the y axis
  scale_y_reverse() + 
  
  # Add the text for the mean value at the same x value (so a vertical
  # column), but with the y value matching the sample.  Round to 2 decimal pts.
  geom_text(aes(x=-24, y=sampleNum, 
                label=paste("Mean:", round(meanDiff,2))),
            size = 2.5, hjust="inward") +
  
  # Add the text for the p value at the same x value (so a vertical
  # column), but with the y value matching the sample.  Round to 2 decimal pts.
  geom_text(aes(x=-17, y=sampleNum, 
                label=paste("p: ",
                            p, sep="")),
            size = 2.5, hjust="inward") +
  
  # make sure the graph is wide enough to include the text we added  
  scale_x_continuous(limits=c(-26,30)) +
  
  # color the "problem" status according to a scale we set up separately
  colorScale +
  
  # Get rid of gridlines, axes
  theme_void() +
  
  # Some theme changes. Get rid of the legend.
  theme(legend.position = "none",
        
  # Position the plot title at 65% of the width, just over our true mean diff.
        plot.title = element_text(hjust = 0.65),
  
  # Also make sure there's space at the top, but none on bottom!
  plot.margin = unit(c(10, 0, -2, 0), "pt")) +
        
  # Add a title to the graph
  ggtitle("Mean, 95% CI") +
  
  # Add the significance asterisks, drawn from a column in 'estimates'
  geom_text(aes(x=-12, y=sampleNum, 
                label=estimates$significance),
            size = 2.5, hjust="inward") 

# set the data and the most used elements
finalBottom <- ggplot(data = fakeData, mapping = aes(value)) +
  
  # make sure the graph is wide enough to match the top graph 
  scale_x_continuous(limits=c(-26,30)) +
  
  # We want to draw a density curve
  geom_density() +
  
  # draw a vertical line at the true difference in mean.
  geom_vline(xintercept = 10, color="blue") +
  
  # start with a minimal theme.  We can't use theme_void bc it 
  # clobbers the axis numbers, which we want.
  theme_minimal() +
  
  # Some theme changes. Put a light grey line in for major vertical grid lines
  theme(panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.y = element_blank(),
        
        # Remove all the axis text except the x values
        axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.title.x=element_blank(),
        axis.ticks.x=element_blank(),
        
        # Also make sure there's space at the bottom, but none on top!
        plot.margin = unit(c(-2, 0, 10, 0), "pt"))
  

finalComplete <- grid.arrange(finalTop, finalBottom, 
                              ncol = 1, heights = c(4, 1))

This is what the final visualization should look like:

ggplot2 visualization

Using Joy’s Package to do Point + Interval Means Estimate Plotting

I know that the above code is hard to implement and remember, so I wrote a very simple R package that will minimally plot your point and interval estimates, for one or more samples and a given confidence level. You can optionally color points and/or error bars as well as flip the y axis. Let’s try it!

First, we’ll load my package, called ciPlotter. You’ll need to have installed the devtools package.

library(devtools) # you must use devtools to install from a GitHub clone
install_github("paytonk/RPackages/ciPlotter", 
               host="https://github.research.chop.edu/api/v3", quiet=TRUE)
library(ciPlotter)

There’s only one (public) function there, plotEstimates. Let’s see what it does. First, let’s plot the sample estimates for “cases”, with no other parameters specified, then using the additional parameters to tailor the plot’s appearance.

Note that in this case, we are passing plotEstimates a list that consists of numerical vectors.

plotEstimates(cases)

point and interval estimates visualization

plotEstimates(cases, 
              confLevel = 0.80, 
              pointColor = "blue", 
              ciColor = "brown",
              yaxis = "reverse")

point and interval estimates visualization

If you don’t have a list of numerical vectors, but just have one sample, you can just pass in a single numerical vector, as long as it has more than one number in it:

plotEstimates(c(10,9,8,12,12,19,10,15))

point and interval estimates visualization

It’s not a full or perfect solution, but it can help you visualize your point and confidence interval estimates in a jiffy. If you like and use this package, let me know!