Elsewhere in the Arcus Education Portal you will find Descriptive Statistics: The Bullet. It is a simple, straightforward summary of how to present your variables to yourself and others. The article you are reading now is the beginning of a series in which we detail the theory and practice of making statistical methodology decisions and carrying them out. These two articles represent two approaches to statistical analysis at CHOP.
Our simple, straightforward summaries of statistical methods are useful to anyone who wants to get the job done without taking extra time to consider background concepts. It is a good approach: follow simple directions without delving too far into the theory, because you can get a lot of good information from the output of a few basic R functions and some ggplot visualizations (see Joy Payton’s article about ggplot if you would like to learn more about that).
Continue on here to begin hearing the numbers tell you their stories; or predicting the future based on the past; or finding the answers to gnarly puzzles others have given up on; or to ask “Why?” and “Why?” and “Why?” again; or to let your curiosity out of the bag to run rampant among universes of multidimensional data. Continue on to unlock statistics.
Surprisingly, the term statistic first came into use as late as 1817. This is from the Oxford English Dictionary:
statistic. A fact or a piece of data obtained from a study of a large quantity of numerical data.
The term came to English from the German (where it lived before that I do not know) and seems to have emerged as a way of explaining aggregated data, or data which one has subjected to the process of removing information in order to gain information. An example of aggregating data is the simple process of finding the mean of a variable such as height or weight. By focusing on the mean, we lose the individual values, but we gain knowledge of the “average.” Nowadays we assume this is a valid process. It was not always accepted, however. See The Argument Against Aggregation for Claude Bernard’s forceful statements in favor of looking only at individuals rather than at “meaningless” statistics.
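To make the trade-off concrete, here is a minimal sketch in Python (my choice of language for illustration; the heights are invented, not real measurements). Two quite different groups collapse to the same mean, which is exactly the information we give up when we aggregate.

```python
from statistics import mean

# Hypothetical heights in centimetres; both groups are invented for illustration.
group_a = [150, 160, 170]
group_b = [120, 160, 200]

# Aggregating: each group of three values is reduced to a single number.
mean_a = mean(group_a)
mean_b = mean(group_b)

# The means agree even though the individual values do not.
print(mean_a, mean_b)
print(group_a == group_b)
```

The mean tells us what is "typical" in each group, but it cannot tell us that the second group is far more spread out than the first.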
Populations vs. Samples
On March 19th, 2018, the world’s last male northern white rhinoceros, Sudan, died at the age of 45. Two female rhinos remain: his daughter Najin and his granddaughter Fatu. If we want to find the population mean weight of northern white rhinos, we weigh Najin and Fatu, add their weights together, and divide by the number of rhinos we weighed: 2.
I tried hard to find an estimate of the number of cockroaches in New York City. I couldn’t find one. Then I simplified my Google search to “How many cockroaches are there” and found this article by Larry Yundelson (2009), which contains the rather chilling statement, “It is nearly impossible to calculate the number of cockroaches that exist worldwide due to the fact that so many already exist and are reproducing at such a fast pace.” An unidentified source suggests, Yundelson continues, that there is an average of 36,000 cockroaches in each building in some parts of America. And before you ask, he did not say which parts.
Another hit was a Quora answer to the question written by Carlos Ruano (whose status was “Killed some roaches”):
There is probably an average 1000 cockroaches per acre of wooded and edge ecosystems. There is probably an average of 100,000 cockroaches in each average sized New York City studio apartment. There might be a baby cockroach in your ear right now. Basically, there are a lot of f’ng cockroaches in the world.
Without major advances in technology, we will never know the population mean of the length of, say, American cockroaches. We do know the sample mean, derived from I don’t know who measuring I don’t know how many American cockroaches: 1.5 inches. The sample statistic is always derived from fewer than the total number of subjects in a population. The population statistic (strictly speaking, statisticians call it a parameter) is always derived from measuring every single member of a subject population. We could get a sample mean weight of northern white rhinos, then, by weighing only Najin, and we would have a wrong idea of the species generally, because Sudan, the last male, was much larger than his daughter or his granddaughter.
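The difference between the two kinds of mean can be sketched in a few lines of Python. The weights below are invented placeholders, not real measurements of Najin or Fatu; only the procedure matters here.

```python
from statistics import mean

# Invented weights in kilograms, for illustration only; not real measurements.
population = {"Najin": 1600, "Fatu": 1500}

# Population mean: every member of the population is measured.
population_mean = mean(list(population.values()))

# Sample mean: only some members are measured (here, a sample of one).
sample = [population["Najin"]]
sample_mean = mean(sample)

print(population_mean)
print(sample_mean)
```

With these made-up numbers the sample mean misses the population mean, and nothing inside the sample itself warns us of that. For rhinos we can simply weigh everyone; for cockroaches a sample is all we will ever have.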
Measures of Central Tendency in a Population
Now we begin with what most people think of when you say “descriptive statistics” to them (if you just thought “Nee!” you belong here). I could reinvent the wheel and tell you all about descriptive statistics in an upbeat yet soothing manner, but Khan Academy beat me to it. I can’t improve on their video discussion of measures of central tendency, so here it is.
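If you would like to check the arithmetic for yourself, here is a minimal sketch using Python's standard `statistics` module. The values are invented for illustration, and we treat them as a complete population.

```python
from statistics import mean, median, mode

# Invented values for illustration, treated as a whole population.
values = [2, 3, 3, 5, 7, 10]

# Mean: the sum of the values (30) divided by how many there are (6).
the_mean = mean(values)

# Median: the middle value; with an even count, the midpoint of the two
# middle values (3 and 5).
the_median = median(values)

# Mode: the most frequently occurring value.
the_mode = mode(values)

print(the_mean, the_median, the_mode)
```

Notice that the three measures disagree (5, 4.0, and 3), which is a hint that the data are not symmetric: the single large value pulls the mean upward more than it moves the median.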
Measures of Dispersion in a Population
Knowing how and where a particular variable’s data points clump together is important to describe to yourself and your audience, as we have just seen. We balance that information with knowledge of how data spread out. The following videos (again Khan Academy has created the best description) show why and how we calculate measures of dispersion.
By the end of that video you may be wondering just how useful a number can be when it is obviously bigger than any spread we can see between the data points. That is why I recommend you go right on and watch the second video, a discussion of the population standard deviation. The standard deviation is an extremely useful number, especially when accompanied by a measure of central tendency.
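Here is a small sketch of that puzzle in Python, assuming the first number in question is the population variance. The data are invented. The variance is computed by hand and then checked against the standard library's `pvariance` and `pstdev`.

```python
from math import sqrt
from statistics import pstdev, pvariance

# Invented values for illustration, treated as a whole population.
values = [10, 20, 30, 40, 50]

# Population mean.
mu = sum(values) / len(values)

# Population variance: the mean of the squared deviations from mu.
squared_deviations = [(x - mu) ** 2 for x in values]
variance_by_hand = sum(squared_deviations) / len(values)

# Population standard deviation: the square root of the variance,
# which brings us back to the original units.
sd_by_hand = sqrt(variance_by_hand)

# The widest gap we can actually see in the data.
data_range = max(values) - min(values)

print(variance_by_hand, pvariance(values))
print(sd_by_hand, pstdev(values))
print(data_range)
```

The variance here (200) is larger than the full range of the data (40) because it is measured in squared units. Taking the square root gives a standard deviation of roughly 14, a number we can sensibly compare with the data points themselves.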
Measures of Central Tendency and Dispersion in a Sample
The methods described in the next two videos make more sense when you want to show descriptive statistics for cockroaches rather than rhinos.
Sal left a cliffhanger at the end of the previous video: How do we get an unbiased estimate of the population standard deviation?
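The usual answer, and I am assuming it is the one the video gives, is to divide the sum of squared deviations by n - 1 instead of n. The sketch below uses invented sample values and compares the two divisors; Python's `statistics.variance` and `statistics.stdev` use the n - 1 version.

```python
from statistics import pvariance, stdev, variance

# Invented sample values for illustration.
sample = [4, 8, 6, 2]

n = len(sample)
x_bar = sum(sample) / n
sum_of_squares = sum((x - x_bar) ** 2 for x in sample)

# Dividing by n treats the sample as if it were the whole population.
divide_by_n = sum_of_squares / n

# Dividing by n - 1 gives the unbiased sample variance.
divide_by_n_minus_1 = sum_of_squares / (n - 1)

print(divide_by_n, pvariance(sample))
print(divide_by_n_minus_1, variance(sample))
print(stdev(sample))
```

Dividing by the smaller number always gives a slightly larger estimate, which compensates for the fact that a sample tends to cluster around its own mean more tightly than around the true population mean. Strictly, it is the variance estimate that is unbiased; its square root is the conventional estimate of the standard deviation.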
Coming Up Next
Now that you have a good idea of how measures of central tendency (the mean, median, and mode) and measures of dispersion (the variance and the standard deviation) can help you and your audience understand data without having to look at every data point, let’s move on to ways to understand descriptive statistics for non-scalar variables.