Joy Payton

Sparklines in ggplot2

Creating a Sparkline Visualization in ggplot2

Here’s how to build a faceted ggplot2 data visualization (multiple graphs, each showing one “facet” of the data by some category) that uses a sparkline approach. The final product will look like this:

Sparklines plot

This was from an assignment in my grad data viz class and I think it could be useful for others as well. We want a very simplified, relative frequency graph that shows building trends (so, when 20+ floor buildings spiked, etc.). We’re not interested in comparing counts across height categories: there are many more 1-10 floor buildings than 60+ floor buildings, so a shared y scale would have to be very tall and the detail in the smaller categories would be hard to make out. Nope, we’re interested in relative change within categories. This tutorial will show you lots of detail about ggplot2 and how to use it well!

We load necessary libraries:

library(dplyr)
library(ggplot2)

Then we obtain data – I’ve already obtained this from NYC public data, limited it to just the year built and number of floors, and added it to my GitHub. It’s over 800k rows, so it takes a while to load!

all_boroughs<-read.csv("https://raw.githubusercontent.com/pm0kjp/datastore/master/NYC_building_data.csv")

First, let’s eyeball the data before moving further on:

table(all_boroughs$YearBuilt)


    0  1661  1665  1706  1729  1765  1779  1780  1785  1798  1799  1800  1801  1802  1804  1805 
43388     1     1     1     1     1     1     1     2     1     1   166     1     1     1     1 
 1812  1814  1816  1821  1822  1823  1824  1825  1826  1827  1829  1830  1831  1832  1833  1834 
    1     1     1     2     3     1     5     3     5     3    20    11     6     5     1     7 
 1835  1836  1837  1838  1839  1840  1841  1842  1843  1844  1845  1846  1847  1848  1849  1850 
    6     7     2     5     6    21     9    11    11    18    19     8     9    13    19    43 
 1851  1852  1853  1854  1855  1856  1857  1858  1859  1860  1861  1862  1863  1864  1865  1866 
    8    14    12     6    13    11     6     7     9    35     3     2     3     2     8     6 
 1867  1868  1869  1870  1871  1872  1873  1874  1875  1876  1877  1878  1879  1880  1881  1882 
    4     8    12    57    19     5     4     4    15     2     1     3     3   160    49    11 
 1883  1884  1885  1886  1887  1888  1889  1890  1891  1892  1893  1894  1895  1896  1897  1898 
    8    12    25    18    15    18    17   427    49    13    14    12    51    42    19    29 
 1899  1900  1901  1902  1903  1904  1905  1906  1907  1908  1909  1910  1911  1912  1913  1914 
24945  9606 25713   135   182   273  7634   755   651   351   878 46787   707   721   579   601 
 1915  1916  1917  1918  1919  1920  1921  1922  1923  1924  1925  1926  1927  1928  1929  1930 
16451   650   517   254   309 91615  1158  1166  1636  2134 70860  3150  3277  4351  2133 77307 
 1931  1932  1933  1934  1935  1936  1937  1938  1939  1940  1941  1942  1943  1944  1945  1946 
32678  1238   967   372 25473   642   667   763   968 38373   746   324   201   177 24982   343 
 1947  1948  1949  1950  1951  1952  1953  1954  1955  1956  1957  1958  1959  1960  1961  1962 
  366   604   750 47901   764   692   568   647 26912   764   949   810   927 37219   797   921 
 1963  1964  1965  1966  1967  1968  1969  1970  1971  1972  1973  1974  1975  1976  1977  1978 
 1100   947 19659   616   617   631   507 17149   499   631   850   855  9747   657   728  1004 
 1979  1980  1981  1982  1983  1984  1985  1986  1987  1988  1989  1990  1991  1992  1993  1994 
  628  5465   553   778  1163  1333  3478  3742  3803  3597  3612  3508  2264  3173  2460  2130 
 1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010 
 2560  3478  3165  2763  3727  4192  4544  4189  4450  6537  6016  4858  4107  2923  1818  1204 
 2011  2012  2013  2014  2015  2016  2040 
 1412  1567  1393  1704  1557    43     1 

Sheesh, we have a lot built in year “0”. Let’s remove these!

all_boroughs<-all_boroughs %>% filter(YearBuilt!=0)

That’s better. Now let’s limit to buildings from 1800 onward, because we don’t care about really old building trends!

all_boroughs<-all_boroughs %>% filter(YearBuilt >=1800)

Let’s look again.

table(all_boroughs$YearBuilt)


 1800  1801  1802  1804  1805  1812  1814  1816  1821  1822  1823  1824  1825  1826  1827  1829 
  166     1     1     1     1     1     1     1     2     3     1     5     3     5     3    20 
 1830  1831  1832  1833  1834  1835  1836  1837  1838  1839  1840  1841  1842  1843  1844  1845 
   11     6     5     1     7     6     7     2     5     6    21     9    11    11    18    19 
 1846  1847  1848  1849  1850  1851  1852  1853  1854  1855  1856  1857  1858  1859  1860  1861 
    8     9    13    19    43     8    14    12     6    13    11     6     7     9    35     3 
 1862  1863  1864  1865  1866  1867  1868  1869  1870  1871  1872  1873  1874  1875  1876  1877 
    2     3     2     8     6     4     8    12    57    19     5     4     4    15     2     1 
 1878  1879  1880  1881  1882  1883  1884  1885  1886  1887  1888  1889  1890  1891  1892  1893 
    3     3   160    49    11     8    12    25    18    15    18    17   427    49    13    14 
 1894  1895  1896  1897  1898  1899  1900  1901  1902  1903  1904  1905  1906  1907  1908  1909 
   12    51    42    19    29 24945  9606 25713   135   182   273  7634   755   651   351   878 
 1910  1911  1912  1913  1914  1915  1916  1917  1918  1919  1920  1921  1922  1923  1924  1925 
46787   707   721   579   601 16451   650   517   254   309 91615  1158  1166  1636  2134 70860 
 1926  1927  1928  1929  1930  1931  1932  1933  1934  1935  1936  1937  1938  1939  1940  1941 
 3150  3277  4351  2133 77307 32678  1238   967   372 25473   642   667   763   968 38373   746 
 1942  1943  1944  1945  1946  1947  1948  1949  1950  1951  1952  1953  1954  1955  1956  1957 
  324   201   177 24982   343   366   604   750 47901   764   692   568   647 26912   764   949 
 1958  1959  1960  1961  1962  1963  1964  1965  1966  1967  1968  1969  1970  1971  1972  1973 
  810   927 37219   797   921  1100   947 19659   616   617   631   507 17149   499   631   850 
 1974  1975  1976  1977  1978  1979  1980  1981  1982  1983  1984  1985  1986  1987  1988  1989 
  855  9747   657   728  1004   628  5465   553   778  1163  1333  3478  3742  3803  3597  3612 
 1990  1991  1992  1993  1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005 
 3508  2264  3173  2460  2130  2560  3478  3165  2763  3727  4192  4544  4189  4450  6537  6016 
 2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2040 
 4858  4107  2923  1818  1204  1412  1567  1393  1704  1557    43     1 

So we know we have one property that was built in the future (2040)! Argh. Let’s remove that.

all_boroughs<-all_boroughs %>% filter(YearBuilt <=2016)

Let’s get an overview of storeys:

summary(all_boroughs$NumFloors)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   2.000   2.000   2.435   2.500 119.000 

table(all_boroughs$NumFloors)


     0    0.5      1   1.01   1.08    1.1   1.15    1.2   1.25   1.33   1.35    1.4    1.5 
  1068      1  78843      1      1      2      1      2     50      3      1      1  23751 
   1.6   1.65   1.66   1.67    1.7   1.75   1.76   1.78    1.8   1.85   1.87    1.9   1.99 
    18      1      4   8620      1   5090      4      1      2     20      1      8      1 
     2    2.1    2.2   2.25    2.3   2.33    2.4   2.45    2.5   2.55    2.6   2.66   2.67 
414901      1      1     77      5     20      2      1  81966      1      2      2    233 
   2.7   2.75   2.77    2.8   2.85   2.87    2.9   2.99      3   3.25   3.32   3.33    3.4 
     2  13443      1      1      3      1      1      1 119695      9      1      3      2 
   3.5    3.6   3.67   3.75      4   4.25    4.3   4.33    4.5   4.75      5   5.09   5.25 
  1030      2     12    148  27417      6      9      1    231     17  16639      1      1 
   5.5   5.75      6   6.25    6.5   6.75      7    7.5      8    8.5      9    9.5     10 
    62      4  12166      4     40      1   2199     12   1069      4    699      5    550 
  10.5     11   11.5     12   12.5     13   13.5     14   14.5     15   15.5     16   16.5 
     4    430      3    913      3    480      1    424      2    412      3    481      6 
    17     18     19     20   20.5     21     22     23     24     25     26     27     28 
   232    172    209    284      1    188    109     90     85     77     62     59     54 
  28.5     29     30     31     32     33     34     35     36     37     38     39     40 
     1     37     65     60     65     55     36     50     37     23     27     29     29 
    41     42     43     44     45     46     47     48     49     50     51     52     53 
    29     38     23     20     20     14     11     14      7     15     14      8      9 
    54     55     56     57     58     59     60   60.5     61     62     63     64     65 
    10      1      2     10      7      4      4      1      2      1      2      2      1 
    66     67     68     69     70     71     73     75     76     77     78     82     85 
     5      1      2      1      5      1      7      1      1      1      2      1      1 
    88     90    104    114    119 
     4      2      1      1      1 

We’ve got some “0” floors and one “half a floor”. Remove those bogus data.

all_boroughs<-all_boroughs %>% filter(NumFloors >= 1)
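The cleaning steps so far could also be written as a single `filter()` call, since conditions separated by commas are combined with AND. This is only a sketch on a made-up six-row data frame (`toy` and `cleaned` are illustrative names, not part of the original analysis); the column names match the real data.

```r
library(dplyr)

# Toy stand-in for the real data; only the column names match the post
toy <- data.frame(
  YearBuilt = c(0, 1750, 1800, 1931, 2016, 2040),
  NumFloors = c(2, 3, 0.5, 102, 1, 10)
)

cleaned <- toy %>%
  filter(
    YearBuilt >= 1800,  # drops year 0 and anything before 1800
    YearBuilt <= 2016,  # drops the building from the future
    NumFloors >= 1      # drops 0 floors and "half a floor"
  )
```

Doing it step by step, as above, has the advantage that you can look at the data between filters and decide what to remove next.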

We want to get a quick, understandable grasp of when NYC buildings of certain types really took off (when did 30+ storey buildings become popular?). We’re interested in quick understanding of trends, NOT numerical precision.

I think we want to use sparklines by tens of storeys. Grateful to http://stackoverflow.com/questions/35434760/sparklines-in-ggplot2 for the idea and implementation details!

Let’s create a new column that determines the general height in tens of floors, so we have a value for 0-9 floors, 10-19, 20-29, etc. First, divide the # of floors by 10 and just keep the whole number, discarding any fractional part (so, 3 floors gives us 0, and 15 floors gives us 1). This gives us some categories.

all_boroughs$height_type<-trunc(all_boroughs$NumFloors/10)

Then give understandable labels to each type.

all_boroughs$height_type[which(all_boroughs$height_type>=6)]<-"60+"
all_boroughs$height_type[which(all_boroughs$height_type==0)]<-"0-9"
all_boroughs$height_type[which(all_boroughs$height_type==1)]<-"10-19"
all_boroughs$height_type[which(all_boroughs$height_type==2)]<-"20-29"
all_boroughs$height_type[which(all_boroughs$height_type==3)]<-"30-39"
all_boroughs$height_type[which(all_boroughs$height_type==4)]<-"40-49"
all_boroughs$height_type[which(all_boroughs$height_type==5)]<-"50-59"
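As an alternative to truncating and then relabelling line by line, base R’s `cut()` can bin and label in one step. The sketch below is not what this post uses; it runs on a small made-up vector (`floors`) just to show that the same categories come out.

```r
# Made-up floor counts covering the edges of each bin
floors <- c(1, 9.5, 10, 19, 25, 59, 60, 119)

height_type <- cut(
  floors,
  breaks = c(0, 10, 20, 30, 40, 50, 60, Inf),
  labels = c("0-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60+"),
  right  = FALSE  # bins are [0, 10), [10, 20), ... so 10 floors lands in "10-19"
)

as.character(height_type)
```

One difference to be aware of: `cut()` returns a factor rather than a character column, so the facets would follow the order of `labels` instead of alphabetical order (here the two orders happen to agree).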

Next, get a basic table that shows, for each year and height type, how many buildings we have. We’re going to use two dplyr tools here. The first is “group_by”, which, well, groups data. The second is “summarise” (or “summarize” – you can use either English variant), which allows you to get summary statistics like the count, mean, etc. on a per-group basis. The %>% pipe, which dplyr makes available (it originates in the magrittr package), allows you to chain together commands, so that the output of one becomes the input to the next.

storey_trend<-all_boroughs %>% 
              group_by(height_type, YearBuilt) %>% 
              summarise(count=n())
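To see what group_by plus summarise does without waiting on 800k rows, here is a small sketch on a made-up five-row data frame (`toy`, `by_hand` and `shortcut` are illustrative names). It also shows dplyr’s `count()`, which I believe gives the same counts in one call.

```r
library(dplyr)

# Five made-up buildings: three short ones, two mid-rise
toy <- data.frame(
  height_type = c("0-9", "0-9", "0-9", "10-19", "10-19"),
  YearBuilt   = c(1900, 1900, 1901, 1900, 1900)
)

# Same pattern as above: one row per (height_type, YearBuilt) pair
by_hand <- toy %>%
  group_by(height_type, YearBuilt) %>%
  summarise(count = n())

# count() groups, tallies and ungroups in one step
shortcut <- toy %>% count(height_type, YearBuilt, name = "count")
```

Both results have three rows. The `summarise()` version stays grouped by height_type, which is why the printed output below mentions “Groups”.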

Take a quick peek at our summary table – much more wieldy at only 646 rows!

head(storey_trend)


Source: local data frame [6 x 3]
Groups: height_type [1]

  height_type YearBuilt count
        <chr>     <int> <int>
1         0-9      1800   164
2         0-9      1801     1
3         0-9      1802     1
4         0-9      1804     1
5         0-9      1805     1
6         0-9      1812     1


tail(storey_trend)


Source: local data frame [6 x 3]
Groups: height_type [1]

  height_type YearBuilt count
        <chr>     <int> <int>
1         60+      2008     3
2         60+      2009     2
3         60+      2012     2
4         60+      2013     3
5         60+      2014     6
6         60+      2015    16

Plot some sparklines, using the count.

ggplot wants (data_you_use, aes(elements_from_that_data_to_plot)). Then you have to add layers to say how to plot it. The plus sign allows you to add layers. So we start simply:

ggplot(storey_trend, aes(x=YearBuilt, y=count) ) + 

We want to divide up our graph into several different graphs, one per kind of height (tens of floors). For that, we’ll use facet_grid.

facet_grid wants to know how to lay out the facets, written as rows ~ columns. We want one row for each kind of height, and no split into columns at all, which is what the dot after the tilde ~ symbol means here. facet_grid also wants to know your scales: do all the graphs share the same x and y scale, or can each graph have its own? In our case, we want the time, on the X-axis, to be the same, but we don’t need y to match. Y can be “free”:

facet_grid(height_type ~ ., scales = "free_y") + 

But we still haven’t told ggplot what kind of graph we want! Line? Point? Histogram? Boxplot? What geometric element (geom) do we want?

geom_line()
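Put together, the three pieces above form one expression. Here is a self-contained sketch you can run without downloading anything; `toy_trend` is a made-up stand-in for storey_trend with the same three columns.

```r
library(ggplot2)

# Made-up counts on very different scales, like the real categories
toy_trend <- data.frame(
  height_type = rep(c("0-9", "10-19"), each = 4),
  YearBuilt   = rep(c(1900, 1910, 1920, 1930), times = 2),
  count       = c(500, 900, 1500, 700, 3, 8, 20, 12)
)

p <- ggplot(toy_trend, aes(x = YearBuilt, y = count)) +  # data and aesthetics
  facet_grid(height_type ~ ., scales = "free_y") +       # one row per height type
  geom_line()                                            # draw each facet as a line

# Typing p (or print(p)) draws the plot
```

Because of `scales = "free_y"`, the “10-19” line fills its own panel even though its counts are tiny next to the “0-9” ones.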

First attempt at sparklines plot

Ok, this is good, but the time series is too jagged. Maybe our time unit should be decades instead of single years. Let’s try that! We can do a similar thing to what we did with floors – lop off the last digit of the year and replace it with zero to get the decade.

all_boroughs$decade<-trunc(all_boroughs$YearBuilt/10)*10
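A quick check of that arithmetic on a few made-up years, as a sketch. It also shows R’s integer-division operator `%/%`, which should give the same result for positive years.

```r
years <- c(1899, 1900, 1915, 2016)

# Divide by ten, drop the fraction, multiply back up
decade_trunc <- trunc(years / 10) * 10

# Same idea with integer division
decade_intdiv <- years %/% 10 * 10

decade_trunc
```

So 1899 falls in the 1890s and 2016 in the 2010s, which is what we want.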

Get a basic table that shows for each decade and height type, how many were built:

storey_trend_decade<-all_boroughs %>% 
                    group_by(height_type, decade)  %>% 
                    summarise(count=n())

Plot some sparklines, using the count:

ggplot(storey_trend_decade, aes(x=decade, y=count) ) + 
  facet_grid(height_type ~ ., scales = "free_y") + 
  geom_line()

Improved attempt at sparklines plot

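This is much more understandable! Let’s make the plot look nicer. But first, how about some 1st-3rd quartile sectioning, to show where building rates were on the low / high side? We compute the first and third quartile of the decade counts within each height type, then join them back onto the table so every row carries its category’s quartiles. We can use this data to create a horizontal ribbon in each facet of our graph.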

storey_trend_decade <-  storey_trend_decade %>%
                        group_by(height_type) %>%
                        summarize(quart1 = quantile(count, 0.25),
                                  quart3 = quantile(count, 0.75)) %>%
                        right_join(storey_trend_decade, by = "height_type")
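The summarize-then-join above is one way to attach per-group quartiles to every row. Another way, sketched here on made-up data (`toy_decade` and `with_quartiles` are illustrative names), is a grouped `mutate()`, which computes the value per group and repeats it on each row of that group without a join.

```r
library(dplyr)

# Two made-up height types, four decades each
toy_decade <- data.frame(
  height_type = rep(c("0-9", "60+"), each = 4),
  decade      = rep(c(1900, 1910, 1920, 1930), times = 2),
  count       = c(100, 200, 300, 400, 1, 2, 3, 4)
)

with_quartiles <- toy_decade %>%
  group_by(height_type) %>%
  mutate(quart1 = quantile(count, 0.25, names = FALSE),  # per-group 1st quartile
         quart3 = quantile(count, 0.75, names = FALSE)) %>%  # per-group 3rd quartile
  ungroup()
```

The column order differs from the joined version (the quartiles end up last), but ggplot refers to columns by name, so the plot code would not care.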

head(storey_trend_decade)


# A tibble: 6 x 5
  height_type quart1 quart3 decade count
        <chr>  <dbl>  <dbl>  <dbl> <int>
1         0-9 130.25  58017   1800   168
2         0-9 130.25  58017   1810     3
3         0-9 130.25  58017   1820    39
4         0-9 130.25  58017   1830    56
5         0-9 130.25  58017   1840   137
6         0-9 130.25  58017   1850   128


tail(storey_trend_decade)


# A tibble: 6 x 5
  height_type quart1 quart3 decade count
        <chr>  <dbl>  <dbl>  <dbl> <int>
1         60+    1.5      8   1930     5
2         60+    1.5      8   1940     1
3         60+    1.5      8   1970     1
4         60+    1.5      8   1980     4
5         60+    1.5      8   2000    11
6         60+    1.5      8   2010    27

Great, let’s do the new ggplot!

First line: data we’re using and “aesthetics” (variables) we’re using to plot

ggplot(storey_trend_decade, aes(x=decade, y=count) ) +

Sectioning our graph into several independent graphs based on a category

facet_grid(height_type ~ ., scales = "free_y") +

Add a layer (a ribbon) that covers the area from 1Q to 3Q for each facet. Label this ribbon in the legend with “1-3 Quartile”.

geom_ribbon(aes(ymin = quart1, ymax = quart3, fill="1-3 Quartile")) +

And make sure that ribbon is a light grey.

scale_fill_manual(values=c('#eeeeee')) +

On top of that ribbon, plot the line (just as we did before!)

geom_line() +

Instead of the typical ggplot grey background, we want the “black and white” theme, which has some cosmetic improvements, in my opinion.

theme_bw() +

We want to label more years on the X-axis, not just the default every-fifty-years marks, and to fit them in, we’ll have to tilt our X-axis labels:

theme(axis.text.x = element_text(angle = 60, hjust = 1)) +

Which labels do we want? A label for every single decade we have in our data. If it’s missing in our data, leave it out of the axis labels, too.

scale_x_continuous(breaks = unique(storey_trend_decade$decade)) +

But with so many years, it can be hard to follow a single light line all the way up from the bottom of the graph to the top facet. Let’s plot a few vertical lines in a transparent red to help guide the eye. We’ll do them from 1800 to 2000, in 50 year increments.

geom_vline(xintercept=seq(1800, 2000, 50), color="red", alpha=0.3) +

We don’t care about the y values at all. They just clutter the graph, so remove the y axis tick marks and their labels altogether!

scale_y_continuous(breaks = NULL) +

Give the graph a good title, but remove the x and y axis labels, because the title will explain everything quite well.

labs(title="Structures Built in NYC\nby Decade and Number of Floors", x="", y="") +

Finally, remove the legend title also:

theme(legend.title=element_blank())
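For copy-and-paste purposes, here are all of those layers as one expression. This is a sketch on a made-up data frame (`toy`, with hard-coded quart1 and quart3 columns) so it runs on its own; the ribbon’s upper bound is written as ymax, and the vertical guide lines are adjusted to the toy data’s shorter year range.

```r
library(ggplot2)

# Made-up stand-in for storey_trend_decade, quartiles filled in by hand
toy <- data.frame(
  height_type = rep(c("0-9", "60+"), each = 5),
  decade      = rep(seq(1900, 1940, 10), times = 2),
  count       = c(100, 400, 900, 300, 200, 1, 0, 5, 2, 8),
  quart1      = rep(c(200, 1), each = 5),
  quart3      = rep(c(400, 5), each = 5)
)

p <- ggplot(toy, aes(x = decade, y = count)) +
  facet_grid(height_type ~ ., scales = "free_y") +        # one row per height type
  geom_ribbon(aes(ymin = quart1, ymax = quart3, fill = "1-3 Quartile")) +
  scale_fill_manual(values = c("#eeeeee")) +              # light grey ribbon
  geom_line() +                                           # the sparkline itself
  theme_bw() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  scale_x_continuous(breaks = unique(toy$decade)) +       # a label per decade present
  geom_vline(xintercept = seq(1900, 1940, 20), color = "red", alpha = 0.3) +
  scale_y_continuous(breaks = NULL) +                     # no y ticks or labels
  labs(title = "Structures Built in NYC\nby Decade and Number of Floors",
       x = "", y = "") +
  theme(legend.title = element_blank())

# Typing p (or print(p)) draws the plot
```

With the real data, swap `toy` for storey_trend_decade and put the vertical lines back to seq(1800, 2000, 50).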

Final sparklines plot

Looks great!