Joy Payton

# Creating a Sparkline Visualization in ggplot2

Here’s how to build a faceted ggplot2 data visualization (multiple graphs showing “facets” of the data by some category) that uses a sparkline approach. The final product will look like this:

This was from an assignment in my grad data viz class, and I think it could be useful for others as well. We want a very simplified, relative frequency graph that shows building trends (so, when 20+ floor buildings spiked, etc.). We’re not interested in comparing counts across height categories: there are many more 1-10 floor buildings than 60+ floor buildings, so showing that comparison would mean a very tall y scale on which detail would be hard to make out. Nope, we’re interested in relative change within categories. This tutorial walks through a lot of ggplot2 detail along the way!

library(dplyr)
library(ggplot2)


Then we obtain data – I’ve already pulled this from NYC public data, limited it to just the year built and number of floors, and added it to my GitHub. It’s over 800k rows, so it takes a while to load!

all_boroughs<-read.csv("https://raw.githubusercontent.com/pm0kjp/datastore/master/NYC_building_data.csv")


First, let’s eyeball the data before moving further on:

table(all_boroughs$YearBuilt)

0  1661  1665  1706  1729  1765  1779  1780  1785  1798  1799  1800  1801  1802  1804  1805
43388     1     1     1     1     1     1     1     2     1     1   166     1     1     1     1
1812  1814  1816  1821  1822  1823  1824  1825  1826  1827  1829  1830  1831  1832  1833  1834
1     1     1     2     3     1     5     3     5     3    20    11     6     5     1     7
1835  1836  1837  1838  1839  1840  1841  1842  1843  1844  1845  1846  1847  1848  1849  1850
6     7     2     5     6    21     9    11    11    18    19     8     9    13    19    43
1851  1852  1853  1854  1855  1856  1857  1858  1859  1860  1861  1862  1863  1864  1865  1866
8    14    12     6    13    11     6     7     9    35     3     2     3     2     8     6
1867  1868  1869  1870  1871  1872  1873  1874  1875  1876  1877  1878  1879  1880  1881  1882
4     8    12    57    19     5     4     4    15     2     1     3     3   160    49    11
1883  1884  1885  1886  1887  1888  1889  1890  1891  1892  1893  1894  1895  1896  1897  1898
8    12    25    18    15    18    17   427    49    13    14    12    51    42    19    29
1899  1900  1901  1902  1903  1904  1905  1906  1907  1908  1909  1910  1911  1912  1913  1914
24945  9606 25713   135   182   273  7634   755   651   351   878 46787   707   721   579   601
1915  1916  1917  1918  1919  1920  1921  1922  1923  1924  1925  1926  1927  1928  1929  1930
16451   650   517   254   309 91615  1158  1166  1636  2134 70860  3150  3277  4351  2133 77307
1931  1932  1933  1934  1935  1936  1937  1938  1939  1940  1941  1942  1943  1944  1945  1946
32678  1238   967   372 25473   642   667   763   968 38373   746   324   201   177 24982   343
1947  1948  1949  1950  1951  1952  1953  1954  1955  1956  1957  1958  1959  1960  1961  1962
366   604   750 47901   764   692   568   647 26912   764   949   810   927 37219   797   921
1963  1964  1965  1966  1967  1968  1969  1970  1971  1972  1973  1974  1975  1976  1977  1978
1100   947 19659   616   617   631   507 17149   499   631   850   855  9747   657   728  1004
1979  1980  1981  1982  1983  1984  1985  1986  1987  1988  1989  1990  1991  1992  1993  1994
628  5465   553   778  1163  1333  3478  3742  3803  3597  3612  3508  2264  3173  2460  2130
1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010
2560  3478  3165  2763  3727  4192  4544  4189  4450  6537  6016  4858  4107  2923  1818  1204
2011  2012  2013  2014  2015  2016  2040
1412  1567  1393  1704  1557    43     1


Sheesh, we have a lot built in year “0”. Let’s remove these!

all_boroughs<-all_boroughs %>% filter(YearBuilt!=0)


That’s better. Now let’s limit to post-1800, because we don’t care about really old building trends!

all_boroughs<-all_boroughs %>% filter(YearBuilt >=1800)


Let’s look again.

table(all_boroughs$YearBuilt)

1800  1801  1802  1804  1805  1812  1814  1816  1821  1822  1823  1824  1825  1826  1827  1829
166     1     1     1     1     1     1     1     2     3     1     5     3     5     3    20
1830  1831  1832  1833  1834  1835  1836  1837  1838  1839  1840  1841  1842  1843  1844  1845
11     6     5     1     7     6     7     2     5     6    21     9    11    11    18    19
1846  1847  1848  1849  1850  1851  1852  1853  1854  1855  1856  1857  1858  1859  1860  1861
8     9    13    19    43     8    14    12     6    13    11     6     7     9    35     3
1862  1863  1864  1865  1866  1867  1868  1869  1870  1871  1872  1873  1874  1875  1876  1877
2     3     2     8     6     4     8    12    57    19     5     4     4    15     2     1
1878  1879  1880  1881  1882  1883  1884  1885  1886  1887  1888  1889  1890  1891  1892  1893
3     3   160    49    11     8    12    25    18    15    18    17   427    49    13    14
1894  1895  1896  1897  1898  1899  1900  1901  1902  1903  1904  1905  1906  1907  1908  1909
12    51    42    19    29 24945  9606 25713   135   182   273  7634   755   651   351   878
1910  1911  1912  1913  1914  1915  1916  1917  1918  1919  1920  1921  1922  1923  1924  1925
46787   707   721   579   601 16451   650   517   254   309 91615  1158  1166  1636  2134 70860
1926  1927  1928  1929  1930  1931  1932  1933  1934  1935  1936  1937  1938  1939  1940  1941
3150  3277  4351  2133 77307 32678  1238   967   372 25473   642   667   763   968 38373   746
1942  1943  1944  1945  1946  1947  1948  1949  1950  1951  1952  1953  1954  1955  1956  1957
324   201   177 24982   343   366   604   750 47901   764   692   568   647 26912   764   949
1958  1959  1960  1961  1962  1963  1964  1965  1966  1967  1968  1969  1970  1971  1972  1973
810   927 37219   797   921  1100   947 19659   616   617   631   507 17149   499   631   850
1974  1975  1976  1977  1978  1979  1980  1981  1982  1983  1984  1985  1986  1987  1988  1989
855  9747   657   728  1004   628  5465   553   778  1163  1333  3478  3742  3803  3597  3612
1990  1991  1992  1993  1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005
3508  2264  3173  2460  2130  2560  3478  3165  2763  3727  4192  4544  4189  4450  6537  6016
2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2040
4858  4107  2923  1818  1204  1412  1567  1393  1704  1557    43     1


So we know we have one property that was built in the future (2040)! Argh. Let’s remove that.

all_boroughs<-all_boroughs %>% filter(YearBuilt <=2016)
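
The year filters above can also be collapsed into a single step. Here is a minimal sketch on a tiny made-up data frame (`toy` and its values are invented for illustration; only the column names match the real data), using dplyr's `between()` helper:

```r
library(dplyr)

# Hypothetical stand-in for all_boroughs: made-up rows, same column names
toy <- data.frame(
  YearBuilt = c(0, 1665, 1800, 1925, 2016, 2040),
  NumFloors = c(2, 1, 3, 6, 40, 2)
)

# between() keeps rows where 1800 <= YearBuilt <= 2016 (both ends inclusive),
# which drops the year-0 rows, the pre-1800 rows, and the 2040 row in one pass
toy_clean <- toy %>% filter(between(YearBuilt, 1800, 2016))

toy_clean$YearBuilt
```

Doing it in separate steps, as above, has the advantage that you can re-run `table()` in between and see what each filter removed.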


Let’s get an overview of storeys:

summary(all_boroughs$NumFloors)

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.000   2.000   2.000   2.435   2.500 119.000

table(all_boroughs$NumFloors)

0    0.5      1   1.01   1.08    1.1   1.15    1.2   1.25   1.33   1.35    1.4    1.5
1068      1  78843      1      1      2      1      2     50      3      1      1  23751
1.6   1.65   1.66   1.67    1.7   1.75   1.76   1.78    1.8   1.85   1.87    1.9   1.99
18      1      4   8620      1   5090      4      1      2     20      1      8      1
2    2.1    2.2   2.25    2.3   2.33    2.4   2.45    2.5   2.55    2.6   2.66   2.67
414901      1      1     77      5     20      2      1  81966      1      2      2    233
2.7   2.75   2.77    2.8   2.85   2.87    2.9   2.99      3   3.25   3.32   3.33    3.4
2  13443      1      1      3      1      1      1 119695      9      1      3      2
3.5    3.6   3.67   3.75      4   4.25    4.3   4.33    4.5   4.75      5   5.09   5.25
1030      2     12    148  27417      6      9      1    231     17  16639      1      1
5.5   5.75      6   6.25    6.5   6.75      7    7.5      8    8.5      9    9.5     10
62      4  12166      4     40      1   2199     12   1069      4    699      5    550
10.5     11   11.5     12   12.5     13   13.5     14   14.5     15   15.5     16   16.5
4    430      3    913      3    480      1    424      2    412      3    481      6
17     18     19     20   20.5     21     22     23     24     25     26     27     28
232    172    209    284      1    188    109     90     85     77     62     59     54
28.5     29     30     31     32     33     34     35     36     37     38     39     40
1     37     65     60     65     55     36     50     37     23     27     29     29
41     42     43     44     45     46     47     48     49     50     51     52     53
29     38     23     20     20     14     11     14      7     15     14      8      9
54     55     56     57     58     59     60   60.5     61     62     63     64     65
10      1      2     10      7      4      4      1      2      1      2      2      1
66     67     68     69     70     71     73     75     76     77     78     82     85
5      1      2      1      5      1      7      1      1      1      2      1      1
88     90    104    114    119
4      2      1      1      1


We’ve got some “0” floors and one “half a floor”. Remove those bogus data.

all_boroughs<-all_boroughs %>% filter(NumFloors >= 1)


We want to get a quick, understandable grasp of when NYC buildings of certain types really took off (when did 30+ storey buildings become popular?). We’re interested in quick understanding of trends, NOT numerical precision.

I think we want to use spark lines by tens of storeys. Grateful to http://stackoverflow.com/questions/35434760/sparklines-in-ggplot2 for the idea and implementation details!

Let’s create a new column that determines the general height in tens of floors, so we have a value for 0-9 floors, 10-19, 20-29, etc. First, divide the # of floors by 10 and just keep the whole number, discarding any fractional part (so, 3 floors gives us 0, and 15 floors gives us 1). This gives us some categories.

all_boroughs$height_type<-trunc(all_boroughs$NumFloors/10)
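
To see what that truncation does, here is a quick sketch on a handful of made-up floor counts (the `floors` vector is invented for illustration, including fractional values like the ones in the real data):

```r
# Made-up floor counts, including fractional ones
floors <- c(1, 3, 9.5, 10, 15, 28.5, 60, 119)

# Divide by 10 and drop the fractional part
height_bucket <- trunc(floors / 10)

# Pair each floor count with its bucket for inspection
data.frame(floors, height_bucket)
```

Note that very tall buildings land in buckets above 6 (119 floors gives 11), which is why the labelling step needs a catch-all for everything at 6 and up.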


Then give understandable labels to each type.

all_boroughs$height_type[which(all_boroughs$height_type>=6)]<-"60+"
all_boroughs$height_type[which(all_boroughs$height_type==0)]<-"0-9"
all_boroughs$height_type[which(all_boroughs$height_type==1)]<-"10-19"
all_boroughs$height_type[which(all_boroughs$height_type==2)]<-"20-29"
all_boroughs$height_type[which(all_boroughs$height_type==3)]<-"30-39"
all_boroughs$height_type[which(all_boroughs$height_type==4)]<-"40-49"
all_boroughs$height_type[which(all_boroughs$height_type==5)]<-"50-59"
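
As an aside, base R's `cut()` can do the bucketing and the labelling in one call. This is an alternative sketch, not what the tutorial uses, and the `floors` vector below is made up for illustration:

```r
# Made-up floor counts
floors <- c(1, 9.5, 10, 25, 59, 60, 119)

height_type <- cut(
  floors,
  breaks = c(0, 10, 20, 30, 40, 50, 60, Inf),
  labels = c("0-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60+"),
  right  = FALSE  # intervals are [0,10), [10,20), ... so 10 lands in "10-19"
)

as.character(height_type)
```

One difference worth knowing: `cut()` returns a factor whose levels are in the order given, whereas the step-by-step approach above produces a plain character column.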


Get a basic table that shows, for each year and height type, how many buildings we have.

We’re going to use two dplyr tools here. The first is “group_by”, which, well, groups data. The second is “summarise” (or “summarize” – you can use either English variant), which computes summary statistics like the count, mean, etc. on a per-group basis. The %>% operator, which dplyr re-exports from the magrittr package, chains commands together, so that the output of one becomes the input to the next.

storey_trend<-all_boroughs %>%
group_by(height_type, YearBuilt) %>%
summarise(count=n())
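
Here is the same group-and-count pattern on a five-row made-up data frame (`toy` is invented for illustration), small enough to check by eye. The sketch assumes dplyr 1.0 or later for the `.groups` argument, and also shows dplyr's `count()` shortcut, which does the grouping and counting in one call:

```r
library(dplyr)

# Made-up rows: three short buildings and two mid-rise ones
toy <- data.frame(
  height_type = c("0-9", "0-9", "0-9", "10-19", "10-19"),
  YearBuilt   = c(1900, 1900, 1901, 1900, 1900)
)

# Group, then count the rows in each (height_type, YearBuilt) group
by_hand <- toy %>%
  group_by(height_type, YearBuilt) %>%
  summarise(count = n(), .groups = "drop")

# Equivalent shortcut
shortcut <- toy %>% count(height_type, YearBuilt, name = "count")

by_hand
```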


Take a quick peek at our summary table – much more manageable at only 646 rows!

head(storey_trend)

Source: local data frame [6 x 3]
Groups: height_type [1]

height_type YearBuilt count
<chr>     <int> <int>
1         0-9      1800   164
2         0-9      1801     1
3         0-9      1802     1
4         0-9      1804     1
5         0-9      1805     1
6         0-9      1812     1

tail(storey_trend)

Source: local data frame [6 x 3]
Groups: height_type [1]

height_type YearBuilt count
<chr>     <int> <int>
1         60+      2008     3
2         60+      2009     2
3         60+      2012     2
4         60+      2013     3
5         60+      2014     6
6         60+      2015    16


Plot some spark lines, using the count.

ggplot wants (data_you_use, aes(elements_from_that_data_to_plot)). Then you have to add layers to say how to plot it. The plus sign allows you to add layers. So we start simply:

ggplot(storey_trend, aes(x=YearBuilt, y=count) ) +


We want to divide up our graph into several different graphs, one per kind of height (tens of floors). For that, we’ll use facet_grid.

Facet grid wants to know how you want to plot it by rows ~ columns. We want a row each for each kind of height, and just one column that includes “everything” (aka the dot, which usually means “everything else” when it’s after the tilde ~ symbol). Facet grid also wants to know your scales: do you have the same x and y scale, or can each graph have its own? In our case, we want the time, on the X-axis, to be the same, but we don’t need y to match. Y can be “free”:

facet_grid(height_type ~ ., scales = "free_y") +


But we still haven’t told ggplot what kind of graph we want! Line? Point? Histogram? Boxplot? What geometric element (geom) do we want?

geom_line()
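
Put together, the three pieces above form one complete plot call. Here is a self-contained sketch you can run without downloading anything; `toy_trend` and its counts are made up, chosen so the two categories sit on very different scales:

```r
library(ggplot2)

# Made-up counts for two height categories on very different scales
toy_trend <- data.frame(
  height_type = rep(c("0-9", "60+"), each = 4),
  YearBuilt   = rep(c(1900, 1950, 2000, 2010), times = 2),
  count       = c(25000, 48000, 4200, 1200, 0, 1, 4, 9)
)

p <- ggplot(toy_trend, aes(x = YearBuilt, y = count)) +
  facet_grid(height_type ~ ., scales = "free_y") +  # one row per height type
  geom_line()

p
```

Try changing `scales = "free_y"` to `scales = "fixed"`: the “60+” line should flatten against the bottom of its panel, which is exactly the problem the free y scale avoids.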


Ok, this is good, but the time series is too jagged. We should have our unit years-in-decades, maybe. Let’s try that! We can do a similar thing to what we did with floors – lop off the last digit of the year and replace it with zero to get the decade.

all_boroughs$decade<-trunc(all_boroughs$YearBuilt/10)*10
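
A quick check of the decade arithmetic on a few made-up years (the `years` vector is invented for illustration). The sketch also shows R's integer-division operator `%/%`, which gives the same result for these values:

```r
# Made-up years
years <- c(1899, 1900, 1905, 1999, 2016)

# Divide by 10, drop the fraction, multiply back up
decade_trunc <- trunc(years / 10) * 10

# Same idea using integer division
decade_intdiv <- (years %/% 10) * 10

data.frame(years, decade_trunc, decade_intdiv)
```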


Get a basic table that shows, for each decade and height type, how many were built:

storey_trend_decade<-all_boroughs %>%
group_by(height_type, decade) %>%
summarise(count=n())


Plot some spark lines, using the count:

ggplot(storey_trend_decade, aes(x=decade, y=count) ) +
facet_grid(height_type ~ ., scales = "free_y") +
geom_line()


This is much more understandable! Let’s make the plot look nicer. But first, how about some 1st-3rd quartile sectioning, to show where building rates were on the low / high side? We can compute the first and third quartiles of the per-decade counts within each height type, then use them to draw a horizontal ribbon in each facet of our graph.

storey_trend_decade <-  storey_trend_decade %>%
group_by(height_type) %>%
summarize(quart1 = quantile(count, 0.25),
quart3 = quantile(count, 0.75)) %>%
right_join(storey_trend_decade, by = "height_type")

head(storey_trend_decade)

# A tibble: 6 x 5
height_type quart1 quart3 decade count
<chr>  <dbl>  <dbl>  <dbl> <int>
1         0-9 130.25  58017   1800   168
2         0-9 130.25  58017   1810     3
3         0-9 130.25  58017   1820    39
4         0-9 130.25  58017   1830    56
5         0-9 130.25  58017   1840   137
6         0-9 130.25  58017   1850   128

tail(storey_trend_decade)

# A tibble: 6 x 5
height_type quart1 quart3 decade count
<chr>  <dbl>  <dbl>  <dbl> <int>
1         60+    1.5      8   1930     5
2         60+    1.5      8   1940     1
3         60+    1.5      8   1970     1
4         60+    1.5      8   1980     4
5         60+    1.5      8   2000    11
6         60+    1.5      8   2010    27
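
The shape of that table (every decade row carrying its category's two quartile values) can be reproduced on a small made-up data frame. In this sketch `toy_decades` and its counts are invented for illustration; it shows a join-based version and a `mutate()`-based version that should give the same columns:

```r
library(dplyr)

# Made-up per-decade counts for two height categories
toy_decades <- data.frame(
  height_type = rep(c("0-9", "60+"), each = 5),
  decade      = rep(seq(1960, 2000, 10), times = 2),
  count       = c(100, 200, 300, 400, 500, 1, 2, 3, 4, 5)
)

# One row per height_type holding its 25th and 75th percentile of count
quartiles <- toy_decades %>%
  group_by(height_type) %>%
  summarise(quart1 = quantile(count, 0.25, names = FALSE),
            quart3 = quantile(count, 0.75, names = FALSE))

# Join back so every decade row carries its category's quartiles
joined <- left_join(toy_decades, quartiles, by = "height_type")

# Same values without a join: mutate() repeats each group's single value
mutated <- toy_decades %>%
  group_by(height_type) %>%
  mutate(quart1 = quantile(count, 0.25, names = FALSE),
         quart3 = quantile(count, 0.75, names = FALSE)) %>%
  ungroup()

joined
```

Because the quartiles are constant within a height type, the ribbon drawn from them is a flat horizontal band in each facet.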


Great, let’s do the new ggplot!

First line: data we’re using and “aesthetics” (variables) we’re using to plot

ggplot(storey_trend_decade, aes(x=decade, y=count) ) +

Sectioning our graph into several independent graphs based on a category

facet_grid(height_type ~ ., scales = "free_y") +

Add a layer (a ribbon) that covers the area from 1Q to 3Q for each facet. Label this ribbon in the legend with “1-3 Quartile”.

geom_ribbon(aes(ymin = quart1, ymax = quart3, fill="1-3 Quartile")) +

And make sure that ribbon is a light grey.

scale_fill_manual(values=c('#eeeeee')) +

On top of that ribbon, plot the line (just as we did before!)

geom_line() +

Instead of the typical ggplot grey background, we want the “black and white” theme, which has some cosmetic improvements, in my opinion.

theme_bw() +

We want to label more years than the default every-fifty-years marks, and to fit them in, we’ll have to tilt our X-axis labels:

theme(axis.text.x = element_text(angle = 60, hjust = 1)) +

Which labels do we want? A label for every single decade we have in our data. If it’s missing in our data, leave it out of the axis labels, too.

scale_x_continuous(breaks = unique(storey_trend_decade$decade)) +

But with so many years, it can be hard to follow a single light line all the way up from the bottom of the graph to the top facet. Let’s plot a few vertical lines in a transparent red to help guide the eye. We’ll do them from 1800 to 2000, in 50 year increments.

geom_vline(xintercept=seq(1800, 2000, 50), color="red", alpha=0.3) +

We don’t care about the y values at all, they just clutter the graph, so remove the y axis labels altogether!

scale_y_continuous(breaks = NULL) +

Give the graph a good title, but remove the x and y axis labels, because the title will explain everything quite well.

labs(title="Structures Built in NYC\nby Decade and Number of Floors", x="", y="") +

Finally, remove the legend title also:

theme(legend.title=element_blank())

Looks great!
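
For reference, here is the whole layered plot assembled into one runnable call. This is a sketch on made-up data, not the real NYC table: `toy`, its counts, its quartile values, the guide-line positions, and the title are all invented so the block runs without the 800k-row download.

```r
library(ggplot2)

# Made-up per-decade counts with quartile columns already attached
toy <- data.frame(
  height_type = rep(c("0-9", "60+"), each = 5),
  decade      = rep(seq(1960, 2000, 10), times = 2),
  count       = c(100, 200, 300, 400, 500, 1, 2, 3, 4, 5),
  quart1      = rep(c(200, 2), each = 5),
  quart3      = rep(c(400, 4), each = 5)
)

p <- ggplot(toy, aes(x = decade, y = count)) +
  facet_grid(height_type ~ ., scales = "free_y") +           # one row per category
  geom_ribbon(aes(ymin = quart1, ymax = quart3,
                  fill = "1-3 Quartile")) +                  # quartile band
  scale_fill_manual(values = c("#eeeeee")) +                 # light grey band
  geom_line() +                                              # the sparkline itself
  theme_bw() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1),   # tilted x labels
        legend.title = element_blank()) +                    # no legend title
  scale_x_continuous(breaks = unique(toy$decade)) +          # label every decade
  geom_vline(xintercept = c(1960, 2000),
             color = "red", alpha = 0.3) +                   # guide lines
  scale_y_continuous(breaks = NULL) +                        # hide y ticks
  labs(title = "Toy Data by Decade and Number of Floors", x = "", y = "")

p
```

Swapping `toy` for the real `storey_trend_decade` table, and the guide lines for `seq(1800, 2000, 50)`, should give the plot built up step by step above.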