Sparklines in ggplot2
Creating a Sparkline Visualization in ggplot2
Here’s how to build a faceted ggplot2 data visualization (multiple graphs, each showing one “facet” of the data by some category) that uses a sparkline approach. The final product will look like this:
This was from an assignment in my grad data viz class and I think it could be useful for others as well. We want a very simplified, relative frequency graph that shows building trends (so, when 20+ floor buildings spiked, etc.). We’re not interested in comparing counts across height categories (for example, there are many more 1-10 floor buildings than 60+ floor buildings, but showing that comparison would mean our y scale would be very tall and detail would be hard to make out). Nope, we’re interested in relative change within categories. This tutorial will show you lots of detail about ggplot2 and how to use it well!
We load necessary libraries:
library(dplyr)
library(ggplot2)
Then we obtain data – I’ve already obtained this from NYC public data, limited it to just the year built and number of floors, and added it to my GitHub. It’s over 800k rows, so it takes a while to load!
all_boroughs<-read.csv("https://raw.githubusercontent.com/pm0kjp/datastore/master/NYC_building_data.csv")
First, let’s eyeball the data before moving further on:
table(all_boroughs$YearBuilt)
0 1661 1665 1706 1729 1765 1779 1780 1785 1798 1799 1800 1801 1802 1804 1805
43388 1 1 1 1 1 1 1 2 1 1 166 1 1 1 1
1812 1814 1816 1821 1822 1823 1824 1825 1826 1827 1829 1830 1831 1832 1833 1834
1 1 1 2 3 1 5 3 5 3 20 11 6 5 1 7
1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850
6 7 2 5 6 21 9 11 11 18 19 8 9 13 19 43
1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866
8 14 12 6 13 11 6 7 9 35 3 2 3 2 8 6
1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882
4 8 12 57 19 5 4 4 15 2 1 3 3 160 49 11
1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898
8 12 25 18 15 18 17 427 49 13 14 12 51 42 19 29
1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914
24945 9606 25713 135 182 273 7634 755 651 351 878 46787 707 721 579 601
1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930
16451 650 517 254 309 91615 1158 1166 1636 2134 70860 3150 3277 4351 2133 77307
1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946
32678 1238 967 372 25473 642 667 763 968 38373 746 324 201 177 24982 343
1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962
366 604 750 47901 764 692 568 647 26912 764 949 810 927 37219 797 921
1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978
1100 947 19659 616 617 631 507 17149 499 631 850 855 9747 657 728 1004
1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994
628 5465 553 778 1163 1333 3478 3742 3803 3597 3612 3508 2264 3173 2460 2130
1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
2560 3478 3165 2763 3727 4192 4544 4189 4450 6537 6016 4858 4107 2923 1818 1204
2011 2012 2013 2014 2015 2016 2040
1412 1567 1393 1704 1557 43 1
Sheesh, we have a lot built in year “0”. Let’s remove these!
all_boroughs<-all_boroughs %>% filter(YearBuilt!=0)
That’s better. Now let’s limit to post-1800, because we don’t care about really old building trends!
all_boroughs<-all_boroughs %>% filter (YearBuilt >=1800)
Let's look again.
table(all_boroughs$YearBuilt)
1800 1801 1802 1804 1805 1812 1814 1816 1821 1822 1823 1824 1825 1826 1827 1829
166 1 1 1 1 1 1 1 2 3 1 5 3 5 3 20
1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845
11 6 5 1 7 6 7 2 5 6 21 9 11 11 18 19
1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861
8 9 13 19 43 8 14 12 6 13 11 6 7 9 35 3
1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877
2 3 2 8 6 4 8 12 57 19 5 4 4 15 2 1
1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893
3 3 160 49 11 8 12 25 18 15 18 17 427 49 13 14
1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909
12 51 42 19 29 24945 9606 25713 135 182 273 7634 755 651 351 878
1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925
46787 707 721 579 601 16451 650 517 254 309 91615 1158 1166 1636 2134 70860
1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941
3150 3277 4351 2133 77307 32678 1238 967 372 25473 642 667 763 968 38373 746
1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957
324 201 177 24982 343 366 604 750 47901 764 692 568 647 26912 764 949
1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973
810 927 37219 797 921 1100 947 19659 616 617 631 507 17149 499 631 850
1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
855 9747 657 728 1004 628 5465 553 778 1163 1333 3478 3742 3803 3597 3612
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005
3508 2264 3173 2460 2130 2560 3478 3165 2763 3727 4192 4544 4189 4450 6537 6016
2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2040
4858 4107 2923 1818 1204 1412 1567 1393 1704 1557 43 1
So we know we have one property that was built in the future (2040)! Argh. Let’s remove that.
all_boroughs<-all_boroughs %>% filter(YearBuilt <=2016)
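If you'd rather do the year clean-up in one go, filter() accepts several comma-separated conditions and keeps only rows that satisfy all of them. Here's a minimal sketch on a made-up data frame (toy_buildings and its values are invented for illustration, not taken from the NYC data):

```r
library(dplyr)

# Hypothetical toy data standing in for all_boroughs
toy_buildings <- data.frame(
  YearBuilt = c(0, 1665, 1800, 1925, 2016, 2040),
  NumFloors = c(2, 1, 3, 12, 40, 5)
)

# Keep only rows with a plausible construction year (1800 through 2016).
# The year-0 rows are dropped too, since 0 is below 1800.
toy_clean <- toy_buildings %>%
  filter(YearBuilt >= 1800, YearBuilt <= 2016)

toy_clean$YearBuilt
```

Doing it step by step, as above, has the advantage that you can eyeball the table between each filter.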
Let’s get an overview of storeys:
summary(all_boroughs$NumFloors)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   2.000   2.000   2.435   2.500 119.000
table(all_boroughs$NumFloors)
0 0.5 1 1.01 1.08 1.1 1.15 1.2 1.25 1.33 1.35 1.4 1.5
1068 1 78843 1 1 2 1 2 50 3 1 1 23751
1.6 1.65 1.66 1.67 1.7 1.75 1.76 1.78 1.8 1.85 1.87 1.9 1.99
18 1 4 8620 1 5090 4 1 2 20 1 8 1
2 2.1 2.2 2.25 2.3 2.33 2.4 2.45 2.5 2.55 2.6 2.66 2.67
414901 1 1 77 5 20 2 1 81966 1 2 2 233
2.7 2.75 2.77 2.8 2.85 2.87 2.9 2.99 3 3.25 3.32 3.33 3.4
2 13443 1 1 3 1 1 1 119695 9 1 3 2
3.5 3.6 3.67 3.75 4 4.25 4.3 4.33 4.5 4.75 5 5.09 5.25
1030 2 12 148 27417 6 9 1 231 17 16639 1 1
5.5 5.75 6 6.25 6.5 6.75 7 7.5 8 8.5 9 9.5 10
62 4 12166 4 40 1 2199 12 1069 4 699 5 550
10.5 11 11.5 12 12.5 13 13.5 14 14.5 15 15.5 16 16.5
4 430 3 913 3 480 1 424 2 412 3 481 6
17 18 19 20 20.5 21 22 23 24 25 26 27 28
232 172 209 284 1 188 109 90 85 77 62 59 54
28.5 29 30 31 32 33 34 35 36 37 38 39 40
1 37 65 60 65 55 36 50 37 23 27 29 29
41 42 43 44 45 46 47 48 49 50 51 52 53
29 38 23 20 20 14 11 14 7 15 14 8 9
54 55 56 57 58 59 60 60.5 61 62 63 64 65
10 1 2 10 7 4 4 1 2 1 2 2 1
66 67 68 69 70 71 73 75 76 77 78 82 85
5 1 2 1 5 1 7 1 1 1 2 1 1
88 90 104 114 119
4 2 1 1 1
We’ve got some “0” floors and one “half a floor”. Remove those bogus data.
all_boroughs<-all_boroughs %>% filter(NumFloors >= 1)
We want to get a quick, understandable grasp of when NYC buildings of certain types really took off (when did 30+ storey buildings become popular?). We’re interested in quick understanding of trends, NOT numerical precision.
I think we want to use sparklines by tens of storeys. Grateful to http://stackoverflow.com/questions/35434760/sparklines-in-ggplot2 for the idea and implementation details!
Let’s create a new column that determines the general height in tens of floors, so we have a value for 0-9 floors, 10-19, 20-29, etc. First, divide the # of floors by 10 and just keep the whole number, discarding any fractional part (so, 3 floors gives us 0, and 15 floors gives us 1). This gives us some categories.
all_boroughs$height_type<-trunc(all_boroughs$NumFloors/10)
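To see what that division-and-truncation does, here's a tiny worked example on a handful of invented floor counts (the floors vector is just for illustration):

```r
# A few made-up floor counts, including fractional ones like those in the data
floors <- c(3, 9.5, 15, 28.5, 60, 119)

# Divide by 10 and drop the fractional part
tens <- trunc(floors / 10)

tens
```

Everything under 10 floors lands in bin 0, the teens in bin 1, and so on. Very tall buildings get bins like 11, which is why we lump everything from 6 upward together in the next step.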
Then give understandable labels to each type.
all_boroughs$height_type[which(all_boroughs$height_type>=6)]<-"60+"
all_boroughs$height_type[which(all_boroughs$height_type==0)]<-"0-9"
all_boroughs$height_type[which(all_boroughs$height_type==1)]<-"10-19"
all_boroughs$height_type[which(all_boroughs$height_type==2)]<-"20-29"
all_boroughs$height_type[which(all_boroughs$height_type==3)]<-"30-39"
all_boroughs$height_type[which(all_boroughs$height_type==4)]<-"40-49"
all_boroughs$height_type[which(all_boroughs$height_type==5)]<-"50-59"
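One thing to be aware of: the order of those assignments matters. The first one turns the whole column into text, so the ">=6" test has to happen while the values are still numbers; the later "==" tests only work because R quietly converts the number on the right to a string. If that feels fragile, base R's cut() can do the binning and labelling in one step. This is an alternative sketch, not what the tutorial uses, shown on an invented floors vector:

```r
# Made-up floor counts for illustration
floors <- c(1, 9.5, 10, 25, 59, 60, 119)

height_type <- cut(
  floors,
  breaks = c(0, 10, 20, 30, 40, 50, 60, Inf),
  labels = c("0-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60+"),
  right  = FALSE  # intervals are [lower, upper), so exactly 10 floors is "10-19"
)

as.character(height_type)
```

cut() returns a factor whose levels are in the order given, which also fixes the order of the facets later on.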
Get a basic table that shows, for each year and height type, how many buildings we have. We’re going to use two dplyr tools here. The first is “group_by”, which, well, groups data. The second is “summarise” (or “summarize” – you can use either English variant), which allows you to get summary statistics like the number, mean, etc. on a per-group basis. The %>% pipe (which dplyr re-exports from the magrittr package) allows you to chain together commands, so that the output of one becomes the input to the next.
storey_trend<-all_boroughs %>%
group_by(height_type, YearBuilt) %>%
summarise(count=n())
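If group_by and summarise are new to you, it can help to watch them work on something small enough to check by hand. Here's a sketch with an invented five-row data frame (toy is hypothetical, not part of the NYC data):

```r
library(dplyr)

# Five made-up buildings
toy <- data.frame(
  height_type = c("0-9", "0-9", "0-9", "10-19", "10-19"),
  YearBuilt   = c(1900, 1900, 1901, 1900, 1900)
)

# One row per (height_type, YearBuilt) combination, with the number of buildings
toy_trend <- toy %>%
  group_by(height_type, YearBuilt) %>%
  summarise(count = n()) %>%
  ungroup()

toy_trend
```

Two "0-9" buildings in 1900, one in 1901, and two "10-19" buildings in 1900: three rows in all. dplyr's count(toy, height_type, YearBuilt) gives the same tallies in one call, in a column named n.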
Take a quick peek at our summary table – much more manageable at only 646 rows!
head(storey_trend)
Source: local data frame [6 x 3]
Groups: height_type [1]
  height_type YearBuilt count
        <chr>     <int> <int>
1         0-9      1800   164
2         0-9      1801     1
3         0-9      1802     1
4         0-9      1804     1
5         0-9      1805     1
6         0-9      1812     1
tail(storey_trend)
Source: local data frame [6 x 3]
Groups: height_type [1]
  height_type YearBuilt count
        <chr>     <int> <int>
1         60+      2008     3
2         60+      2009     2
3         60+      2012     2
4         60+      2013     3
5         60+      2014     6
6         60+      2015    16
Plot some sparklines, using the count.
ggplot wants (data_you_use, aes(elements_from_that_data_to_plot)). Then you have to add layers to say how to plot it. The plus sign allows you to add layers. So we start simply:
ggplot(storey_trend, aes(x=YearBuilt, y=count) ) +
We want to divide up our graph into several different graphs, one per kind of height (tens of floors). For that, we’ll use facet_grid.
Facet grid wants to know how you want to plot it by rows ~ columns. We want a row each for each kind of height, and just one column that includes “everything” (aka the dot, which usually means “everything else” when it’s after the tilde ~ symbol). Facet grid also wants to know your scales: do you have the same x and y scale, or can each graph have its own? In our case, we want the time, on the X-axis, to be the same, but we don’t need y to match. Y can be “free”:
facet_grid(height_type ~ ., scales = "free_y") +
But we still haven’t told ggplot what kind of graph we want! Line? Point? Histogram? Boxplot? What geometric element (geom) do we want?
geom_line()
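Put together, the three pieces above form a single chained call. Here it is as one runnable sketch, using a small invented table (toy_trend and its numbers are made up) so you can try it without downloading the full data set:

```r
library(ggplot2)

# Hypothetical stand-in for storey_trend: two height types, four years each
toy_trend <- data.frame(
  height_type = rep(c("0-9", "10-19"), each = 4),
  YearBuilt   = rep(c(1900, 1910, 1920, 1930), times = 2),
  count       = c(500, 900, 700, 1200, 3, 8, 5, 12)
)

p <- ggplot(toy_trend, aes(x = YearBuilt, y = count)) +
  # one row of the grid per height type; each row gets its own y scale
  facet_grid(height_type ~ ., scales = "free_y") +
  geom_line()

print(p)  # draws the plot
```

Because of scales = "free_y", the "10-19" line fills its panel even though its counts are a hundredth of the "0-9" ones.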
Ok, this is good, but the time series is too jagged. We should have our unit years-in-decades, maybe. Let’s try that! We can do a similar thing to what we did with floors – lop off the last digit of the year and replace it with zero to get the decade.
all_boroughs$decade<-trunc(all_boroughs$YearBuilt/10)*10
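A quick check of that arithmetic on a few invented years (the years vector is just for illustration):

```r
years <- c(1899, 1900, 1965, 2016)

# Drop the last digit, then put a zero back in its place
decades <- trunc(years / 10) * 10
decades

# Integer division is an equivalent way to write the same thing
decades_alt <- (years %/% 10) * 10
```

So 1899 belongs to the 1890s and 1900 starts a new decade.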
Get a basic table that shows, for each decade and height type, how many were built:
storey_trend_decade<-all_boroughs %>%
group_by(height_type, decade) %>%
summarise(count=n())
Plot some sparklines, using the count:
ggplot(storey_trend_decade, aes(x=decade, y=count) ) +
facet_grid(height_type ~ ., scales = "free_y") +
geom_line()
This is much more understandable! Let’s make the plot look nicer. But first, how about some 1st-3rd quartile sectioning, to show where building rates were on the low / high side? We can use this data to create a horizontal ribbon in each facet of our graph.
storey_trend_decade <- storey_trend_decade %>%
group_by(height_type) %>%
summarize(quart1 = quantile(count, 0.25),
quart3 = quantile(count, 0.75)) %>%
right_join(storey_trend_decade, by = "height_type")
head(storey_trend_decade)
# A tibble: 6 x 5
  height_type quart1 quart3 decade count
  <chr>        <dbl>  <dbl>  <dbl> <int>
1 0-9         130.25  58017   1800   168
2 0-9         130.25  58017   1810     3
3 0-9         130.25  58017   1820    39
4 0-9         130.25  58017   1830    56
5 0-9         130.25  58017   1840   137
6 0-9         130.25  58017   1850   128
tail(storey_trend_decade)
# A tibble: 6 x 5
  height_type quart1 quart3 decade count
  <chr>        <dbl>  <dbl>  <dbl> <int>
1 60+            1.5      8   1930     5
2 60+            1.5      8   1940     1
3 60+            1.5      8   1970     1
4 60+            1.5      8   1980     4
5 60+            1.5      8   2000    11
6 60+            1.5      8   2010    27
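The summarize-then-join approach above works fine. Another way to get the same per-group quartile columns, which I'm offering only as an alternative sketch, is to use mutate() on the grouped data, so the quartiles are attached to every row without a join. Here it is on an invented table (toy_decade and its counts are made up):

```r
library(dplyr)

# Hypothetical stand-in for storey_trend_decade
toy_decade <- data.frame(
  height_type = rep(c("0-9", "60+"), each = 5),
  decade      = rep(seq(1900, 1940, 10), times = 2),
  count       = c(100, 200, 300, 400, 500, 1, 2, 3, 4, 5)
)

# Within each height type, compute the 1st and 3rd quartile of the counts
# and repeat them on every row of that group
toy_decade <- toy_decade %>%
  group_by(height_type) %>%
  mutate(quart1 = quantile(count, 0.25, names = FALSE),
         quart3 = quantile(count, 0.75, names = FALSE)) %>%
  ungroup()

toy_decade
```

Each "0-9" row ends up with quart1 = 200 and quart3 = 400, and each "60+" row with 2 and 4, which is exactly the shape geom_ribbon needs.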
Great, let’s do the new ggplot!
First line: data we’re using and “aesthetics” (variables) we’re using to plot
ggplot(storey_trend_decade, aes(x=decade, y=count) ) +
Sectioning our graph into several independent graphs based on a category
facet_grid(height_type ~ ., scales = "free_y") +
Add a layer (a ribbon) that covers the area from 1Q to 3Q for each facet. Label this ribbon in the legend with “1-3 Quartile”.
geom_ribbon(aes(ymin = quart1, ymax = quart3, fill="1-3 Quartile")) +
And make sure that ribbon is a light grey.
scale_fill_manual(values=c('#eeeeee')) +
On top of that ribbon, plot the line (just as we did before!)
geom_line() +
Instead of the typical ggplot grey background, we want the “black and white” theme, which has some cosmetic improvements, in my opinion.
theme_bw() +
We want more labels on the X-axis than the default every-fifty-years marks, and to fit them in, we’ll have to tilt our X-axis labels:
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
Which labels do we want? A label for every single decade we have in our data. If it’s missing in our data, leave it out of the axis labels, too.
scale_x_continuous(breaks = unique(storey_trend_decade$decade)) +
But with so many years, it can be hard to follow a single light line all the way up from the bottom of the graph to the top facet. Let’s plot a few vertical lines in a transparent red to help guide the eye. We’ll do them from 1800 to 2000, in 50 year increments.
geom_vline(xintercept=seq(1800, 2000, 50), color="red", alpha=0.3) +
We don’t care about the y values at all, they just clutter the graph, so remove the y axis labels altogether!
scale_y_continuous(breaks = NULL) +
Give the graph a good title, but remove the x and y axis labels, because the title will explain everything quite well.
labs(title="Structures Built in NYC\nby Decade and Number of Floors",
x="", y="") +
Finally, remove the legend title also:
theme(legend.title=element_blank())
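For reference, here are all of those layers assembled into one call. This is a sketch on a small invented table (toy, its counts and its quartile columns are made up, and the guide lines and title are adjusted to fit the toy decades), so it runs without the full NYC data:

```r
library(ggplot2)

# Hypothetical stand-in for storey_trend_decade, quartiles already attached
toy <- data.frame(
  height_type = rep(c("0-9", "60+"), each = 5),
  decade      = rep(seq(1900, 1940, 10), times = 2),
  count       = c(100, 200, 300, 400, 500, 1, 2, 3, 4, 5),
  quart1      = rep(c(200, 2), each = 5),
  quart3      = rep(c(400, 4), each = 5)
)

final_plot <- ggplot(toy, aes(x = decade, y = count)) +
  facet_grid(height_type ~ ., scales = "free_y") +
  # light grey band from the 1st to the 3rd quartile in each facet
  geom_ribbon(aes(ymin = quart1, ymax = quart3, fill = "1-3 Quartile")) +
  scale_fill_manual(values = c("#eeeeee")) +
  geom_line() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  # one x-axis label per decade present in the data
  scale_x_continuous(breaks = unique(toy$decade)) +
  # faint red vertical guides
  geom_vline(xintercept = seq(1900, 1940, 20), color = "red", alpha = 0.3) +
  # hide the y-axis ticks and their grid lines
  scale_y_continuous(breaks = NULL) +
  labs(title = "Toy Structures\nby Decade and Number of Floors",
       x = "", y = "") +
  theme(legend.title = element_blank())

print(final_plot)  # draws the plot
```

Swap toy for storey_trend_decade and restore the 1800 to 2000 guide lines to get the real thing.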
Looks great!