Mapping Environmental Exposures
This is a step-by-step code lab that allows you to create maps using publicly available data as well as data about your own subjects, when you know their census tracts (although the concept could be extended to any geographic areas, really, as long as you can find a geoJSON file that describes the geographic boundaries).
The goal is to give us better insights about how geographic location, and the things that come along with location (like neighborhood violence, proximity to air pollution, exposure to pathogen-carrying ticks, etc.) affects health. Below, I take real data from the City of Philadelphia and map it along with some fabricated research data, to show you how this might look for your own research.
Want to run this yourself? Feel free to download the complete code
Enriching Philadelphia GeoJSON with Shooting Data
Our research subjects and patients don’t exist in a vacuum – they have environmental exposures to things that promote health (like safe, walkable neighborhoods) and reduce health (like lead exposure or gun violence). There are lots of free, publicly available datasets about environmental exposures in various geographic locations (like Philadelphia specifically, the nation as a whole, etc.) In this document, we’re going to demonstrate how to create a map that shows the number of shootings in Philadelphia in the last few years by census tract, and combine that data with information about CHOP’s patients.
Why might you find this useful? Because perhaps you want to look at how your patients are responding to an exercise intervention for obesity and see if there’s a difference in compliance and weight loss success that could be related to violence near home that make it unsafe to play outside. Or you might want to map lead exposure and rates of ADHD or ER admissions from impulsive behavior. Perhaps there’s a difference in surgical complications related to poverty or food insecurity. Whether you choose to display your findings in a map or simply make inferences based on the data, understanding geographical data like pollution measurements, rates of poverty, areas of high traffic, etc., will help you understand your patients or subjects in an aggregated way, even if you don’t have specific data on their individual experiences.
If you want to run this code, everything you need is right here. You don’t have to use your own data, because I’m going to create fake research data to show you how to work with this kind of data. If you’re new to R, this may seem a bit overwhelming, but if you’ve had a bit of practice with data reshaping and the use of ggplot, you’ll be able to understand each step pretty well.
You’ll want to have a few packages installed. You may already have these, but if not, run the following:
install.packages("dplyr") install.packages("leaflet") install.packages("jsonlite") install.packages("ggplot2") install.packages("maptools") install.packages("sp") install.packages("rgdal") install.packages("scales")
Obtain Geographic Data
The City of Philadelphia supplies information about shootings (including officer-involved shootings) which includes data about the shooting victim and the location. Here, we’re really interested in the location of shootings over the past few years, to understand what parts of Philadelphia are more prone to this specific kind of violence.
To see more information about this dataset, please visit https://www.opendataphilly.org/dataset/shooting-victims/resource/a6240077-cbc7-46fb-b554-39417be606ee?inner_span=True.
For our purposes, we’re going to get the bare minimum of information: latitude, longitude, and shooting ID. The API endpoint is described in the link above and uses a SQL query to select only the data we care about. Because our query has spaces and other special characters, we need to “encode” it for request.
The data will come in as json (a file format that has any number of potentially nested key-value pairs).
library(jsonlite) url <- URLencode('https://www.opendataphilly.org/api/action/datastore_search_sql?sql=SELECT _id, lat, lng from "a6240077-cbc7-46fb-b554-39417be606ee"') shooting_data <- fromJSON(url)
We can examine the shooting data by using R’s
str (structure) command:
## List of 3 ## $ help : chr "Execute SQL queries on the DataStore.\n\n The datastore_search_sql action allows a user to search data in a "| __truncated__ ## $ success: logi TRUE ## $ result :List of 3 ## ..$ records:'data.frame': 2500 obs. of 3 variables: ## .. ..$ lat: chr [1:2500] "39.922968633" "40.0296637640001" "40.037804839" "39.9682945860001" ... ## .. ..$ lng: chr [1:2500] "-75.18870837" "-75.1205859089999" "-75.139286339" "-75.223324033" ... ## .. ..$ _id: int [1:2500] 1 2 3 4 5 6 7 8 9 10 ... ## ..$ fields :'data.frame': 3 obs. of 2 variables: ## .. ..$ type: chr [1:3] "int4" "numeric" "numeric" ## .. ..$ id : chr [1:3] "_id" "lat" "lng" ## ..$ sql : chr "SELECT _id, lat, lng from \"a6240077-cbc7-46fb-b554-39417be606ee\""
Here we see that we have a nested data frame, accessible at
## lat lng _id ## 1 39.922968633 -75.18870837 1 ## 2 40.0296637640001 -75.1205859089999 2 ## 3 40.037804839 -75.139286339 3 ## 4 39.9682945860001 -75.223324033 4 ## 5 39.995426509 -75.126076954 5 ## 6 40.0429494850001 -75.1617552229999 6
If we wanted to, we could easily create a map of these shootings, just based on latitude and longitude. Since latitude and longitude are currently in “chr” (character) format, we’ll make them numeric so that we can do math on them. We’ll create a map that’s centered on the mean latitude and longitude of all our shootings, and which is appropriately zoomed in (you might have to experiment with the zoom factor).
The result is a clickable, zoomable map. Leaflet is a great way to do quick visualization of geographic data. We’ll come back to Leaflet again at the end, but for now we’ll just take a quick look.
Go ahead, click on the clusters to zoom in!
library(leaflet) library(dplyr) shootings <- shooting_data$result$records shootings$lat <- as.numeric(shootings$lat) shootings$lng <- as.numeric(shootings$lng) shootings %>% leaflet() %>% addTiles() %>% setView(lng = mean(as.numeric(shootings$lng), na.rm=TRUE), lat = mean(as.numeric(shootings$lat), na.rm=TRUE), zoom = 10) %>% addMarkers(clusterOptions = markerClusterOptions())
What’s more likely, however, is that we want to use polygon data to create a map that shows how much a particular area is affected. This is because we want to create combined data – we want to put information about our patients or research subjects along with the level of violence they are exposed to. Instead of using latitude and longitude, we’ll gather the number of shootings per Census tract, which we can then use as a proxy for violence exposure for the patients and subjects who live in that Census tract. It’s a sort of “binning”, but using the existing “bins” of Census tracts.
How do we obtain the Census tract for each shooting, using latitude and longitude? We can use a Census tract “shapefile” or a “geojson” file. These are both standard data formats that describe a map by listing polygons, each vertex of which is a lat/long coordinate. Using one of these kinds of files, we can map our locations (in our case, shootings) to the correct polygon.
We get the geojson file from Philadelphia’s Open Data website. This file not only contains the polygon definitions, but also some information about each polygon, like the Census tract name, what state the tract belongs to, several identifiers for that tract, the geographic center of the tract, and so on.
library(rgdal) philadelphiaCensusTracts <- readOGR("http://data.phl.opendata.arcgis.com/datasets/8bc0786524a4486bb3cf0f9862ad0fbf_0.geojson")
## OGR data source with driver: GeoJSON ## Source: "http://data.phl.opendata.arcgis.com/datasets/8bc0786524a4486bb3cf0f9862ad0fbf_0.geojson", layer: "OGRGeoJSON" ## with 384 features ## It has 14 fields
philadelphiaCensusBlocks <- readOGR("http://data.phl.opendata.arcgis.com/datasets/e9e2e152bc1644e2af84927a8f4c3c06_0.geojson")
## OGR data source with driver: GeoJSON ## Source: "http://data.phl.opendata.arcgis.com/datasets/e9e2e152bc1644e2af84927a8f4c3c06_0.geojson", layer: "OGRGeoJSON" ## with 18872 features ## It has 18 fields
Let’s take a peek at the data that comes along with our geojson!
## OBJECTID STATEFP10 COUNTYFP10 TRACTCE10 GEOID10 NAME10 ## 0 1 42 101 009400 42101009400 94 ## 1 2 42 101 009500 42101009500 95 ## 2 3 42 101 009600 42101009600 96 ## 3 4 42 101 013800 42101013800 138 ## 4 5 42 101 013900 42101013900 139 ## 5 6 42 101 014000 42101014000 140 ## NAMELSAD10 MTFCC10 FUNCSTAT10 ALAND10 AWATER10 INTPTLAT10 ## 0 Census Tract 94 G5020 S 366717 0 +39.9632709 ## 1 Census Tract 95 G5020 S 319070 0 +39.9658709 ## 2 Census Tract 96 G5020 S 405273 0 +39.9655396 ## 3 Census Tract 138 G5020 S 341256 0 +39.9764504 ## 4 Census Tract 139 G5020 S 562934 0 +39.9750563 ## 5 Census Tract 140 G5020 S 439802 0 +39.9735358 ## INTPTLON10 LOGRECNO ## 0 -075.2322437 10429 ## 1 -075.2379140 10430 ## 2 -075.2435075 10431 ## 3 -075.1771771 10468 ## 4 -075.1711846 10469 ## 5 -075.1630966 10470
Option 1: Static Maps
We can specify Census tract by various methods – by the full code number
GEOID10, by the short code
NAME10, or by the human-friendly name
NAMELSAD10. For now, I’ll use the Census short code.
I’ll get that complex data type, then make a long data frame of its geography which I’ll call
philadelphiaCensusTracts_fortified, which will set a “group” (region) equal to the
NAME10 aspect. This allows me to easily join data to it later on.
library(broom) philadelphiaCensusTracts_fortified <- tidy(philadelphiaCensusTracts, region = "NAME10")
What does this geojson data “look” like, when we plot the various polygons it includes? Each polygon represents a Census tract from the 2010 Census.
library(ggplot2) library(ggmap) philly_plain <- ggplot() + geom_polygon(data=philadelphiaCensusTracts_fortified, aes(x=long, y=lat, group=group, fill=NA), color = "black", fill=NA, size=0.1) + coord_map() + theme_nothing() print(philly_plain)