Mapping Environmental Exposures

This is a step-by-step code lab that allows you to create maps using publicly available data as well as data about your own subjects, when you know their census tracts (although the concept could be extended to any geographic areas, really, as long as you can find a geoJSON file that describes the geographic boundaries).

The goal is to give us better insights about how geographic location, and the things that come along with location (like neighborhood violence, proximity to air pollution, exposure to pathogen-carrying ticks, etc.) affects health. Below, I take real data from the City of Philadelphia and map it along with some fabricated research data, to show you how this might look for your own research.

Want to run this yourself? Feel free to download the complete code


Our research subjects and patients don’t exist in a vacuum – they have environmental exposures to things that promote health (like safe, walkable neighborhoods) and reduce health (like lead exposure or gun violence). There are lots of free, publicly available datasets about environmental exposures in various geographic locations (like Philadelphia specifically, the nation as a whole, etc.) In this document, we’re going to demonstrate how to create a map that shows the number of shootings in Philadelphia in the last few years by census tract, and combine that data with information about CHOP’s patients.

Why might you find this useful? Because perhaps you want to look at how your patients are responding to an exercise intervention for obesity and see if there’s a difference in compliance and weight loss success that could be related to violence near home that make it unsafe to play outside. Or you might want to map lead exposure and rates of ADHD or ER admissions from impulsive behavior. Perhaps there’s a difference in surgical complications related to poverty or food insecurity. Whether you choose to display your findings in a map or simply make inferences based on the data, understanding geographical data like pollution measurements, rates of poverty, areas of high traffic, etc., will help you understand your patients or subjects in an aggregated way, even if you don’t have specific data on their individual experiences.

If you want to run this code, everything you need is right here. You don’t have to use your own data, because I’m going to create fake research data to show you how to work with this kind of data. If you’re new to R, this may seem a bit overwhelming, but if you’ve had a bit of practice with data reshaping and the use of ggplot, you’ll be able to understand each step pretty well.


You’ll want to have a few packages installed. You may already have these, but if not, run the following:


Obtain Geographic Data

The City of Philadelphia supplies information about shootings (including officer-involved shootings) which includes data about the shooting victim and the location. Here, we’re really interested in the location of shootings over the past few years, to understand what parts of Philadelphia are more prone to this specific kind of violence.

To see more information about this dataset, please visit

For our purposes, we’re going to get the bare minimum of information: latitude, longitude, and shooting ID. The API endpoint is described in the link above and uses a SQL query to select only the data we care about. Because our query has spaces and other special characters, we need to “encode” it for request.

The data will come in as json (a file format that has any number of potentially nested key-value pairs).

url <- URLencode(' _id, lat, lng from "a6240077-cbc7-46fb-b554-39417be606ee"')
shooting_data <- fromJSON(url)

We can examine the shooting data by using R’s str (structure) command:

## List of 3
##  $ help   : chr "Execute SQL queries on the DataStore.\n\n    The datastore_search_sql action allows a user to search data in a "| __truncated__
##  $ success: logi TRUE
##  $ result :List of 3
##   ..$ records:'data.frame':  2500 obs. of  3 variables:
##   .. ..$ lat: chr [1:2500] "39.922968633" "40.0296637640001" "40.037804839" "39.9682945860001" ...
##   .. ..$ lng: chr [1:2500] "-75.18870837" "-75.1205859089999" "-75.139286339" "-75.223324033" ...
##   .. ..$ _id: int [1:2500] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ fields :'data.frame':  3 obs. of  2 variables:
##   .. ..$ type: chr [1:3] "int4" "numeric" "numeric"
##   .. ..$ id  : chr [1:3] "_id" "lat" "lng"
##   ..$ sql    : chr "SELECT _id, lat, lng from \"a6240077-cbc7-46fb-b554-39417be606ee\""

Here we see that we have a nested data frame, accessible at shooting_data$result$records:

head(shooting_data$result$records, 6)
##                lat               lng _id
## 1     39.922968633      -75.18870837   1
## 2 40.0296637640001 -75.1205859089999   2
## 3     40.037804839     -75.139286339   3
## 4 39.9682945860001     -75.223324033   4
## 5     39.995426509     -75.126076954   5
## 6 40.0429494850001 -75.1617552229999   6

Mapping Points

If we wanted to, we could easily create a map of these shootings, just based on latitude and longitude. Since latitude and longitude are currently in “chr” (character) format, we’ll make them numeric so that we can do math on them. We’ll create a map that’s centered on the mean latitude and longitude of all our shootings, and which is appropriately zoomed in (you might have to experiment with the zoom factor).

The result is a clickable, zoomable map. Leaflet is a great way to do quick visualization of geographic data. We’ll come back to Leaflet again at the end, but for now we’ll just take a quick look.


shootings <- shooting_data$result$records
shootings$lat <- as.numeric(shootings$lat)
shootings$lng <- as.numeric(shootings$lng)

shootings %>% 
  leaflet() %>% 
  addTiles() %>%
  setView(lng = mean(as.numeric(shootings$lng), na.rm=TRUE), 
          lat = mean(as.numeric(shootings$lat), na.rm=TRUE), zoom = 10) %>%
  addMarkers(clusterOptions = markerClusterOptions())

Go ahead, click on the clusters to zoom in!

Mapping Polygons

What’s more likely, however, is that we want to use polygon data to create a map that shows how much a particular area is affected. This is because we want to create combined data – we want to put information about our patients or research subjects along with the level of violence they are exposed to. Instead of using latitude and longitude, we’ll gather the number of shootings per Census tract, which we can then use as a proxy for violence exposure for the patients and subjects who live in that Census tract. It’s a sort of “binning”, but using the existing “bins” of Census tracts.

How do we obtain the Census tract for each shooting, using latitude and longitude? We can use a Census tract “shapefile” or a “geojson” file. These are both standard data formats that describe a map by listing polygons, each vertex of which is a lat/long coordinate. Using one of these kinds of files, we can map our locations (in our case, shootings) to the correct polygon.

We get the geojson file from Philadelphia’s Open Data website. This file not only contains the polygon definitions, but also some information about each polygon, like the Census tract name, what state the tract belongs to, several identifiers for that tract, the geographic center of the tract, and so on.

philadelphiaCensusTracts <- readOGR("")
## OGR data source with driver: GeoJSON 
## Source: "", layer: "OGRGeoJSON"
## with 384 features
## It has 14 fields
philadelphiaCensusBlocks <- readOGR("")
## OGR data source with driver: GeoJSON 
## Source: "", layer: "OGRGeoJSON"
## with 18872 features
## It has 18 fields

Let’s take a peek at the data that comes along with our geojson!

## 0        1        42        101    009400 42101009400     94
## 1        2        42        101    009500 42101009500     95
## 2        3        42        101    009600 42101009600     96
## 3        4        42        101    013800 42101013800    138
## 4        5        42        101    013900 42101013900    139
## 5        6        42        101    014000 42101014000    140
## 0  Census Tract 94   G5020          S  366717        0 +39.9632709
## 1  Census Tract 95   G5020          S  319070        0 +39.9658709
## 2  Census Tract 96   G5020          S  405273        0 +39.9655396
## 3 Census Tract 138   G5020          S  341256        0 +39.9764504
## 4 Census Tract 139   G5020          S  562934        0 +39.9750563
## 5 Census Tract 140   G5020          S  439802        0 +39.9735358
## 0 -075.2322437    10429
## 1 -075.2379140    10430
## 2 -075.2435075    10431
## 3 -075.1771771    10468
## 4 -075.1711846    10469
## 5 -075.1630966    10470

Option 1: Static Maps

We can specify Census tract by various methods – by the full code number GEOID10, by the short code NAME10, or by the human-friendly name NAMELSAD10. For now, I’ll use the Census short code.

I’ll get that complex data type, then make a long data frame of its geography which I’ll call philadelphiaCensusTracts_fortified, which will set a “group” (region) equal to the NAME10 aspect. This allows me to easily join data to it later on.

philadelphiaCensusTracts_fortified <- tidy(philadelphiaCensusTracts, region = "NAME10")

What does this geojson data “look” like, when we plot the various polygons it includes? Each polygon represents a Census tract from the 2010 Census.

philly_plain <- ggplot() + 
               aes(x=long, y=lat, group=group, fill=NA), 
               color = "black", fill=NA, size=0.1) +
  coord_map() +