Using GIS to Study Election Polling Places

In an incredible example of data journalism, the Center for Public Integrity has done the hard work of collecting the locations of polling places for a huge number of US states from 2012-2018 (they're also gathering data for 2020).

For this exercise, we will be using some of their data to do some basic analyses of polling places.

Exercise 1

Visit the GitHub repo of the Center for Public Integrity and download the most recent polling place data for a state of your choice. Load the data.

Exercise 2: Geocoding

In this data, you will find a column called “Address” with the addresses of each polling place. Unfortunately, an address – saved as a string – is not valid geospatial data, so we need to find a way to convert these addresses into latitudes and longitudes.

The process of converting addresses into latitude and longitude is called geocoding, and can be done in a number of different ways.

If you’re willing to spend a little bit of money, my own favorite tool is – upload a spreadsheet of addresses, and they will geocode them at a very reasonable price (all the polling places in North Carolina for a year costs about $1.50), and you can also get data attached to your spreadsheet like information from the U.S. Census for a little bit extra.

Alternatively, you can use the geocode tool in geopandas to geocode addresses programmatically. The main limitation is that geocoding services usually either cap the number of addresses you can query per day, require an account, or both. Note that this tool requires installing geopy, an optional geopandas dependency.

Use this tool to geocode 20 of your polling place addresses (take a random sample) using provider="ArcGIS". Note this may take a while, as the geocoding service is rate-limited (it only accepts queries coming in so fast). ArcGIS is only free for a few queries, so we don’t wanna get booted by getting greedy. :)

(Note that if you’re trying to do this in a classroom, the geocoding service may not be able to distinguish between different students with the same IP address so this may fail miserably. 🤣)

Use .plot() to plot the resulting data.
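The sampling and geocoding steps might be sketched roughly as follows. This assumes the data sits in a pandas dataframe with an "Address" column; the helper names pick_sample and geocode_sample and the small demo table are made up for illustration. The geocode function from geopandas.tools is real, but it sends requests over the network, so the actual call is left commented out.

```python
import pandas as pd
from geopandas.tools import geocode  # needs the optional geopy package at call time


def pick_sample(df, address_col="Address", n=20, seed=42):
    """Return a reproducible random sample of non-missing addresses."""
    addresses = df[address_col].dropna()
    return addresses.sample(n=min(n, len(addresses)), random_state=seed)


def geocode_sample(df, **kwargs):
    """Geocode a random sample of addresses (this queries the network)."""
    sample = pick_sample(df, **kwargs)
    # The result is a GeoDataFrame with "geometry" and "address" columns.
    return geocode(sample, provider="ArcGIS")


# Hypothetical stand-in for the polling place table you loaded in Exercise 1.
demo = pd.DataFrame(
    {"Address": [f"{i} Main St, Durham, NC" for i in range(1, 51)]}
)
sample = pick_sample(demo)
print(len(sample))

# Uncomment to actually query the geocoder and plot the points:
# points = geocode_sample(demo)
# points.plot()
```

Fixing the random seed means re-running the cell sends the same 20 addresses, which is kinder to a rate-limited service than drawing a fresh sample each time.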

This geocode tool actually supports LOTS of geocoders, and while most of them aren’t free, if you were doing this professionally you could get an API key for any of them to make this work with, say, Google Maps, or for pretty cheap.

Exercise 3: Spatial CSV

Since that may or may not have worked for you depending on whether you got throttled by the geocoding service, we’ll now turn to another way of working with spatial points data!

While most geospatial data is stored in specialized file formats like shapefiles or GeoJSON, when geospatial data only includes the location of points, the data can actually be stored in a normal tabular data format like csv where the x-coordinate of each point is stored in one column and the y-coordinate is stored in another.

I’ve gone ahead and geocoded all the polling place addresses for North Carolina in 2018. Please get those here and load them as a regular pandas dataframe.

Exercise 4: Convert to Spatial

Because this format of GIS point data is so common, geopandas has a special tool for converting regular dataframes with x and y coordinate columns to points: points_from_xy(). You can find an example here.

Convert the dataframe you just loaded to a geodataframe with points for all polling places using the points_from_xy() method.

Plot the result. Does it look like North Carolina?
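As a rough sketch of the conversion, assuming the coordinate columns are called "Latitude" and "Longitude" (the column names and the three rows below are invented for illustration; check the names in the file you actually loaded):

```python
import pandas as pd
import geopandas as gpd

# Hypothetical rows standing in for the geocoded polling place CSV.
polls = pd.DataFrame(
    {
        "name": ["Site A", "Site B", "Site C"],
        "Latitude": [35.994, 35.780, 35.227],
        "Longitude": [-78.899, -78.638, -80.843],
    }
)

# points_from_xy takes x first, then y: x is longitude, y is latitude.
geometry = gpd.points_from_xy(polls["Longitude"], polls["Latitude"])
polls_gdf = gpd.GeoDataFrame(polls, geometry=geometry)

print(polls_gdf.head())
# polls_gdf.plot()  # uncomment to draw the points
```

Swapping the two arguments is an easy mistake to make; if the plot looks like North Carolina rotated on its side, that is the likely cause.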

Exercise 5: Set the CRS

While you have successfully specified which columns contain x and y coordinates, you have NOT told Python the coordinate reference system of these points. Please SET the coordinate reference system to WGS84 (the CRS of latitudes and longitudes), also known as EPSG code 4326. Without doing this, you can neither re-project nor combine this data with other data!

Note we are SETTING, not RE-PROJECTING, a super important distinction!
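The difference between setting and re-projecting can be seen in a small sketch like the one below. The single point is made up, and EPSG 3857 (Web Mercator) is used only as an arbitrary second CRS to show that re-projecting changes the coordinate values while setting does not.

```python
import geopandas as gpd

# One made-up point with no CRS attached yet.
pts = gpd.GeoDataFrame(geometry=gpd.points_from_xy([-78.899], [35.994]))
print(pts.crs)  # nothing has been set so far

# SETTING: attach a label saying "these numbers are WGS84 lon/lat".
# The coordinate values themselves are left untouched.
labeled = pts.set_crs(epsg=4326)
print(labeled.geometry.x.iloc[0])

# RE-PROJECTING: compute new coordinate values in a different CRS.
# This only works once a CRS has been set.
moved = labeled.to_crs(epsg=3857)
print(moved.geometry.x.iloc[0])
```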

Exercise 6: Accuracy Checks

Geocoding can be hard – two of the columns in the geocoded data are Accuracy Score and Accuracy Type. Check the distribution of accuracy types to see how well your geocoding actually went. You can read about accuracy types here.

Do you see any values you think you might not want to trust / might want to check by hand?
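One way to look at this, sketched on an invented table: the column names match the ones mentioned above, but the values and the 0.8 cutoff are assumptions chosen purely for illustration, not guidance from the geocoding service.

```python
import pandas as pd

# Hypothetical geocoding output; real values will differ.
geocoded = pd.DataFrame(
    {
        "Accuracy Score": [1.0, 0.9, 0.5, 1.0, 0.33],
        "Accuracy Type": [
            "rooftop",
            "rooftop",
            "street_center",
            "range_interpolation",
            "place",
        ],
    }
)

# How many addresses fall into each accuracy type?
type_counts = geocoded["Accuracy Type"].value_counts()
print(type_counts)

# Flag low-scoring rows for a manual check (0.8 is an arbitrary cutoff).
suspect = geocoded[geocoded["Accuracy Score"] < 0.8]
print(suspect)
```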

Exercise 7: Re-project

Latitude and longitude (EPSG 4326) is not a very good projection for… well anything. Suppose that we were primarily interested in measuring the distances between voters and their polling places. What type of projection do you think we would want to use?

Exercise 8: Re-Project for real

You can find a good projection for this purpose here. Use it to re-project your data. When you plot it, does it look different from your plot above?
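A minimal sketch of re-projecting and then measuring a distance. It assumes EPSG 32119 (NAD83 / North Carolina, in meters) as the target, which may not be the projection the exercise links to, and the two points are approximate city-center coordinates picked for illustration.

```python
import geopandas as gpd

# Two illustrative points in lon/lat, with the CRS already set to WGS84.
pts = gpd.GeoDataFrame(
    {"place": ["Durham", "Raleigh"]},
    geometry=gpd.points_from_xy([-78.8986, -78.6382], [35.9940, 35.7796]),
    crs="EPSG:4326",
)

# Re-project to a state plane CRS whose units are meters (an assumption).
projected = pts.to_crs(epsg=32119)

# Distances in a projected CRS come out in that CRS's units (here meters).
distance_m = projected.geometry.iloc[0].distance(projected.geometry.iloc[1])
print(round(distance_m / 1000, 1), "km")
```

Running the same distance call on the unprojected data would return a number in degrees, which is not a meaningful distance.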

Exercise 9: Subset

Geodataframes implement everything that was originally available in a normal pandas dataframe. To illustrate, subset your data to the county of Durham and plot it.
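For instance, ordinary boolean indexing works unchanged and returns a geodataframe. The "county" column name and the rows below are hypothetical; use whichever column holds the county in your data.

```python
import geopandas as gpd

# Hypothetical polling places with a county column.
polls = gpd.GeoDataFrame(
    {"county": ["DURHAM", "WAKE", "DURHAM"]},
    geometry=gpd.points_from_xy([-78.90, -78.64, -78.95], [35.99, 35.78, 36.03]),
    crs="EPSG:4326",
)

# Same syntax as pandas; upper-casing guards against inconsistent capitalization.
durham = polls[polls["county"].str.upper() == "DURHAM"]

print(type(durham).__name__, len(durham))
# durham.plot()  # uncomment to draw the subset
```

The subset keeps the geometry column and the CRS, so it can be plotted or re-projected like the full data.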

Exercise 10: Combine data

You can find a shapefile that contains all of the US Census Blocks for the County of Durham here. Download this file and load it using the geopandas function read_file. Plot it.

Exercise 11: Map with Layers

Following the maps with layers directions, overlay these two maps. Note that before you make your figure you need to project your maps into the same projection! I suggest using the projection you’re already using for polling places.

(Note that the shapefile with Durham County blocks includes projection information, so you don’t have to set the projection. You can see this by checking the .crs property to see that it’s defined.)

Make sure that your polling place locations stand out.
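A layered map might be put together along these lines. Everything here is a stand-in: two rectangles play the role of the census blocks, two made-up points play the role of polling places, and EPSG 32119 is an assumed projection. The point is the pattern of matching the CRS first and then drawing both layers on one set of axes.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so this runs without a display; drop it in a notebook
import matplotlib.pyplot as plt
import geopandas as gpd
from shapely.geometry import box

# Stand-in "blocks": two rectangles near Durham, in lon/lat.
blocks = gpd.GeoDataFrame(
    geometry=[box(-79.0, 35.9, -78.9, 36.0), box(-78.9, 35.9, -78.8, 36.0)],
    crs="EPSG:4326",
)

# Stand-in polling places, already re-projected (assumed CRS: EPSG 32119).
polls = gpd.GeoDataFrame(
    geometry=gpd.points_from_xy([-78.95, -78.85], [35.95, 35.97]),
    crs="EPSG:4326",
).to_crs(epsg=32119)

# Put the blocks into the same projection as the polling places.
blocks = blocks.to_crs(polls.crs)

# Draw the base layer first, then the points on the same axes.
fig, ax = plt.subplots(figsize=(8, 8))
blocks.plot(ax=ax, color="lightgrey", edgecolor="white")
polls.plot(ax=ax, color="red", markersize=40, zorder=2)
ax.set_axis_off()
# fig.savefig("durham_polling_places.png")  # hypothetical output filename
```

A muted fill for the blocks with a saturated color and a higher zorder for the points is one simple way to make the polling places stand out.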

In this case, the block data we pulled in doesn’t have any demographic data, but it has identifiers that allow it to be easily linked to demographic data from the census, so hopefully you can see how we can easily now relate polling places to the demographics of the communities around them.