Delhi Air Quality and Voting#

One of the best uses of geospatial tools is to link different data sources based on their spatial relationship. Often, this can make it possible to study the relationship between different phenomena in the world that would otherwise be impossible to get into the same database.

At the same time, relating different spatial data is not always easy. Administrative data, for example, is often aggregated to the level of administrative units whose boundaries may be defined by oddly shaped polygons. Mapping these polygons to other datasets — which may be organized into polygons that are not neatly nested within administrative boundaries — is not straightforward.

In this exercise, we will relate data on air pollution with data on voting behavior. Both will be provided as GeoDataFrames of polygons, but these polygons do not relate neatly to one another. Data on voting is aggregated to the level of electoral districts, while data on air pollution comes in the form of a regular grid of square polygons. In some cases — like when an entire air pollution polygon fits inside an electoral district — it's easy to tell how the datasets should relate. But in other cases — like when an air pollution polygon intersects multiple districts — the mapping between datasets will be less clear.

Gradescope Autograding#

Please follow all standard guidance for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called results and ensuring your notebook runs from the start to completion without any errors.

For this assignment, please name your file exercise_spatialjoins.ipynb before uploading.

You can check that you have answers for all questions in your results dictionary with this code:

assert set(results.keys()) == {
    "ex3_aap_avg",
    "ex4_merge_type",
    "ex7_no_pollution",
    "ex8_n_in_raw",
    "ex8_n_in_merged",
    "ex9_within_corr",
    "ex10_no_pollution",
    "ex10_n_in_merged",
    "ex11_intersect_corr",
    "ex12_within_corr",
    "ex13_num_intersected_polygons",
    "ex15_weighted_corr",
}

Submission Limits#

Please remember that you are only allowed THREE submissions to the autograder. Your last submission (if you submit 3 or fewer times), or your third submission (if you submit more than 3 times), will determine your grade. Submissions that error out will not count against this total.

Exercise 1#

Please load the air pollution data we will use for this exercise from this URL. What you will get is a GeoDataFrame, but if you plot it, you will see it looks a lot like the raster we created by kriging in the last exercise. That’s because… it is! Or rather, this is a vectorized version of that data — we took that grid, created a polygon for each grid square, and put it in a GeoDataFrame as its own row with the associated value.

As you will see in a few classes, this is probably not how you will usually work with raster data, though it is also not unheard of. But it gives us a good example of a GeoDataFrame of polygons that are not neatly nested in the other data we will use, and since we haven’t worked with raster data much before it allows us to explore some important ideas that apply whether one is working with vector or raster GIS data.
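To make the "vectorized raster" idea concrete, here is a minimal sketch with a made-up 2x2 grid of values (the real exercise just needs `gpd.read_file` on the provided URL): each grid cell becomes one square polygon, stored as its own row with the cell's value.

```python
import numpy as np
import geopandas as gpd
from shapely.geometry import box

# Toy illustration of "vectorizing" a raster: every grid cell becomes
# one square polygon carrying that cell's value as a row attribute.
values = np.array([[1.0, 2.0], [3.0, 4.0]])
cells, cell_values = [], []
for r in range(values.shape[0]):
    for c in range(values.shape[1]):
        # box(minx, miny, maxx, maxy): one unit square per grid cell
        cells.append(box(c, -r - 1, c + 1, -r))
        cell_values.append(values[r, c])

grid = gpd.GeoDataFrame({"value": cell_values}, geometry=cells)
print(len(grid), grid["value"].tolist())  # 4 [1.0, 2.0, 3.0, 4.0]
```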

So please load the data and plot it.

Exercise 2#

Our goal will be to correlate this air pollution data with data on how people vote. Data on voting is not provided at the individual level (for what I hope are obvious reasons). Rather, it is provided at the level of electoral districts.

In this exercise, we will be working with the electoral districts used for State Assembly elections in India. Please load the spatial data for these districts in Delhi from this URL. These district boundaries are from Data{Meet} Community Maps Project, but the link will give you a version that is already filtered down to Delhi.

For Indiaphiles: These appear to have been scraped in 2016, and so at least for Delhi are Post-Delimitation boundaries.

Please plot these boundaries.

Exercise 3#

As you look at the electoral district boundary GeoDataFrame, you will notice a pronounced lack of electoral data! This is actually not all that uncommon — many times spatial boundaries and the tabular data you may wish to join with the spatial boundaries are provided separately.

To get actual vote counts, please download the 2020 Delhi Assembly electoral returns from here. This is a filtered version of data from this site, and you can find the codebook for the data here.

Each row of this dataset reports the election result for the political party that held a majority in Delhi at the time of this election — the Aam Aadmi Party (AAP). Political scientists often theorize that when things go wrong, voters tend to punish whichever party is in power at the time. So for this analysis, we will be looking at whether places in Delhi with worse air pollution were less likely to vote for the incumbent party.

You can find the share of votes in each district that went to the candidate from the incumbent party in the variable incumbent_party_vote_share. Calculate the average AAP vote share across electoral districts and store your answer in "ex3_aap_avg".
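The calculation itself is a one-line mean; a sketch with made-up numbers (only the column name is from the real dataset):

```python
import pandas as pd

# Toy stand-in for the election returns; the column name matches the
# exercise, the vote shares are invented.
returns = pd.DataFrame({"incumbent_party_vote_share": [0.40, 0.55, 0.52]})

results = {}
results["ex3_aap_avg"] = returns["incumbent_party_vote_share"].mean()
print(round(results["ex3_aap_avg"], 2))  # 0.49
```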

Exercise 4#

Now let’s merge our electoral data with our election district boundary file. This will not be a spatial join, but rather a normal merge on a variable — in this case, we will be merging AC_NO in our GeoDataFrame to Constituency_No in our election results.

Note: to ensure you get back a GeoDataFrame, it is recommended you use the .merge syntax with the GeoDataFrame coming first:

my_geodataframe.merge(my_pandas_dataframe, on="common_var")

pd.merge can also work if the GeoDataFrame is passed in the left position, but the syntax above is the safer choice. If the GeoDataFrame ends up on the right, the merge will still run, but you will get back a pandas DataFrame rather than a GeoDataFrame.

So now merge your data. Store the kind of merge you are doing as a string ("1:1", "1:m", "m:1", or "m:m") under "ex4_merge_type". (forget about merge validation? Re-read here!)
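A minimal sketch of this merge with toy data (the column names are from the exercise; the geometries and vote shares are invented). Passing `validate` makes the merge fail loudly if your assumption about the merge type is wrong:

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Toy district boundaries and election returns.
districts = gpd.GeoDataFrame(
    {"AC_NO": [1, 2]}, geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)]
)
returns = pd.DataFrame(
    {"Constituency_No": [1, 2], "incumbent_party_vote_share": [0.5, 0.6]}
)

# GeoDataFrame first so the result stays a GeoDataFrame; validate="1:1"
# raises if either key column contains duplicates.
merged = districts.merge(
    returns, left_on="AC_NO", right_on="Constituency_No", validate="1:1"
)
print(type(merged).__name__)  # GeoDataFrame
```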

Now make a new map of electoral districts colored by the vote share won by the AAP.

Exercise 5#

As a first, simple exercise, plot both electoral districts and air pollution on the same plot. Color the air pollution polygons according to SO2 levels and the electoral districts according to AAP vote share.

Note: if this doesn’t work at first, you probably forgot something important! :)
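When two layers refuse to line up on one plot, a common culprit is mismatched coordinate reference systems. A minimal reprojection sketch (the bounding box is a rough, made-up box around Delhi; EPSG:32643 is UTM zone 43N):

```python
import geopandas as gpd
from shapely.geometry import box

# A rough bounding box around Delhi in lat-long (EPSG:4326).
gdf = gpd.GeoDataFrame(geometry=[box(76.8, 28.4, 77.4, 28.9)], crs="EPSG:4326")

# Reproject to UTM zone 43N (EPSG:32643) so both layers share one
# projected CRS before plotting them on the same axes.
gdf_utm = gdf.to_crs(epsg=32643)
print(gdf_utm.crs.to_epsg())  # 32643
```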

Correlating Pollution and Incumbent Party Support#

We now have two spatial datasets — one of election results organized by electoral district polygons, and one of air pollution organized in regular grid cells.

We now want to do a spatial join of these two datasets — but how do we do it? Ignoring the details of how to get it to happen in Python, how would you use these squares to estimate the average air pollution in each district? Would you average over all the polygons that are fully in the district? Would you average over all the polygons that touch the district? If we average over all the polygons that touch a district, is it a problem that we’re giving as much weight to a polygon that just touches the district as one that’s fully in the district?

In our next few exercises, we are going to explore a few strategies and see how they play out.

Exercise 6#

The geopandas sjoin method is designed for precisely this type of situation, and can easily relate datasets based on different types of spatial relations. In particular, one can join records that have any of the following relationships to one another:

  • intersects

  • contains

  • within

  • touches

  • crosses

  • overlaps

(You can review the Merging Data documentation for more details)

Let’s start with the simplest option: joining our constituency polygons to the pollution polygons that are fully within each constituency. This should be a left join.

Note: to ensure we’re all working with the same data, make sure you’re using UTM zone 43N as your CRS. You probably switched to it in Exercise 5, but if you projected both datasets into Lat-Long (epsg 4326), well… shame on you. :) Use UTM zone 43N.

NOTE 2: Be careful with contains and within, as it’s easy to get them backwards. GeoPandas thinks about things in terms of the first GeoDataFrame called, so df1.sjoin(df2, predicate="contains") will get you all the shapes from df2 that are within df1 (the shapes that the df1 polygons “contains”). Personally, I usually find the correct predicate is the opposite of what I think it should be intuitively.

When you are done, store the number of rows in your resulting dataset in "ex6_nrows".
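A toy sketch of this join, with one district and two invented pollution cells (one fully inside the district, one straddling its edge), illustrating how `predicate="contains"` behaves from the district side:

```python
import geopandas as gpd
from shapely.geometry import box

districts = gpd.GeoDataFrame({"AC_NO": [1]}, geometry=[box(0, 0, 2, 2)])
cells = gpd.GeoDataFrame(
    {"SO2_predicted": [10.0, 20.0]},
    geometry=[box(0.5, 0.5, 1.5, 1.5), box(1.5, 0.5, 2.5, 1.5)],
)

# Left join keeps every district; predicate="contains" matches only the
# cells whose geometry lies fully inside a district polygon.
joined = districts.sjoin(cells, how="left", predicate="contains")
print(len(joined))  # 1 -- only the fully-contained cell matches
```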

Exercise 7#

How many constituencies have no SO2 pollution data? Store your answer in "ex7_no_pollution". Why do you think this happened? Do they not have air? Answer in markdown.

Exercise 8#

How many pollution polygons did you have in your raw data and how many ended up in the merged data (i.e., how many rows of the merged data have SO2_predicted values)? Store your answers in "ex8_n_in_raw" and "ex8_n_in_merged". Why did this happen? (Note that this discrepancy is not necessarily as much of a problem as Ex 7, but is still an issue.) Answer in Markdown.

Exercise 9#

For each district, calculate the average value of SO2. That should give you a dataset with one observation per electoral district (or rather, observations for fewer than all 70 districts, given your answer to Ex 7).

Then calculate the correlation across those <70 districts between SO2 and incumbent party vote share. Store the result in "ex9_within_corr".
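The collapse-then-correlate pattern can be sketched with toy numbers (column names follow the exercise; the values and the `so2`/`vote` labels are invented):

```python
import pandas as pd

# Toy joined table: each row is one (district, pollution cell) pair.
joined = pd.DataFrame(
    {
        "AC_NO": [1, 1, 2, 2],
        "SO2_predicted": [10.0, 20.0, 30.0, 50.0],
        "incumbent_party_vote_share": [0.5, 0.5, 0.4, 0.4],
    }
)

# One row per district: mean SO2 alongside the (constant) vote share.
per_district = joined.groupby("AC_NO").agg(
    so2=("SO2_predicted", "mean"),
    vote=("incumbent_party_vote_share", "first"),
)
corr = per_district["so2"].corr(per_district["vote"])
print(round(corr, 2))  # -1.0 with only two toy districts
```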

Exercise 10#

Now that we’ve seen some of the issues that arise when we limit our attention to polygons that fall fully within each electoral district, why don’t we try the opposite: let’s try a spatial join in which we match electoral districts with any air pollution polygons that intersect the electoral district.

Before we try, can you predict what will happen?

Store the number of Districts with no pollution data in "ex10_no_pollution" and the number of pollution records (rows with pollution data) in "ex10_n_in_merged". How does "ex10_n_in_merged" compare to "ex8_n_in_raw" and "ex8_n_in_merged"?
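Reusing the toy geometries from before (one district, one cell inside it, one cell straddling its edge, all invented), swapping the predicate to `"intersects"` shows what changes:

```python
import geopandas as gpd
from shapely.geometry import box

districts = gpd.GeoDataFrame({"AC_NO": [1]}, geometry=[box(0, 0, 2, 2)])
cells = gpd.GeoDataFrame(
    {"SO2_predicted": [10.0, 20.0]},
    geometry=[box(0.5, 0.5, 1.5, 1.5), box(1.5, 0.5, 2.5, 1.5)],
)

# "intersects" matches any cell that overlaps the district at all, so the
# straddling cell (excluded under "contains") now joins too.
joined = districts.sjoin(cells, how="left", predicate="intersects")
print(len(joined))  # 2
```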

Exercise 11#

For each district, calculate the average value of SO2. That should give you a dataset with one observation per electoral district.

Then calculate the correlation across those 70 districts between SO2 and incumbent party vote share. Store the result in "ex11_intersect_corr".

Exercise 12#

Now that you’ve seen the two extremes of polygon merging, let’s look for something in the middle.

First, we’ll try the “quick and dirty” strategy.

Take your pollution data, and replace your polygons with centroids. A centroid is a single point in the “middle” of a polygon, where “middle” is defined as the point where you could balance the polygon on a pin if you cut it out of a uniform material.

Make sure you both construct your centroids and make them the active geometry before merging!

Then merge your Districts with all the centroids they contain.
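A toy sketch of the centroid strategy (invented geometries: the second cell's centroid falls outside the district, so it drops out of the join):

```python
import geopandas as gpd
from shapely.geometry import box

districts = gpd.GeoDataFrame({"AC_NO": [1]}, geometry=[box(0, 0, 1.4, 1)])
cells = gpd.GeoDataFrame(
    {"SO2_predicted": [10.0, 20.0]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
)

# Replace the squares with their centroids and make those the active
# geometry before joining.
cells = cells.set_geometry(cells.geometry.centroid)

# A district "contains" a centroid point that falls inside it; the second
# cell's centroid (1.5, 0.5) lies outside the district, so it drops out.
joined = districts.sjoin(cells, predicate="contains")
print(len(joined))  # 1
```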

Finally, as before, calculate the average value of SO2 per district. That should give you a dataset with one observation per electoral district.

Then calculate the correlation across those 70 districts between SO2 and incumbent party vote share. Store the result in "ex12_within_corr".

(Again, be careful with the choice of predicate and review the note in Ex 6)

Exercise 13#

And now for the final, most precise but also most complicated solution.

Basically, we are going to say that the air pollution in each constituency is an area-weighted average of the pollution in each intersecting polygon. That means that all polygons that are fully within a district will get equal weight, but a polygon that is only half-in a district will only get half-weight.

To do so, we will first have to overlay our two polygon layers to create a new GeoDataFrame. In particular, we will need to intersect our polygon layers. If you don’t remember set operations, please review them now!

Intersect your two layers. Store the resulting number of polygons in "ex13_num_intersected_polygons".
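A toy sketch of the overlay (invented geometries: two districts side by side, with the second pollution cell straddling their border), including the area column you will need next:

```python
import geopandas as gpd
from shapely.geometry import box

districts = gpd.GeoDataFrame(
    {"AC_NO": [1, 2]}, geometry=[box(0, 0, 1.5, 1), box(1.5, 0, 3, 1)]
)
cells = gpd.GeoDataFrame(
    {"SO2_predicted": [10.0, 20.0]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
)

# how="intersection" cuts every cell along district borders, so a cell
# straddling a boundary becomes one piece per district it overlaps.
pieces = gpd.overlay(districts, cells, how="intersection")
pieces["area"] = pieces.geometry.area  # the weights for the next steps
print(len(pieces))  # 3: cell 1 whole, cell 2 split in two
```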

Exercise 14#

Now in order to take an area-weighted average, we will need each polygon’s area! Create a new column with each polygon’s area using .area.

Exercise 15#

Finally, we need to do a good old-fashioned tabular groupby to calculate a weighted average.

Basically, for each District, we want to calculate:

\[\sum_{i \in D} \frac{area_i}{\sum_{j \in D} area_j} \cdot \text{SO2\_predicted}_i \]

Where \(i\) and \(j\) both index the polygons in District \(D\).

This weighted average looks scary at first, but it isn’t bad, I promise.

If \(area_i = 1\) for all \(i\), for example, then we’d just be adding up all the SO2 predicted values and dividing by N.

In Python-speak (instead of math-speak), for each district we want to do:

# Total area of all polygon pieces in this district
district_size = polygons_in_district["area"].sum()

# Accumulate each piece's area-weighted SO2 contribution
numerator = 0
for row in polygons_in_district.itertuples():
    numerator += row.area * row.SO2_predicted

district_area_weighted_SO2 = numerator / district_size

Now as with most things data science, I wouldn’t do it in a loop, but if you find that helpful there it is! Personally, I’d use groupby and transform, but you can also use groupby and merge.
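The groupby-and-transform version can be sketched with toy numbers (column names follow the exercise; values are invented, matching the toy overlay above: district 1 holds pieces with areas 1.0 and 0.5):

```python
import pandas as pd

pieces = pd.DataFrame(
    {
        "AC_NO": [1, 1, 2],
        "area": [1.0, 0.5, 0.5],
        "SO2_predicted": [10.0, 20.0, 20.0],
    }
)

# transform("sum") broadcasts each district's total area back onto its
# rows, so the weights can be computed without an explicit loop.
district_area = pieces.groupby("AC_NO")["area"].transform("sum")
pieces["weighted"] = pieces["area"] / district_area * pieces["SO2_predicted"]
weighted_so2 = pieces.groupby("AC_NO")["weighted"].sum()
print(weighted_so2.round(2).to_dict())  # {1: 13.33, 2: 20.0}
```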

After you have calculated this weighted average for each constituency, calculate the correlation between SO2 and incumbent party vote share and store it in "ex15_weighted_corr".

Discussion#

At this point, you’ve probably realized that air pollution is not negatively correlated with the vote share of the incumbent party! Honestly, that shouldn’t be too surprising — this type of cross-sectional correlation is not a good way to answer causal questions. Odds are, air pollution is worse in the city center, and the people in the city center are different from people who live in the areas around central Delhi, and those differences are what are driving support for AAP. If we were being more rigorous, we would do something like correlate changes in air pollution with changes in AAP support.

Nevertheless, hopefully this has given you a feel for some of the challenges of spatial joins, and also given you a chance to see how geometric operations can be paired with spatial joins in interesting ways!