Mapping Air Quality in New Delhi, India#
New Delhi famously has some of the worst air pollution in the world. In this exercise, we will load data from Delhi air quality monitoring stations and place them on a map of the city.
Gradescope Autograding#
Please follow all standard guidance for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called results and ensuring your notebook runs from the start to completion without any errors.
For this assignment, please name your file exercise_projections.ipynb before uploading.
You can check that you have answers for all questions in your results dictionary with this code:
assert set(results.keys()) == {
    "ex6_world_crs_str",
    "ex9_delhi_mean_so2",
    "ex9_delhi_crs_str",
    "ex12_distance",
    "ex12_nearest_name",
    "ex13_distance_units",
    "ex14_distance",
    "ex14_nearest_name",
    "ex15_change",
    "ex16_units",
    "ex17_utmzone",
    "ex18_distance",
    "ex18_nearest_name",
    "ex18_units",
}
Submission Limits#
Please remember that you are only allowed THREE submissions to the autograder. Your last submission (if you submit 3 or fewer times), or your third submission (if you submit more than 3 times), will determine your grade. Submissions that error out will not count against this total.
Exercise 1#
Please load the air pollution data from this URL. Note that this is JUST a csv, so load it with pandas, not geopandas. Take a look at what a few rows look like.
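As a sketch of the loading step, here is the same pattern with a tiny inline CSV standing in for the real URL (the actual file and its column names will differ, so treat the names below as placeholders):

```python
import io
import pandas as pd

# Stand-in for the real file: a tiny CSV with the kinds of columns
# you should expect (these column names are illustrative, not the real ones).
csv_text = """station,latitude,longitude,pollutant_id,pollutant_avg
Anand Vihar,28.65,77.30,SO2,20.5
ITO,28.63,77.24,SO2,15.0
"""

# pd.read_csv accepts a URL directly, e.g. pd.read_csv("https://...")
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```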
Exercise 2#
As you will see, each observation in this data is a reading of a specific pollutant from an air quality monitoring station. The location of the station is given by two columns: latitude and longitude.
Getting point data in a normal spreadsheet with lat and long is pretty common, and so geopandas actually has a utility for converting this kind of data into a GeoDataFrame: gpd.points_from_xy()
Convert your pandas dataframe into a GeoDataFrame using gpd.points_from_xy() and the following template (obviously changing data frame name and column names):
import geopandas as gpd
gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(
        df["Longitude column"],
        df["Latitude column"],
    ),
)
NOTE THE ORDER OF LATITUDE AND LONGITUDE! We say latitude and longitude, but longitudes actually indicate east-west location (so are the x-coordinates) and latitudes indicate north-south location (so are the y-coordinates).
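You can check the ordering for yourself on a single made-up point (coordinates here are just illustrative values near Delhi):

```python
import geopandas as gpd

# points_from_xy takes x (longitude) FIRST, then y (latitude).
pts = gpd.points_from_xy([77.2], [28.6])

# The resulting point stores longitude as .x and latitude as .y
print(pts[0].x, pts[0].y)
```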
Exercise 3#
Plot the data. Does it look about how you’d expect? You can look up a map of India if you don’t know what it looks like.
Exercise 4#
Let’s overlay these with a world map. Load the world map geojson at this url: nickeubank/practicaldatascience_book and plot the world map. Does the world map look right?
Exercise 5#
Now create a matplotlib axes object and plot both GeoDataFrames onto the axes with different colors.
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
india.plot(ax=ax, color="red")
world.plot(ax=ax, color="blue")
Did it work as expected? Can you explain what happened? (Try to answer before moving on)
Exercise 6#
Seriously, try and answer Ex 5 before reading this.
OK, done?
Well, the problem is that when we plot to matplotlib, matplotlib doesn’t know anything about coordinate reference systems (CRS)/projections. It just sees x and y coordinates.
If you look at the units on the x and y axes of your India and World plots, you will see they look VERY different. For India, the units range from 70 to 95 and 10 to 35. But on the world map, the units range from \(-2 \times 10^7\) to \(2 \times 10^7\) and from \(-0.5 \times 10^7\) to \(2 \times 10^7\). Why?
Well, let’s look at the coordinate reference systems!
Print out the CRS for both GeoDataFrames. Did that work? You probably are having some issues. Can you tell why? Again, try and answer before scrolling down.
Store the str() representation of the world-map CRS in your results dictionary under the key "ex6_world_crs_str" (in other words, don’t put the CRS object in the dictionary, run str(world.crs) and put that in the dict. Easier for the autograder.).
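A sketch of the storing step. Here a one-point GeoDataFrame stands in for the real world map (EPSG:3857 is what you should find on the actual data):

```python
import geopandas as gpd
from shapely.geometry import Point

# Stand-in for the world GeoDataFrame, with an explicit CRS attached.
world = gpd.GeoDataFrame(geometry=[Point(0, 0)], crs="EPSG:3857")

print(world.crs)  # the rich pyproj CRS object

results = {}
results["ex6_world_crs_str"] = str(world.crs)  # plain string for the autograder
```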
Exercise 7#
Could you figure out why? They are using different Coordinate Reference Systems! The world map is using EPSG:3857, which if you google you can see is a pseudo-Mercator projection.
Normally, the fix when two data sets have different CRS is just to use to_crs() to convert them to the same CRS. But as you noticed, our India data doesn’t actually have a CRS! Why?
When we created our India data, we told it where to find our x-y coordinates, but we never told it what those x-y coordinates meant. The column names told us, but we never told GeoPandas. That’s why india.crs returns None.
Since we know that the India data is in latitude and longitude (so WGS84 / epsg=4326), we can now tell that to GeoPandas using the .set_crs() method. Please do so. Note this is DIFFERENT from to_crs() — .set_crs() tells GeoPandas how to interpret the coordinates it already has, not to change them in any way.
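A toy illustration of the difference (a one-point GeoDataFrame stands in for the India data):

```python
import geopandas as gpd
from shapely.geometry import Point

# Coordinates are lon/lat, but no CRS has been declared yet.
india = gpd.GeoDataFrame(geometry=[Point(77.2, 28.6)])
print(india.crs)  # None

# set_crs LABELS the existing coordinates; it does not transform them.
india = india.set_crs(epsg=4326)
print(india.crs)
print(india.geometry.iloc[0])  # same coordinates as before
```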
Exercise 8#
Now that you’ve set the CRS for the India data, use to_crs to change the crs for the India data to match the world map. Then plot the two datasets on top of one another with India in red and the world in blue.
Does it look better now?
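By contrast, to_crs actually transforms the coordinates. A sketch with the same toy point:

```python
import geopandas as gpd
from shapely.geometry import Point

india = gpd.GeoDataFrame(geometry=[Point(77.2, 28.6)], crs="EPSG:4326")

# to_crs TRANSFORMS the coordinates into the target CRS (here, the
# world map's EPSG:3857), so the numbers themselves change.
india_3857 = india.to_crs(epsg=3857)
print(india_3857.geometry.iloc[0])  # now in metres, not degrees
```

Once both GeoDataFrames share a CRS, plotting them on the same axes lines them up correctly.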
Exercise 9#
Our particular interest in this analysis is the city of Delhi, India. Using your normal tabular manipulation skills, subset the India data to Delhi.
Let’s also focus our analysis on the pollutant SO2 by subsetting on pollutant_id.
Plot your data, and color points by average pollution level.
Just to make sure everything’s on the right track, store the str() representation of the current CRS of the delhi data under the key "ex9_delhi_crs_str" and the mean value of pollutant_avg in "ex9_delhi_mean_so2".
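A sketch of the subsetting pattern on toy data. The city column name here is an assumption; check the real data for the actual column to filter on:

```python
import geopandas as gpd
import pandas as pd

# Toy stand-in for the India GeoDataFrame ("city" is a hypothetical
# column name; pollutant_id and pollutant_avg match the exercise).
df = pd.DataFrame({
    "city": ["Delhi", "Delhi", "Mumbai"],
    "pollutant_id": ["SO2", "NO2", "SO2"],
    "pollutant_avg": [20.0, 40.0, 10.0],
    "longitude": [77.2, 77.3, 72.8],
    "latitude": [28.6, 28.7, 19.1],
})
india = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df["longitude"], df["latitude"]), crs="EPSG:4326"
)

# Subset to Delhi AND to SO2 readings.
delhi = india[(india["city"] == "Delhi") & (india["pollutant_id"] == "SO2")]

results = {}
results["ex9_delhi_crs_str"] = str(delhi.crs)
results["ex9_delhi_mean_so2"] = delhi["pollutant_avg"].mean()

# To color points by pollution level, pass the column name to .plot():
# delhi.plot(column="pollutant_avg", legend=True)
```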
Exercise 10#
Using contextily, add a basemap under the air stations in your plot.
Now that you understand about coordinate reference systems, you can probably figure out why cx.add_basemap has a crs argument and you have to pass the CRS of the dataset you are putting the basemap under!
Exercise 11#
Suppose we were interested in figuring out which station is furthest from other stations as a way of figuring out areas where we might want to add air quality monitors.
There are two ways to do this. The easiest is to use the .sjoin_nearest() method.
sjoin_nearest is an example of a method for doing spatial joins (thus the sjoin), which is where one merges two datasets based on their physical relationship to one another (rather than using a common variable). sjoin_nearest, in particular, will merge each record in the first dataset with the nearest record in the second.
So for example:
df1.sjoin_nearest(df2)
Will return all the rows in df1, but those rows will be augmented with data from the nearest row in df2.
In this case, however, since we are interested in each air quality station’s nearest other air quality station, we will be spatial joining our data to itself: delhi.sjoin_nearest(delhi).
Three final suggestions:

- If you want to be able to access the distance to the nearest object (not just merge with the nearest object), you have to use the distance_col argument.
- If you merge data with itself, then the nearest record will always be… itself. So GeoPandas has an exclusive argument. Set it to True and GeoPandas will get you the closest record that isn’t itself.
- Since you don’t need all of the columns from the join, just do sjoin_nearest with the geometry and station columns of the Delhi data. Otherwise, you’ll have two copies of every column, and GeoPandas will give the columns annoying _left and _right suffix names to avoid confusion.
Calculate each air station’s distance to the nearest air station and get the name of said nearest station.
Exercise 12#
Pull the record for station "Sonia Vihar, Delhi - DPCC".
How far is the nearest other air quality station, and what is that nearest station’s name?
Store the distance in "ex12_distance" and the nearest station name in "ex12_nearest_name".
Exercise 13#
What are the units of that distance? How do you know?
Hint: you can find the answer in delhi.crs.axis_info as the unit_name. You’ll see one record for each axis — Northing (the y-axis, with positive values to the North) and Easting (the x-axis, with positive values to the East).
Store your answer as a string in "ex13_distance_units". Americans: you will find the spelling is not the one you prefer. Please use the version provided by the projection.
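A sketch of how to read the units off a CRS with pyproj (the same object GeoPandas stores in .crs); EPSG:32643 here is just an example of a projected CRS:

```python
from pyproj import CRS

# axis_info has one entry per axis; unit_name carries the distance units.
crs = CRS.from_epsg(32643)  # an example projected (UTM) CRS
for axis in crs.axis_info:
    print(axis.name, axis.unit_name)
```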
Exercise 14#
Use to_crs() to reproject your data back to latitude and longitude (epsg=4326). Then recalculate the distance to the nearest station. What is the name of the station that is now nearest to "Sonia Vihar, Delhi - DPCC"? Store your answer in "ex14_nearest_name".
How far away is it? Store your answer under "ex14_distance".
NOTE: You will get a warning from this operation! Can you tell why? This is NOT a warning you should normally ignore!
Exercise 15#
Did the nearest station change? Store your answer as "yes" or "no" in "ex15_change".
Exercise 16#
What are the units of that distance? Again, I’d suggest using .crs.axis_info. Store your answer in "ex16_units".
If I asked you to translate those units to feet, could you? Answer in markdown.
Exercise 17#
Our distance calculations up until now have used two not-very-good choices of projections.
In our first calculations, we were using the CRS that we borrowed from our world map. The CRS is designed to make a world map look reasonable and balance the tradeoffs of projection across the entire globe. But we aren’t working with the entire globe! Something better tuned for Delhi will certainly do better.
In our second, we used latitude-and-longitude as our projection. Because Delhi is near the equator, this is not as terrible a choice as it is when you move further north or south, but the units are meaningless and again, it definitely isn’t optimized.
When analyzing a relatively small geographic area like a city, and when you don’t have a good reason to pick a different projection, a good default choice is a Universal Transverse Mercator (UTM) projection. UTM projections are conformal, meaning they excel at preserving shape. Since we’re mostly measuring distances, an equidistant projection would probably be a little more appropriate, but distortions at the city level shouldn’t be large given the level of precision we’re looking for.
The UTM system divides the world up into 60 different zones, as shown here.
UTM Zones are such a common “reasonable starting point” choice of projection, though, that you don’t need to dig through these maps to figure out what UTM zone covers your data — you can just ask GeoPandas using the .estimate_utm_crs() method!
This method will give you back a CRS object you can use directly to reproject your data.
Reproject your data using the CRS provided.
Store the name of the UTM zone you’re now using in "ex17_utmzone". Note the full name will start with "WGS 84 / UTM zone" followed by two numbers (a leading zero if required) and either an N or S. Please only put the last three characters (two numbers and a capital N or S) in "ex17_utmzone".
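A sketch of the whole step on toy points near Delhi (roughly 77°E, 28.6°N):

```python
import geopandas as gpd
from shapely.geometry import Point

# Toy lat/long points standing in for the Delhi stations.
delhi = gpd.GeoDataFrame(
    geometry=[Point(77.1, 28.6), Point(77.3, 28.7)], crs="EPSG:4326"
)

# estimate_utm_crs returns a pyproj CRS object for the covering UTM zone.
utm = delhi.estimate_utm_crs()
print(utm.name)  # "WGS 84 / UTM zone ..."

# That object can be passed straight to to_crs.
delhi_utm = delhi.to_crs(utm)
```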
Exercise 18#
Now, one last time, calculate each station’s distance to the nearest other station (now in this UTM projection) and store:

- The name of the station nearest to "Sonia Vihar, Delhi - DPCC" in "ex18_nearest_name",
- The distance to the station nearest to "Sonia Vihar, Delhi - DPCC" in "ex18_distance",
- The units of that distance in "ex18_units" (using the same spelling rules as above).
Exercise 19#
Just for fun, an illustration of some cool stuff you can do with point data.
Just as it is possible to fit a model to “fill in the gaps” between data points with normal data, it is also possible to interpolate the values of a variable between spatial points.
The practice is called kriging, and can be accomplished using the pykrige package. Kriging is a domain of modeling unto itself, and we don’t have space to really get into it. But in the interest of showing you what you can do with geospatial stuff, please install pykrige and run the following code to interpolate measures of air pollution in the Delhi area. But please DO try to understand as much of the code as you can.
import contextily as cx
import matplotlib.pyplot as plt
import numpy as np
from pykrige.ok import OrdinaryKriging

# PyKrige only works with numpy arrays, not pandas
delhi_has_data = delhi_utm.dropna()
x_coords = delhi_has_data["geometry"].x.values
y_coords = delhi_has_data["geometry"].y.values
pollution = delhi_has_data["pollutant_avg"].values
# The last two arguments are kinda technical.
# Learn about them if you want to use kriging,
# but just copy for now.
OK = OrdinaryKriging(
    x_coords,
    y_coords,
    pollution,
    variogram_model="exponential",
    variogram_parameters={"sill": 100, "range": 50_000, "nugget": 2},
)
# Kriging will create a grid (a raster)
# with predicted values in each grid square.
# So we have to tell it what the grid will look like.
interval = 1_000
gridx = np.arange(np.min(x_coords), np.max(x_coords) + interval, interval)
gridy = np.arange(np.min(y_coords), np.max(y_coords) + interval, interval)
# Create the predicted values
z, ss = OK.execute("grid", gridx, gridy)
# Set common color scale for both predictions and observations
vmin = min(z.min(), pollution.min())
vmax = max(z.max(), pollution.max())
# Plot everything on one map.
fig, ax = plt.subplots()
# Set the axes limits first so cx.add_basemap knows which tiles to pull
ax.set_xlim(gridx.min(), gridx.max())
ax.set_ylim(gridy.min(), gridy.max())
# Put basemap on bottom
cx.add_basemap(ax, crs=delhi_has_data.crs, source=cx.providers.CartoDB.Positron)
# Add krige predicted values
im = ax.imshow(
    z,
    extent=[gridx.min(), gridx.max(), gridy.min(), gridy.max()],
    origin="lower",
    cmap="bwr",
    alpha=0.1,
    vmin=vmin,
    vmax=vmax,
)
# Put air pollution stations on top
delhi_has_data.plot(
    "pollutant_avg", cmap="bwr", ax=ax, markersize=10, vmin=vmin, vmax=vmax
)
# Add legend from predicted values.
fig.colorbar(im, ax=ax, label="SO2 (µg/m³)")
# always looks better without axes
ax.set_axis_off()