Durham Air Quality#

In this exercise, we will begin to work directly with raster data from satellites.

In particular, we will be working with data from the Sentinel-5P satellite managed by the EU’s Copernicus Programme.

While there are lots of places to get satellite data, different sources can be easier or harder to work with.

For example, satellite raster data generally comes in different processing “Levels”:

  • Level 1 is raw sensor data (and is not for anyone but serious experts).

  • Level 2 data has been processed (so you can interpret the data), but is organized by “orbit”, and files will just provide data for the swath of the earth covered by a single satellite pass.

  • Level 3 is data that’s been processed and collated across orbits, making it the data you’re most likely to work with if you aren’t looking to get data from a precise moment in time.

(There are some sub-categories like Level 2A or Level 1C as well, but we don’t need to get into that.)

For this exercise, we’ll be downloading some Level 3 data from Sentinel-5P from NASA’s GES DISC platform, a very impressive and easy-to-use clearinghouse of raster data. Seriously, go poke around — try browsing by topic a little — to see what kind of data they have. In a later lesson we’ll look at some ways to pull data using Python APIs, but until you’re comfortable with data formats and details, trying to navigate APIs without menus and labels is really hard.

Gradescope Autograding#

Please follow all standard guidance for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called results and ensuring your notebook runs from the start to completion without any errors.

For this assignment, please name your file exercise_rasters.ipynb before uploading.

You can check that you have answers for all questions in your results dictionary with this code:

assert set(results.keys()) == {
    "ex2_mean_july_no2",
    "ex7_mean_nc_july",
    "ex8_mean_nc_year",
    "ex10_mean_durham",
}

Submission Limits#

Please remember that you are only allowed THREE submissions to the autograder. Your last submission (if you submit 3 or fewer times), or your third submission (if you submit more than 3 times), will determine your grade. Submissions that error out will not count against this total.

Exercise 1#

You can find the raster data we will be working with here. This was created from this data page by combining monthly average data from August 2024 to July 2025.

To load it properly, use the following imports and read command:

import xarray as xr
import rioxarray

pollution = xr.load_dataset(
    "https://github.com/nickeubank/MIDS_Data/raw/"
    "refs/heads/master/air_pollution/us_no2_raster.nc4",
    engine="h5netcdf",
)

You’ll probably need to install:

  • h5netcdf,

  • fsspec,

  • and aiohttp

even if you already have xarray. The netCDF file type has changed a few times over the years, and h5netcdf seems to be the best library for reading the newest generation of netCDFs. aiohttp helps with loading large files over the internet. I have no idea what fsspec is but they say they want it so… 🤷‍♂️

It’s important to import rioxarray even if we don’t use it directly — when imported, it empowers xarray to do things with spatial data it otherwise could not. Basically, all the functionality of rioxarray is exposed through my_xarray_dataset.rio. methods, which only become available once rioxarray has been imported.

NOTE: This is a big dataset, so loading will take around 5 minutes.

Once you get better with xarray, you may prefer to use open_dataset rather than load_dataset. open_dataset just loads metadata about the dataset, and waits to load actual data till you ask for it (a little like dask). That can be much faster, though while learning xarray you’ll probably keep manipulating the data in different ways, so it’s probably better to use load_dataset and not try to pull the data across the internet over and over.
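To see the difference concretely, here is a tiny sketch using a throwaway synthetic netCDF (the file name demo.nc and variable demo_var are made up for the demo, and the scipy engine is used just to keep it dependency-light — the real pollution file needs h5netcdf):

```python
import numpy as np
import xarray as xr

# Write a tiny synthetic netCDF to disk.
ds = xr.Dataset({"demo_var": ("x", np.arange(5.0))})
ds.to_netcdf("demo.nc", engine="scipy")

# load_dataset pulls all the values into memory immediately...
eager = xr.load_dataset("demo.nc", engine="scipy")

# ...while open_dataset reads only metadata now, and values on demand.
lazy = xr.open_dataset("demo.nc", engine="scipy")
lazy_values = lazy["demo_var"].values  # data actually read from disk here
lazy.close()
```

With a multi-gigabyte remote file, that "on demand" behavior is the whole ballgame — but as the text says, it also means every fresh access can hit the network again.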

TIP: You may find it easier to download the netCDF to your computer and load the local version until it’s time to upload to the autograder — it’s MUCH faster that way.

Exercise 2#

Take a look at the data. Make a map of the most recent month of data (July 2025).

Note that unlike a GeoDataFrame, you can’t just use .plot(), since this data has too many dimensions to plot in two dimensions without subsetting. It has two data variables — Tropospheric_NO2 and Number_obs (which is the number of satellite passes that contributed to each pixel of data) — and three Coordinate dimensions.

So you’ll need to subset for a single time index (recall the analogues of loc and iloc for pandas are sel and isel in xarray), and specify you want to analyze the Tropospheric_NO2 variable.

Subsetting along a time Coordinate is a little tricky because it is of type “datetime”, but if you pass a date as a string to sel it will try to parse it!

NOTE: when you plot, the default diverging colormap will be centered at 0. But there aren’t really any negative values in this data — there are a couple pixels set to extreme negative values, I think where data is missing. Setting vmin=0 will set 0 at the bottom of the colormap. You should probably also pick a sequential colormap afterwards.
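The subsetting-and-plotting pattern looks something like this sketch on a small synthetic stand-in (the coordinate names time/lat/lon here are assumptions for the demo — check the real dataset's coordinate names yourself):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; unnecessary in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr

# A tiny stand-in for the pollution dataset: 3 months on a 4x5 grid.
rng = np.random.default_rng(0)
demo = xr.Dataset(
    {"Tropospheric_NO2": (("time", "lat", "lon"), rng.random((3, 4, 5)))},
    coords={
        "time": pd.date_range("2025-05-01", periods=3, freq="MS"),
        "lat": np.linspace(33, 37, 4),
        "lon": np.linspace(-84, -75, 5),
    },
)

# A date string passed to .sel gets parsed against the datetime coordinate.
july = demo["Tropospheric_NO2"].sel(time="2025-07-01")

# vmin=0 pins zero to the bottom of the colormap; viridis is sequential.
fig, ax = plt.subplots()
july.plot(ax=ax, vmin=0, cmap="viridis")
```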

Also store the average value of Tropospheric NO2 from July 2025 in this data in "ex2_mean_july_no2".

Exercise 3#

A great thing about xarray is that it tries to keep track of the units of measurement for different variables.

In this case, as you may have seen in answering number 2, the units being used result in some very, very large numbers.

Please rescale the current variable so moving forward we are measuring Tropospheric NO2 in units of quadrillions (that’s \(10^{15}\)) of molecules of nitrogen dioxide per cubic centimeter.

Be sure to not just re-scale the numeric values, but to also update the meta-data so it properly reflects the new units.
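The pattern looks something like this toy sketch (the units strings are assumptions for the demo — mirror whatever the real dataset's attrs actually say):

```python
import numpy as np
import xarray as xr

no2 = xr.DataArray(
    np.array([2.0e15, 4.0e15]),
    dims="pixel",
    attrs={"units": "molecules/cm^3"},  # assumed units string for the demo
)

# Arithmetic drops attrs by default, so copy and update them explicitly.
rescaled = no2 / 1e15
rescaled.attrs = dict(no2.attrs)
rescaled.attrs["units"] = "10^15 molecules/cm^3"
```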

Exercise 4#

You can probably see the outline of the US in this figure, but it isn’t super clear. To help with readability, it’s often helpful to plot with a vector-based administrative outline.

The easiest way to get US administrative shapefiles is with pygris. For example, you can get a GeoDataFrame with the US Census Bureau’s official national boundary with:

import pygris
us = pygris.nation()

Note that pygris provides all types of geographic vector data for the US, including states, counties, congressional districts, roads, school districts, census blocks, etc.

(The analogue for international data is pygadm. For example, sudan_gdf = pygadm.Items(name="Sudan", content_level=2) gets you Sudanese District shapes. content_level specifies the administrative level you want. India and China seem to be broken because of some issues around contested territories that will hopefully be fixed soon, though.)

Using pygris, add the outline of the US to your plot. You can do this precisely how you overlaid vector geometries on the same matplotlib axes previously. The one trick is that because our raster is using longitude for x coordinates and latitude for y coordinates, you will need to change the CRS of what you get from pygris to epsg 4326. We haven’t talked about the CRS of our raster yet, so just change the CRS of the US GeoDataFrame explicitly (.to_crs(epsg=4326)).
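The overlay pattern looks roughly like this, sketched with a hand-made rectangle standing in for the pygris.nation() output (in your notebook you'd use the real GeoDataFrame; everything here is synthetic):

```python
import matplotlib
matplotlib.use("Agg")  # unnecessary in a notebook
import matplotlib.pyplot as plt
import geopandas as gpd
import numpy as np
import xarray as xr
from shapely.geometry import box

# Toy raster on a lon/lat grid.
raster = xr.DataArray(
    np.random.default_rng(0).random((10, 12)),
    dims=("lat", "lon"),
    coords={"lat": np.linspace(25, 50, 10), "lon": np.linspace(-125, -65, 12)},
)

# Stand-in for pygris.nation(); this demo shape is already lon/lat, but real
# pygris output arrives in a different CRS, so the .to_crs call below matters.
us = gpd.GeoDataFrame(geometry=[box(-120, 28, -70, 48)], crs="EPSG:4326")

fig, ax = plt.subplots()
raster.plot(ax=ax, vmin=0, cmap="viridis")
us.to_crs(epsg=4326).boundary.plot(ax=ax, color="black", linewidth=1)
```

Plotting .boundary rather than the polygons themselves keeps the raster visible underneath the outline.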

Exercise 5#

Let’s suppose we’re only really interested in North Carolina. What can we do with all the rest of this raster data? How do we “select” it to drop it?

Welcome to the world of clipping! Clipping is when you subset your raster, either by specifying a range of x and y values to keep (clipping using a bounding box using the .rio.clip_box() method), or by using vector data to clip a region of your raster like a cookie cutter.

Let’s use a North Carolina GeoDataFrame to clip our raster data to just North Carolina.

First, use pygris to get a GeoDataFrame of North Carolina.

At levels of states and below, pygris offers two types of outlines: TIGER/Line polygons and cartographic polygons (which are clipped to shorelines, making them more accurate, though a little slower to load and more memory-hungry).

Since we probably don’t need to consider air pollution over water, please use the cartographic boundaries.

Exercise 6#

Now try using .rio.clip. Note that to do so, you’ll need to use a slightly strange syntax — basically, .rio.clip doesn’t quite understand GeoDataFrames, so we have to pull out the geometry column (the same way that when using packages that don’t know about pandas, we have to pull out the underlying numpy array with .values).

Give it a shot, but you’ll probably get an error… when you do read on.

Exercise 7#

Raster data often does not come with an embedded CRS the way shapefiles or geoJSON files do. This raster has metadata telling us it’s working with latitudes and longitudes, but that hasn’t actually been formalized into a CRS, which you can see by looking at .rio.crs.

To set the CRS for the raster data, we use the .rio.write_crs() method, and because this is just latitude-longitude, we can use our old friend epsg 4326.

(If you want it to look less squished, plot it onto an ax and then run ax.set_aspect("equal").)

Set the CRS, then clip again.

Plot the clipped data, and store the average value of NO2 for July 2025 under "ex7_mean_nc_july". Round your answer to three decimal places.

Exercise 8#

Great!

OK, next step: we have a full year of data here divided by months. What if we just wanted to see the annual average?

Answer: we’d take the .mean()! Take the mean, but think carefully about the dim keyword and what you want to be averaging over…
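To see the effect of dim, here's a toy example (synthetic values chosen so the averages are easy to check):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two "months" of data on a 2x3 grid: all 1s, then all 3s.
da = xr.DataArray(
    np.stack([np.full((2, 3), 1.0), np.full((2, 3), 3.0)]),
    dims=("time", "lat", "lon"),
    coords={"time": pd.date_range("2024-08-01", periods=2, freq="MS")},
)

# Averaging over time collapses the time dimension, leaving one 2-D grid
# where every cell is that pixel's average across months...
annual = da.mean(dim="time")

# ...while da.mean() with no dim collapses everything to a single number.
overall = da.mean()
```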

Plot the annual average NO2 pollution for NC over this year.

For the autograder, store the mean value of these annual averages — that is, \(\text{mean}_{\text{all pixels } p}\big(\frac{1}{12}\sum_{m\in\text{months}} NO2_{m,p}\big)\) — in "ex8_mean_nc_year". Round your answer to three decimal places.

Exercise 9#

Enough visualizing, let’s do some analyses!

First, get a GeoDataFrame of all counties in North Carolina with pygris. Again, be sure to use cartographic boundaries.

Then, using the xvec package, calculate the mean, median, and maximum NO2 level in each county. Use the annual average pollution values you calculated in Exercise 8, so that the mean, median, and maximum you compute are taken across all the pixels in each county, where each pixel’s value is its average over the year.

Note that, like rioxarray, importing xvec adds .xvec. methods to xarray, so it’s important to import it even though it seems like we aren’t using it.

Put the results back in a GeoDataFrame.

NOTE: Index alignment — that great pandas challenge — will bite you in the butt if you aren’t careful!

WARNING ON PROJECTIONS#

We’ve been taking averages very casually in this exercise, mostly because I want checkpoints for the autograder. Now that we’re talking about zonal statistics (something you’d do for a real analysis), it’s worth noting that when we calculate average NO2, it’s an average over the pixels in an area. But the number of pixels in our raster that are in a given area (such as a county) is shaped in part by the projection we are using.

We haven’t yet discussed reprojecting raster data, so for the moment we’ll leave it in a latitude-longitude grid (epsg 4326). But if this were a real analysis, you’d want to reproject this raster data before calculating these quantities for the same reason you wouldn’t use epsg 4326 to do a GIS analysis with vector data.

Exercise 10#

Now make a histogram of counties’ mean NO2 levels.

Store the mean NO2 for Durham County under "ex10_mean_durham". Round your answer to three decimal places.