Learning about Durham, NC

In this exercise, we will work with demographic data from the US Census Bureau on the Durham area (this exercise is being written primarily for a Duke class, so it feels appropriate).

Gradescope Autograding

Please follow all standard guidance for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called results and ensuring your notebook runs from the start to completion without any errors.

For this assignment, please name your file exercise_census.ipynb before uploading.

You can check that you have answers for all questions in your results dictionary with this code:

assert set(results.keys()) == {
    "ex4_blockgroup_mean",
    "ex4_blockgroup_median",
    "ex4_blockgroup_min",
    "ex4_blockgroup_max",
    "ex6_area_mean",
    "ex6_area_min",
    "ex6_area_max",
    "ex7_pop_density",
    "ex8_density_coef",
}
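
If that assertion fails, it only tells you that the two sets of keys differ, not which keys are the problem. A minimal sketch of a more informative check follows. The partially filled dictionary, its placeholder values, and the misspelled key are all made up for illustration.

```python
# Keys the autograder expects (copied from the assertion above).
expected_keys = {
    "ex4_blockgroup_mean",
    "ex4_blockgroup_median",
    "ex4_blockgroup_min",
    "ex4_blockgroup_max",
    "ex6_area_mean",
    "ex6_area_min",
    "ex6_area_max",
    "ex7_pop_density",
    "ex8_density_coef",
}

# Hypothetical, partially filled dictionary with placeholder values.
results_example = {
    "ex4_blockgroup_mean": 0.0,
    "ex4_blockgroup_median": 0.0,
    "ex6_area_avg": 0.0,  # deliberately misspelled key
}

# Set differences show exactly what is absent and what should not be there.
missing = sorted(expected_keys - set(results_example))
unexpected = sorted(set(results_example) - expected_keys)

print("Missing keys:", missing)
print("Unexpected keys:", unexpected)
```

Swapping your own results dictionary in for results_example would list the keys you still need to add.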

Submission Limits

Please remember that you are only allowed THREE submissions to the autograder. Your last submission (if you submit 3 or fewer times), or your third submission (if you submit more than 3 times), will determine your grade. Submissions that error out will not count against this total.

Exercise 1

We will start with demographic data from the US Census Bureau. In particular, we’ll be working with data from the US Census Bureau’s American Community Survey (ACS) from 2019-2023.

Unlike “the” decennial US census, which is conducted once every ten years and aims to ask questions of every American, the ACS is a regularly run survey of a sample of Americans. And while the decennial census primarily aims to count all Americans and collect some basic demographic information, the ACS asks questions about all sorts of outcomes, like income, employment, education, and health. As a result, it is the main Census dataset used by researchers.

While the ACS is run annually, we will be working with a 5-year sample, in which data from all the surveys conducted between 2019 and 2023 have been pooled. This is done for two reasons. The first is that pooling gives more statistical power for analyzing quantities we don’t expect to change much from year to year. But the bigger reason is that the US Census Bureau worries a lot about privacy, and will never release data they think could allow information about the specific individuals who answered the survey to be inferred. Historically, they have prevented this by aggregating data before release, with the amount of aggregation chosen so that enough individuals contribute to each data point that the data doesn’t leak information about specific individuals. And because the 5-year sample has more people in it, the Census Bureau can provide more granular data without concern that the data will leak information.
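
To make the aggregation idea concrete, here is a toy sketch in which an average is released only for areas with enough respondents. The respondent records, the area names, and the threshold of 3 are all invented for illustration. This is not the Census Bureau's actual disclosure-avoidance procedure.

```python
from collections import defaultdict
from statistics import mean

# Invented respondent-level records: (area, income).
respondents = [
    ("area_a", 52_000), ("area_a", 61_000), ("area_a", 47_000),
    ("area_b", 75_000),
    ("area_c", 39_000), ("area_c", 44_000), ("area_c", 58_000), ("area_c", 41_000),
]

MIN_RESPONDENTS = 3  # hypothetical threshold, chosen only for this example

# Group incomes by area.
incomes_by_area = defaultdict(list)
for area, income in respondents:
    incomes_by_area[area].append(income)

# Release a mean only when enough people contribute to it; otherwise suppress.
released = {
    area: mean(incomes) if len(incomes) >= MIN_RESPONDENTS else None
    for area, incomes in incomes_by_area.items()
}

print(released)
```

In this toy setup, pooling more survey years adds respondents to every area, so fewer areas fall below the threshold and finer geographic detail can be released.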

(You can learn more about all the various forms of data the US Census Bureau publishes online here).

Please load this ACS data into geopandas by reading this file with the gpd.read_file function.

Exercise 2

Use the .plot() method to visualize the data. Based on this visualization, what sample has been included in the data?

Exercise 3

Our interest is JUST in data from Durham County. As you saw in the reading, the best part of geopandas is that it is a superset of pandas: anything you can do in pandas you can do with geopandas! So using your traditional pandas skills, subset the data to Durham County, then plot the data again to make sure your subsetting worked. (If you don’t know what Durham County looks like, you can google it to make sure you got what you wanted.)

Exercise 4

When government agencies release data that has been geocoded like this, they generally don’t release information on the precise locations of individual respondents (again, to protect people’s privacy). Instead, they will either:

  1. Release data on individual respondents’ answers, but only provide approximate geographic information on the location of each respondent (like the county or zip code in which they reside), or

  2. Pre-aggregate responses from individuals within a geographic area and release those aggregated statistics along with the geographic bounds of the area being aggregated.

This data is an example of #2 — each observation in this data is one “census block group,” and all the statistics in the data have been aggregated up to that level.

Block groups are the second most granular geographic unit used by the US Census Bureau, and the smallest unit for which most ACS data is published. (Blocks are the smallest unit.)

Calculate population statistics for the block groups in Durham County. In particular, calculate mean, median, min and max populations.

Store your answers in your results dictionary under the keys ex4_blockgroup_mean, ex4_blockgroup_median, ex4_blockgroup_min, and ex4_blockgroup_max.

Can’t tell what column contains population counts? Welcome to US Census data! Census variable names are famously awful. You’ll probably need to use the codebook for this data, which you can find here.

Note: Some of the variables in the data are estimates of the actual values for the things we care about, while other variables provide data on the level of uncertainty surrounding these estimates. Estimated values will generally have an E before the last three digits of the variable name, while the data on uncertainty (the “margins of error”) will have an M. We can ignore the data on uncertainty for now.
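
The naming pattern described in this note can be checked programmatically. Here is a small sketch, assuming the relevant variable names end in an E or M followed by exactly three digits. The example names are made up, so consult the codebook for the real ones.

```python
import re

# An E or M immediately followed by exactly three digits at the end of the name.
SUFFIX = re.compile(r"([EM])\d{3}$")

def classify(name):
    """Return 'estimate', 'margin of error', or 'other' for a column name."""
    match = SUFFIX.search(name)
    if match is None:
        return "other"
    return "estimate" if match.group(1) == "E" else "margin of error"

# Hypothetical column names, not taken from the actual data.
for name in ["XXXXE001", "XXXXM001", "GISJOIN", "geometry"]:
    print(name, "->", classify(name))
```

Applying a function like this to the column names of your data is one way to list only the estimate columns before searching the codebook.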

Note 2: In this data, each row is a block group. But it’s easy to get confused about this if you look at the BLKGRPA variable, which only has 8 unique values. The reason is that census identifiers are built by concatenating hierarchical codes, and BLKGRPA is just the last digit of those codes, so it only uniquely identifies block groups within each census tract. Getting a fully unique block group identifier requires concatenating the state code, the county code, the census tract code, and the block group code. In fact, you already have that series of concatenations! It’s the GISJOIN variable after the first G: 37 is the state FIPS code for North Carolina, 063 is the Durham County FIPS code within North Carolina (each is followed by a padding 0 in GISJOIN), etc.
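
The hierarchy inside an identifier can be pulled apart with string slicing. The sketch below assumes the NHGIS block group layout: a leading G, a 2-digit state FIPS code plus a padding 0, a 3-digit county FIPS code plus a padding 0, a 6-digit tract code, and a 1-digit block group code. The identifier used is an illustrative value, not necessarily a row in your data, and the function name is hypothetical.

```python
def parse_blockgroup_gisjoin(gisjoin):
    """Split a block group GISJOIN into its parts (assumed 15-character layout)."""
    if len(gisjoin) != 15 or not gisjoin.startswith("G"):
        raise ValueError(f"Unexpected GISJOIN format: {gisjoin!r}")
    return {
        "state": gisjoin[1:3],        # 2-digit state FIPS code
        "county": gisjoin[4:7],       # 3-digit county FIPS code (after padding 0)
        "tract": gisjoin[8:14],       # 6-digit tract code (after padding 0)
        "block_group": gisjoin[14],   # 1-digit block group code
    }

# Illustrative identifier following the assumed layout.
parts = parse_blockgroup_gisjoin("G37006300001011")
print(parts)
```

Under this layout the last character is the block group digit, which is why a column holding only that digit repeats across tracts.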

Exercise 5

Let’s do some mapping!

Create a choropleth map (a map of polygons in which the color of each polygon corresponds to an attribute of the polygon) of per capita incomes in Durham. Be sure to include a legend! Note that you will almost certainly want to use the codebook again.

NOTE: Make sure to look at the values of your variable before you use it. Remember that the Census Bureau often uses sentinel values (unusual numeric values) to indicate missing data.
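
To see why this matters, here is a small sketch of how a single sentinel value can wreck a summary statistic. The sentinel of -666666666 and the income figures are assumptions for illustration. Inspect your own column to find out what value, if any, marks missing data.

```python
from statistics import mean

SENTINEL = -666666666  # assumed sentinel for this example; check your own data

# Invented per capita incomes, two of which are "missing".
incomes = [31_000, 54_500, SENTINEL, 42_250, SENTINEL]

# Treating the sentinel as a real number drags the mean far below zero.
naive_mean = mean(incomes)

# Dropping sentinel entries first gives a sensible answer.
cleaned = [value for value in incomes if value != SENTINEL]
clean_mean = mean(cleaned)

print("Naive mean:", naive_mean)
print("Mean after removing sentinels:", clean_mean)
```

In pandas, the analogous step is usually to replace the sentinel with a missing value (NaN) before averaging or plotting, so that it does not distort statistics or the color scale of a map.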

What are the wealthiest parts of Durham?

Exercise 6

One nice feature of a library like GeoPandas is that many spatial attributes of census block groups can be easily accessed. For example, use the .area attribute to get the area of each block group.

Store the min, max, and mean Durham block group sizes in results under ex6_area_min, ex6_area_max, ex6_area_mean. Do you find most block groups are similar in size?

(We’ll talk about how you know this later, but the units are square meters.)
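
For intuition about what a planar area calculation does, here is a small sketch of the shoelace formula for a simple polygon. It assumes the coordinates are already in a projected coordinate system measured in meters, and the rectangle is invented. This illustrates the idea only and is not a description of how GeoPandas computes areas internally.

```python
def shoelace_area(vertices):
    """Planar area of a simple polygon given its (x, y) vertices in order."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# Invented 400 m by 250 m rectangle, coordinates in meters.
rectangle = [(0, 0), (400, 0), (400, 250), (0, 250)]
area_m2 = shoelace_area(rectangle)

print("Area in square meters:", area_m2)
print("Area in square kilometers:", area_m2 / 1_000_000)
```

Because the result is in the squared units of the coordinates, dividing by 1,000,000 converts square meters to square kilometers, which are often easier to read.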

Exercise 7

Using these areas and population data, calculate the population density (people per square meter) of each polygon. What is the mean population density for Durham block groups? Store it as ex7_pop_density.
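
Note that the mean of the per-polygon densities is generally not the same number as total population divided by total area. A minimal sketch of the difference, using three invented block groups:

```python
from statistics import mean

# Invented block groups: (population, area in square meters).
block_groups = [(1_200, 400_000.0), (800, 2_000_000.0), (2_500, 500_000.0)]

# Density of each polygon, then the unweighted mean across polygons.
densities = [pop / area for pop, area in block_groups]
mean_of_densities = mean(densities)

# Overall density: everyone divided by all the land.
total_pop = sum(pop for pop, _ in block_groups)
total_area = sum(area for _, area in block_groups)
overall_density = total_pop / total_area

print("Mean of block group densities:", mean_of_densities)
print("Total population / total area:", overall_density)
```

The first number weights every block group equally, while the second weights by area, so large, sparsely populated polygons pull it down. Per-square-meter values are also tiny, so multiplying by 1,000,000 to get people per square kilometer can make them easier to sanity-check.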

Exercise 8

You’ve now used geospatial census data to do some mapping, get to know census geographies, and calculate some spatial statistics. Now let’s do some statistical inference.

Regress average per capita income on population density for Durham. Store the resulting coefficient on population density in ex8_density_coef. What does this say about how income is distributed in Durham County?
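
“Regress income on population density” means income is the outcome and density is the explanatory variable. Here is a minimal sketch of the slope calculation for a one-variable least-squares regression, using invented numbers and a hypothetical helper function. In practice you would likely use a regression library rather than hand-rolled arithmetic.

```python
from statistics import mean

def ols_slope_intercept(x, y):
    """Least-squares slope and intercept for the regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Invented data: density in people per square meter, income in dollars.
density = [0.001, 0.002, 0.003, 0.004]
income = [50_000.0, 46_000.0, 42_000.0, 38_000.0]

slope, intercept = ols_slope_intercept(density, income)
print("Coefficient on density:", slope)
print("Intercept:", intercept)
```

With density measured per square meter, a one-unit change is an enormous change in density, so the coefficient tends to be very large in magnitude. Measuring density per square kilometer instead would divide the coefficient by 1,000,000 without changing the fitted relationship.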