Cleaning Data Exercises

In this exercise, we’ll be returning to the American Community Survey data we used previously to measuring racial income inequality in the United States. In today’s exercise, we’ll be using it to measure the returns to education and how those returns vary by race and gender.

Gradescope Autograding

This assignment will be autograded using the same autograder we used for the Missing Values exercise. As was the case last time, the only requirements for submitting this assignment are:

  • you only import pandas, numpy, and matplotlib.

  • you import data from a URL (since the autograder in the cloud won’t see your file system)

  • you name your notebook file exercise_cleaning.ipynb

  • you store your solutions in a dictionary called results with answers assigned to the keys provided in the prompts below.

  • your notebook runs from the start to completion without any errors. That means that for questions that (deliberately) invite you to write code that causes errors, before submitting comment out the code that generates the error to ensure it won’t interrupt notebook execution.

Your notebook will be executed by the autograder and the value of identified dictionary entries will then be checked against solution values. However, be aware TAs will also review your notebooks for answers to other questions and to ensure your notebook is well written and formatted.

You can check that you have answers for all questions in your results dictionary with this code:

assert set(results.keys()) == {

Submission Limits

Please remember that you are only allowed three submissions to the autograder. Your last submission (if you submit 3 or fewer times), or your third submission (if you submit more than 3 times) will determine your grade.


Exercise 1

For these cleaning exercises, we’ll return to the ACS data we’ve used before one last time. We’ll be working with US_ACS_2017_10pct_sample.dta. Import the data (please use url for the autograder).

Exercise 2

For our exercises today, we’ll focus on age, sex, educ (education), and inctot (total income). Subset your data to those variables, and quickly look at a sample of 10 rows.

Exercise 3

As before, all the values of 9999999 have the potential to cause us real problems, so replace all the values of inctot that are 9999999 with np.nan.

Exercise 4

Attempt to calculate the average age of people in our data. What do you get? Why are you getting that error?


Then talk about the error in a markdown cell.

Exercise 5

We want to be able to calculate things using age, so we need it to be a numeric type. Check the current type of age, and look at all the values of age to figure out why it’s categorical and not numeric. You should find two problematic categories. Store the values of these categories in "ex5_age_young" and "ex5_age_old" (once you find them, it should be clear which is which).

Exercise 6

In order to convert age into a numeric variable, we need to replace those problematic entries with values that pandas can later convert into numbers. Pick appropriate substitutions for the existing values and replace the current values.

Hint 1: Categorical variables act like strings, so you might want to use string methods!

Hint 2: Remember that characters like parentheses, pluses, asterisks, etc. are special in Python strings, and you have to escape them if you want them to be interpreted literally!

Hint 3: Because the US Census has been conducted regularly for hundreds of years but exactly how the census has been conducted have occasionally changed, variables are sometimes coded in a way that might be interpreted in different ways for different census years. For example, hypothetically, one might write 90 (90+ in 1980 and 1990) if the Censuses conducted in 1980 and 1990 used to top-code age at 90 (any values over 90 were just coded as 90), but more recent Censuses no longer top-coded age and recorded ages over 90 as the respondents actual age.

Exercise 7

Now convert age from a categorical to numeric. Calculate the average age amoung this group, and store it in "ex7_avg_age".

Exercise 8

Let’s now filter out anyone in our data whose age is less than 18. Note that before made age a numeric variable, we couldn’t do this! again, calculate the average age and this time store it in "ex8_avg_age".

Use this sample of people 18 and over for all subsequent exercises.

Exercise 9

Create an indicator variable for whether each person has at least a college Bachelor’s degree called college_degree. Use this variable to calculate the number of people in the dataset with a college degree. You may assume that to get a college degree you need to complete at least 4 years of college. Save the result as "ex9_num_college".

Exercise 10

Let’s examine how the educational gender gap. Use pd.crosstab to create a cross-tabulation of sex and college_degree. pd.crosstab will give you the number of people who have each combination of sex and college_degree (so in this case, it will give us a 2x2 table with Male and Female as rows, and college_degree True and False as columns, or vice versa.

Exercise 11

Counts are kind of hard to interpret. pd.crosstab can also normalize values to give percentages. Look at the pd.crosstab help file to figure out how to normalize the values in the table. Normalize them so that you get the share of men with and without college degree, and the share of women with and without college degrees.

Store the share (between 0 and 1) of men with college degrees in "ex11_share_male_w_degrees", and the share of women with degrees in "ex11_share_female_w_degrees".

Exercise 12

Now, let’s recreate that table for people who are 40 and over and people under 40. Over time, what does this suggest about the absolute difference in the share of men and women earning college degrees? Has it gotten larger, stayed the same, or gotten smaller? Store your answer (either "the absolute difference has increased" or "the absolute difference has decreased") in "ex12_comparing".

Exercise 13

In words, what is causing the change noted in Exercise 12 (i.e., looking at the tables above, tell me a story about Men and Women’s College attainment).

Want More Practice?

Calculate the educational racial gap in the United States for White Americans, Black Americans, Hispanic Americans, and other groups.

Note that to do these calculations, you’ll have to deal with the fact that unlike most Americans, the American Census Bureau treats “Hispanic” not as a racial category, but a linguistic one. As a result, the racial category “White” in race actually includes most Hispanic Americans. For this analysis, we wish to work with the mutually exclusive categories of “White, non-Hispanic”, “White, Hispanic”, “Black (Hispanic or non-Hispanic)”, and a category for everyone else.