Cleaning Data

1) For our data cleaning exercises, we will return one last time to our ACS data. Download and import the 10 percent ACS sample.

2) For our exercises today, we’ll focus on age, sex, educ, and inctot. Subset your data to those variables, and quickly look at a sample of 10 rows.

3) First, replace all the values of inctot that are 9999999 with np.nan.
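
As a rough sketch of the idea in step 3, here is the same kind of replacement on a tiny made-up DataFrame. The name toy and its values are hypothetical stand-ins, not the ACS data; only the 9999999 code and the inctot column name come from the exercise.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the ACS sample; these incomes are made up.
toy = pd.DataFrame({"inctot": [12000, 9999999, 45000, 9999999]})

# Swap the sentinel code for a true missing value.
toy["inctot"] = toy["inctot"].replace(9999999, np.nan)

# Count how many values are now missing.
n_missing = toy["inctot"].isna().sum()
print(toy)
print(n_missing)
```

Once the sentinel is a real missing value, summary statistics such as the mean skip it instead of treating it as an enormous income.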

4) So we know how data is being stored, check the dtypes of all the variables we are working with. What is the dtype of age?

5) We want to be able to calculate things using age, so we need it to be a numeric type. Check all the values of age to figure out why it’s categorical and not numeric. You should find two problematic categories.

6) In order to convert age into a numeric variable, we need to replace those problematic entries with values that pandas can later convert into numbers. Pick appropriate substitutions for the existing values and replace the current values. Hint 1: Categorical variables act like strings, so you might want to use string methods! Hint 2: Remember that characters like parentheses, pluses, asterisks, etc. are special in regular expressions, which pandas string methods can use for pattern matching, and you have to escape them if you want them to be interpreted literally!
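
To illustrate Hint 2, here is a minimal sketch of the two ways to handle special characters with Series.str.replace. The labels "65 (65+)" and "under 1" are invented for this example and are not the categories you will find in the ACS data.

```python
import pandas as pd

# Hypothetical labels, not the real ACS categories.
ages = pd.Series(["34", "7", "65 (65+)", "under 1"], dtype="category")

# regex=False treats the pattern as literal text,
# so "(" and "+" need no escaping.
literal = ages.str.replace("65 (65+)", "65", regex=False)

# With regex=True the same characters are regex operators,
# so each one is escaped with a backslash.
escaped = ages.str.replace(r"65 \(65\+\)", "65", regex=True)

# Replace the second made-up label as well.
cleaned = escaped.str.replace("under 1", "0", regex=False)
print(cleaned.tolist())
```

Either approach gives the same result here; passing regex explicitly avoids depending on the default, which has differed between pandas versions.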

7) Now convert age from a categorical to numeric.
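
One possible way to do the conversion in step 7, shown on a made-up Series whose labels all look like numbers already. The name cleaned and its values are hypothetical.

```python
import pandas as pd

# Hypothetical, already-cleaned labels stored as a categorical.
cleaned = pd.Series(["34", "7", "65", "0"], dtype="category")

# Convert to plain strings first, then parse the strings as numbers.
numeric = pd.to_numeric(cleaned.astype(str))

print(numeric.dtype)
print(numeric.mean())
```

pd.to_numeric raises an error if any label still cannot be parsed, which makes it a handy check that no problematic category was missed.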

8) Let’s now filter out anyone in our data whose age is less than 18. Note that before we made age a numeric variable, we couldn’t do this!

9) Create an indicator variable for whether each person has at least a college degree called college_degree.
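
A minimal sketch of building an indicator like the one in step 9 with isin. The educ labels below are assumptions made up for illustration; check the actual categories in your data before deciding which ones count as a college degree.

```python
import pandas as pd

# Hypothetical education labels; the real ACS categories may differ.
toy = pd.DataFrame(
    {"educ": ["grade 12", "4 years of college", "5+ years of college", "grade 9"]}
)

# List the (assumed) categories that count as at least a college degree.
college_levels = ["4 years of college", "5+ years of college"]

# isin returns True where educ is one of the listed categories.
toy["college_degree"] = toy["educ"].isin(college_levels)
print(toy)
```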

10) Let’s examine the educational gender gap. Use pd.crosstab to create a cross-tabulation of sex and college_degree. pd.crosstab will give you the number of people who have each combination of sex and college_degree (so in this case, it will give us a 2x2 table with Male and Female as rows, and college_degree True and False as columns, or vice versa).

11) Counts are kind of hard to interpret. pd.crosstab can also normalize values to give percentages. Look at the pd.crosstab help file to figure out how to normalize the values in the table. Normalize them so that you get the share of men with and without college degrees, and the share of women with and without college degrees.
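
As a sketch of what row-wise normalization looks like, here is pd.crosstab on eight made-up people. The data are invented; with sex as the rows, normalize="index" makes each row sum to 1.

```python
import pandas as pd

# Eight made-up people: four men and four women.
toy = pd.DataFrame(
    {
        "sex": ["Male"] * 4 + ["Female"] * 4,
        "college_degree": [True, False, False, False, True, True, False, False],
    }
)

# Raw counts for each combination of sex and college_degree.
counts = pd.crosstab(toy["sex"], toy["college_degree"])

# normalize="index" divides each row by its row total,
# giving shares within each sex.
shares = pd.crosstab(toy["sex"], toy["college_degree"], normalize="index")

print(counts)
print(shares)
```

If you put college_degree in the rows instead, normalize="columns" would be the option that gives shares within each sex.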

12) Now, let’s recreate that table for people over 40 and people under 40. Has the difference between men and women in terms of getting a college degree improved, stayed the same, or worsened?
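
One way to organize the comparison in step 12 is to wrap the normalized cross-tabulation in a small helper and apply it to each age group. The helper name degree_shares and the data below are hypothetical. Note that with strict comparisons, people who are exactly 40 fall in neither group, so decide which side they belong on.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame(
    {
        "age": [25, 30, 35, 28, 50, 60, 45, 70],
        "sex": ["Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"],
        "college_degree": [True, True, False, True, True, False, False, False],
    }
)


def degree_shares(df):
    """Share of each sex with and without a college degree (rows sum to 1)."""
    return pd.crosstab(df["sex"], df["college_degree"], normalize="index")


# Build the same table separately for each age group.
under_40 = degree_shares(toy[toy["age"] < 40])
over_40 = degree_shares(toy[toy["age"] > 40])

print(under_40)
print(over_40)
```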