# Cleaning Data

**1)** For our data cleaning exercises, we will return one last time to our ACS data here. Download and import the 10 percent ACS sample.

**2)** For our exercises today, we’ll focus on `age`, `sex`, `educ`, and `inctot`. Subset your data to those variables, and quickly look at a sample of 10 rows.
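As a rough sketch of subsetting and sampling, here is the pattern on a small made-up DataFrame. The values and the `extra_column` name are invented for illustration; your real data will have many more columns.

```python
import pandas as pd

# Toy stand-in for the ACS sample (all values invented for illustration).
n = 12
toy = pd.DataFrame({
    "age": [str(20 + i) for i in range(n)],
    "sex": ["male", "female"] * (n // 2),
    "educ": ["grade 12"] * n,
    "inctot": [1000 * i for i in range(n)],
    "extra_column": ["not needed"] * n,  # hypothetical column we want to drop
})

# Keep only the columns of interest by passing a list of names.
subset = toy[["age", "sex", "educ", "inctot"]]

# Draw 10 random rows; random_state just makes the draw repeatable.
preview = subset.sample(10, random_state=0)
print(preview)
```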

**3)** First, replace all the values of `inctot` that are 9999999 with `np.nan`.
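One possible way to swap a sentinel code for a missing value is `Series.replace`. The sketch below uses a tiny invented column rather than the real data.

```python
import numpy as np
import pandas as pd

# Toy income column (values invented); 9999999 plays the role of the sentinel.
toy = pd.DataFrame({"inctot": [25000, 9999999, 0, 9999999, 61000]})

# Replace the sentinel with a true missing value.
toy["inctot"] = toy["inctot"].replace(9999999, np.nan)

# Count how many values are now missing.
print(toy["inctot"].isna().sum())
```

Note that the column becomes a float column, since `np.nan` is a floating point value.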

**4)** So we know how data is being stored, check the `dtypes` of all the variables we are working with. What is the `dtype` of `age`?
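For reference, this is roughly what checking types looks like on a toy DataFrame. The data is invented, and `age` is deliberately built as a categorical so the output has something interesting to show.

```python
import pandas as pd

# Toy data (invented): age stored as a categorical, income as floats.
toy = pd.DataFrame({
    "age": pd.Categorical(["34", "less than 1 year old", "52"]),
    "inctot": [25000.0, None, 61000.0],
})

# One dtype per column.
print(toy.dtypes)

# The dtype of a single column.
print(toy["age"].dtype)
```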

**5)** We want to be able to calculate things using age, so we need it to be a numeric type. Check all the values of `age` to figure out why it’s categorical and not numeric. You should find two problematic categories.
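If scanning the full list of values by eye is tedious, one option is to ask `pandas` which labels it cannot parse as numbers. This is only a sketch on invented labels; the labels in your data may look different.

```python
import pandas as pd

# Toy categorical ages (labels invented for illustration).
ages = pd.Series(pd.Categorical(["34", "less than 1 year old", "52", "90 (90+)"]))

# Look at the distinct labels rather than every row.
labels = pd.Series(ages.cat.categories)

# errors="coerce" turns anything that is not a number into NaN,
# so the NaN positions point at the suspicious labels.
non_numeric = labels[pd.to_numeric(labels, errors="coerce").isna()]
print(non_numeric.tolist())
```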

**6)** In order to convert `age` into a numeric variable, we need to replace those problematic entries with values that `pandas` can later convert into numbers. Pick appropriate substitutions for the existing values and replace the current values. **Hint 1:** Categorical variables act like strings, so you might want to use string methods! **Hint 2:** Remember that characters like parentheses, pluses, asterisks, etc. are special in regular expressions, which `pandas` string methods can use for pattern matching, so you have to escape them if you want them to be interpreted literally!
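Here is a small sketch of two ways to deal with special characters in `.str.replace`: turning off pattern matching with `regex=False`, or escaping the characters with backslashes. The labels and the substitutions chosen are invented for illustration, so pick values that make sense for your own data.

```python
import pandas as pd

# Toy categorical ages (labels invented for illustration).
ages = pd.Series(pd.Categorical(["34", "90 (90+)", "less than 1 year old"]))

# Work on plain strings so string methods behave predictably.
as_text = ages.astype(str)

# Option 1: treat the pattern as a literal string, so nothing needs escaping.
cleaned = as_text.str.replace("90 (90+)", "90", regex=False)

# Option 2: use a regular expression and escape the special characters.
cleaned_regex = as_text.str.replace(r"90 \(90\+\)", "90", regex=True)

# Plain text without special characters works the same either way.
cleaned = cleaned.str.replace("less than 1 year old", "0", regex=False)
print(cleaned.tolist())
```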

**7)** Now convert `age` from a categorical to a numeric variable.
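One way to do the conversion, sketched on invented values, is to go through strings and then call `pd.to_numeric`. This assumes every label already looks like a number.

```python
import pandas as pd

# Toy categorical whose labels are all number-like strings (invented values).
cleaned = pd.Series(["34", "90", "0"], dtype="category")

# Convert to plain strings first, then parse the strings as numbers.
numeric_age = pd.to_numeric(cleaned.astype(str))
print(numeric_age.dtype)
```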

**8)** Let’s now filter out anyone in our data whose age is less than 18. Note that before we made `age` a numeric variable, we couldn’t do this!
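Filtering on a numeric column usually comes down to a boolean mask. In the sketch below the data is invented, and the `cutoff` variable and the direction of the comparison are placeholders: flip the comparison to keep the other side of the cutoff.

```python
import pandas as pd

# Toy numeric ages (invented values).
toy = pd.DataFrame({"age": [12, 18, 45, 7, 63]})

cutoff = 18  # placeholder threshold for this illustration

# The comparison produces True/False per row; indexing keeps the True rows.
kept = toy[toy["age"] >= cutoff]
print(kept)
```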

**9)** Create an indicator variable called `college_degree` for whether each person has at least a college degree.
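An indicator variable can be built with `.isin`, as in this sketch. The education labels here are assumptions made up for the example, so check the actual categories of `educ` in your data before deciding which ones count as a college degree.

```python
import pandas as pd

# Toy education column (labels are assumptions, not the real categories).
toy = pd.DataFrame({
    "educ": ["grade 12", "4 years of college", "5+ years of college", "grade 10"],
})

# Hypothetical list of labels that count as "at least a college degree".
college_levels = ["4 years of college", "5+ years of college"]

# True where educ is one of the listed labels, False otherwise.
toy["college_degree"] = toy["educ"].isin(college_levels)
print(toy)
```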

**10)** Let’s examine the educational gender gap. Use `pd.crosstab` to create a cross-tabulation of `sex` and `college_degree`. `pd.crosstab` will give you the number of people who have each combination of `sex` and `college_degree` (so in this case, it will give us a 2x2 table with Male and Female as rows, and `college_degree` True and False as columns, or vice versa).
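To see the shape of what `pd.crosstab` returns, here is a sketch on six invented rows.

```python
import pandas as pd

# Toy data (invented values).
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female", "male"],
    "college_degree": [True, True, False, False, True, False],
})

# First argument becomes the rows, second becomes the columns.
counts = pd.crosstab(toy["sex"], toy["college_degree"])
print(counts)
```

Swapping the two arguments gives the same counts with rows and columns exchanged.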

**11)** Counts are kind of hard to interpret. `pd.crosstab` can also normalize values to give percentages. Look at the `pd.crosstab` help file to figure out how to normalize the values in the table. Normalize them so that you get the share of men with and without college degrees, and the share of women with and without college degrees.
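As a sketch of what normalization does, the example below uses the `normalize` argument of `pd.crosstab` on invented data, with `sex` as the rows. With that layout, `normalize="index"` makes each row sum to 1; if you put `sex` in the columns instead, you would want `normalize="columns"`.

```python
import pandas as pd

# Toy data (invented values).
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female", "male"],
    "college_degree": [True, True, False, False, True, False],
})

# Shares within each row: each sex's row adds up to 1.
shares = pd.crosstab(toy["sex"], toy["college_degree"], normalize="index")
print(shares)
```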

**12)** Now, let’s recreate that table for people over 40 and people under 40. Has the difference between men and women in terms of getting a college degree improved, stayed the same, or worsened?
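One way to organize this comparison is to wrap the table in a small helper and call it on each age group. The helper name `degree_shares` and all the values below are invented for illustration, and the toy numbers say nothing about the real answer.

```python
import pandas as pd

# Toy data (invented values).
toy = pd.DataFrame({
    "age": [25, 31, 38, 29, 45, 52, 60, 47],
    "sex": ["male", "female", "male", "female", "male", "female", "male", "female"],
    "college_degree": [False, True, True, True, True, False, False, False],
})


def degree_shares(df):
    """Share with and without a degree within each sex (rows sum to 1)."""
    return pd.crosstab(df["sex"], df["college_degree"], normalize="index")


# Build the same table separately for each age group.
under_40 = degree_shares(toy[toy["age"] < 40])
over_40 = degree_shares(toy[toy["age"] > 40])

print(under_40)
print(over_40)
```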