# Cleaning Data

**1)** For our data cleaning exercises, we will return one last time to our ACS data. Download and import the 10 percent ACS sample.

**2)** For our exercises today, we’ll focus on `age`, `sex`, `educ`, and `inctot`. Subset your data to those variables, and quickly look at a sample of 10 rows.

**3)** First, replace all the values of `inctot` that are 9999999 with `np.nan`.
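As a reminder of the general pattern, here is a minimal sketch on toy data rather than the ACS sample. The DataFrame `toy` and its `income` column are made up for illustration; only the sentinel value 9999999 comes from the exercise.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 9999999 is a placeholder code for "missing".
toy = pd.DataFrame({"income": [52000, 9999999, 18000, 9999999]})

# Swap the placeholder for a true missing value.
toy["income"] = toy["income"].replace(9999999, np.nan)

# pandas skips NaN by default when computing summary statistics,
# so the placeholder no longer inflates the mean.
mean_income = toy["income"].mean()
print(toy)
print(mean_income)
```

Leaving the placeholder in place would silently distort any average computed on the column, which is why this step comes first.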

**4)** Calculate the average age of people in our data. What do you get?

**5)** We want to be able to calculate things using age, so we need it to be a numeric type. Check all the values of `age` to figure out why it’s categorical and not numeric. You should find two problematic categories.

**6)** In order to convert `age` into a numeric variable, we need to replace those problematic entries with values that `pandas` can later convert into numbers. Pick appropriate substitutions for the existing values and replace the current values. **Hint 1:** Categorical variables act like strings, so you might want to use string methods! **Hint 2:** Remember that characters like parentheses, pluses, asterisks, etc. are special when a string method treats its pattern as a regular expression, and you have to escape them if you want them to be interpreted literally!
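To illustrate Hint 2, here is a small sketch on made-up labels (not the actual `age` categories). It shows two ways to match a special character literally with `Series.str.replace`: turning regex matching off, or escaping the character in a regex pattern. The Series `labels` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical labels that contain characters with special regex meaning.
labels = pd.Series(["1*", "2+", "3"], dtype="category")

# Option 1: regex=False treats the pattern as plain text, so "+" is literal.
no_plus = labels.astype(str).str.replace("+", "", regex=False)

# Option 2: with regex=True, escape the special character with a backslash.
cleaned = no_plus.str.replace(r"\*", "", regex=True)

# Once only digits remain, the strings can be converted to numbers.
numbers = pd.to_numeric(cleaned)
print(numbers)
```

Passing `regex` explicitly is a deliberate choice here, since the default value of that argument has differed across `pandas` versions.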

**7)** Now convert `age` from a categorical to a numeric variable.

**8)** Let’s now filter out anyone in our data whose age is less than 18. Note that before we made `age` a numeric variable, we couldn’t do this!

**9)** Create an indicator variable for whether each person has at least a college degree called `college_degree`.

**10)** Let’s examine the educational gender gap. Use `pd.crosstab` to create a cross-tabulation of `sex` and `college_degree`. `pd.crosstab` will give you the number of people who have each combination of `sex` and `college_degree` (so in this case, it will give us a 2x2 table with Male and Female as rows, and `college_degree` True and False as columns, or vice versa).
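If `pd.crosstab` is new to you, the sketch below shows the shape of its output on a tiny made-up dataset. The columns `group` and `flag` are hypothetical stand-ins, not the ACS variables.

```python
import pandas as pd

# Toy data (hypothetical): one categorical column and one boolean column.
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "flag": [True, False, True, True, False],
})

# Rows come from the first argument, columns from the second;
# each cell counts the rows with that combination of values.
counts = pd.crosstab(toy["group"], toy["flag"])
print(counts)
```

Swapping the two arguments transposes the table, which is the "or vice versa" case mentioned above.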

**11)** Counts are kind of hard to interpret. `pd.crosstab` can also normalize values to give percentages. Look at the `pd.crosstab` help file to figure out how to normalize the values in the table. Normalize them so that you get the share of men with and without college degrees, and the share of women with and without college degrees.

**12)** Now, let’s recreate that table for people over 40 and people under 40. Has the difference between men and women in terms of getting a college degree improved, stayed the same, or worsened?

## Want More Practice?

Calculate the educational racial gap in the United States for White Americans, Black Americans, Hispanic Americans, and other groups.

Note that to do these calculations, you’ll have to deal with the fact that unlike most Americans, the American Census Bureau treats “Hispanic” not as a racial category, but a linguistic one. As a result, the racial category “White” in `race` actually includes most Hispanic Americans. For this analysis, we wish to work with the mutually exclusive categories of “White, non-Hispanic”, “White, Hispanic”, “Black (Hispanic or non-Hispanic)”, and a category for everyone else.
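One possible way to build mutually exclusive groups from two overlapping variables is `np.select`, sketched below on toy data. The column names `race_label` and `is_hispanic` and their values are assumptions made for illustration; the actual ACS variables and their coding will differ, so check them before adapting this.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical columns and values, not the real ACS coding).
toy = pd.DataFrame({
    "race_label": ["White", "White", "Black", "Black", "Asian"],
    "is_hispanic": [False, True, False, True, False],
})

# Conditions are checked in order; the first one that is True wins.
conditions = [
    (toy["race_label"] == "White") & ~toy["is_hispanic"],
    (toy["race_label"] == "White") & toy["is_hispanic"],
    toy["race_label"] == "Black",
]
choices = ["White, non-Hispanic", "White, Hispanic", "Black"]

# Rows matching none of the conditions fall into the default group.
toy["group"] = np.select(conditions, choices, default="Other")
print(toy)
```

Because every row receives exactly one label, the resulting column can be passed straight to `pd.crosstab`.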

## Absolutely positively need the solutions?

*Don’t use this link until you’ve really, really spent time struggling with your code!* Doing so only results in you cheating yourself.