Changing Data Values to Clean Data

In a perfect world, our datasets would all arrive containing only accurate data formatted in exactly the way we need. Sadly, the world we live in is not perfect. Whether because of clumsy data entry, bad translation between data formats, survey participants trying to be jerks, or the coding errors of other data scientists, messy data is a fact of life. As a result, a key part of being a data scientist is learning to (a) seek out and identify data cleanliness problems, and (b) correct them. These parts of data science are not nearly as cool or widely discussed as new machine learning packages or the latest learning algorithm, but as anyone who has done data science in the real world will tell you, cleaning (plus merging and reshaping, which we’ll cover soon) will take up the vast majority of your working life. So in this lesson, we’ll talk about the tools pandas provides for identifying and correcting problems with your data!

Identifying Problems

Everything I said above may sound… well, awful, but here’s the good news: once you get into it, data cleaning begins to feel less like grunt work and more like being a detective, and it can actually be a lot of fun. In some of the examples below, I’ll show you problems in small toy datasets where you can easily spot the issues just by looking at the data. In real data, however, things aren’t so easy: when your data has hundreds of thousands of rows, or when the errors are subtle, learning to find and isolate problems can be genuinely challenging (which is part of the fun)!

Here are a few helpful functions we can demonstrate with our old friend the American Community Survey:

import pandas as pd
acs = pd.read_stata("")
acs = acs[['year', 'sex', 'age', 'inctot', 'empstat']]

First is .sample(): while it’s tempting to use .head() to look at the top of your data, it’s usually more valuable to look at a random sample of rows. Data is often sorted, so the first rows are unlikely to be representative:

import numpy.random as npr

npr.seed(42)  # Setting a seed so we get the same results
              # every time I run this notebook, so
              # I can write about what we see
              # and that won't change.

acs.sample(5)
year sex age inctot empstat
166590 2017 male 62 170000 employed
207895 2017 female 6 9999999 n/a
214500 2017 male 18 0 not in labor force
28863 2017 female less than 1 year old 9999999 n/a
18280 2017 female 11 9999999 n/a

See? Already we can spot some weird values in inctot (9999999), ages that include text entries like “less than 1 year old”, and some n/a values in empstat.
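To make the idea concrete, here’s a self-contained sketch using an invented toy DataFrame (the ACS file itself isn’t bundled here; the column names mimic it, the values are made up). Seeding numpy makes the sampled rows reproducible, since .sample() draws from numpy’s random state when no random_state is passed:

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # seed so the same rows come back every run

# Invented toy data standing in for the ACS extract
toy = pd.DataFrame({
    "age": ["62", "6", "18", "less than 1 year old", "11", "40"],
    "inctot": [170000, 9999999, 0, 9999999, 9999999, 52000],
})

# A random sample is usually more representative than .head(),
# since files are often sorted.
print(toy.sample(3))
```

Because the seed is set before sampling, re-running the whole cell always shows the same three rows.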

We can also use .describe(). When you do, look in particular at the minimum and maximum values:

count 3.190040e+05
mean 1.723646e+06
std 3.732326e+06
min -9.000000e+03
25% 1.050000e+04
50% 3.370000e+04
75% 9.140000e+04
max 9.999999e+06

Here we can see that the max value is insanely large, so it’s probably a code of some kind rather than a real income, and there are strange negative values too, which are probably also codes, not real values. Those negative values aren’t something we saw in our sample above, so we need to check on them!
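One way to check on suspicious values like these is boolean subsetting, which pulls out just the offending rows so you can inspect them. A sketch on invented data (the column name inctot matches the ACS; the numbers are made up):

```python
import pandas as pd

# Invented toy data with the same kinds of problem values
incomes = pd.DataFrame({"inctot": [10500, 33700, 9999999, -9000, 91400]})

# Isolate the suspicious rows so we can look at them directly
too_big = incomes[incomes["inctot"] >= 9999999]
negative = incomes[incomes["inctot"] < 0]

print(too_big)
print(negative)
```

On the real data you’d subset the full DataFrame the same way, so you can see what the other columns look like for those rows.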

We also see that age is not summarized by describe(), which tells us it’s being read not as a number but as a string, something else we need to address.
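A common fix for a numeric column stored as strings is pd.to_numeric with errors="coerce", which converts anything unparseable into np.nan. A sketch with invented values:

```python
import numpy as np
import pandas as pd

# Invented ages mimicking the ACS problem: mostly numbers, plus a label
ages = pd.Series(["62", "6", "18", "less than 1 year old", "11"])

# errors="coerce" turns unparseable entries into NaN
ages_numeric = pd.to_numeric(ages, errors="coerce")
print(ages_numeric)
```

One caveat: a label like “less than 1 year old” arguably carries real information (the person is 0 years old), so in practice you might .replace() that label with "0" before converting rather than letting it become NaN.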

And lastly, we can use .value_counts() to see all the possible values of a variable, like empstat. Note that you can pass dropna=False as an option to get counts of missing values too!

acs["empstat"].value_counts()

employed              148758
not in labor force    104676
n/a                    57843
unemployed              7727
Name: empstat, dtype: int64

Interesting – so now we know the variable has an “n/a” category, but no values pandas recognizes as missing (e.g. np.nan), so we’ll have to clean those up.
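Cleaning those up usually means mapping the string category to a value pandas actually treats as missing, for example with .replace(). A sketch on invented data (a small stand-in for empstat):

```python
import numpy as np
import pandas as pd

# Invented stand-in for the empstat column
empstat = pd.Series(
    ["employed", "n/a", "not in labor force", "n/a", "unemployed"]
)

# Map the string category to a real missing value
cleaned = empstat.replace("n/a", np.nan)

# Now dropna=False shows the missing values explicitly
print(cleaned.value_counts(dropna=False))
```

After this, tools like .isna(), .dropna(), and describe() will all treat those entries as missing, instead of as just another category.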

Finally, it’s really good to plot your data. As we discussed before in our plotting exercises, many distributions of data look the same when we calculate summary statistics, but plotting can be super helpful in showing us when things are wrong. Remember how both of these distributions have the same correlations, mean values, and standard deviations?
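As a quick sketch of why plotting helps (invented data, not the ACS): a histogram makes a cluster of sentinel codes like 9999999 jump out as an isolated bar, even though the same codes just quietly inflate a mean or standard deviation.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np

# Invented incomes plus a cluster of 9999999 sentinel codes
rng = np.random.default_rng(42)
incomes = np.concatenate([
    rng.normal(50_000, 20_000, 1_000),
    np.full(50, 9_999_999),
])

fig, ax = plt.subplots()
ax.hist(incomes, bins=50)
ax.set_xlabel("inctot")
ax.set_ylabel("count")
fig.savefig("inctot_hist.png")
# The sentinel codes appear as a lonely bar far to the right:
# hard to miss in a plot, easy to miss in a table of summary stats.
```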


And of course, if we want to be more formal about it, we can also see the famous Anscombe’s Quartet of distributions with the same linear regression fits: