Changing Data Values to Clean Data¶
In a perfect world, our datasets would all arrive containing only accurate data formatted in exactly the way we need. Sadly, the world we live in is not perfect. Either because of clumsy data entry, bad translation between data formats, survey participations trying to be jerks, or the coding errors of other data scientists, messy data is a fact of life. As a result, a key part of being a data scientist is to learn to (a) seek out and identify data cleanliness problems, and (b) learn to correct them. These parts of data science are not nearly as cool or widely discussed as new machine learning packages or the latest learning algorithm, but as anyone who has done data science in the real world will tell you, cleaning (plus merging and reshaping, which we’ll cover soon) will take up the vast majority of your working life. So in this lesson, we’ll talk about tools in
pandas for identifying and correcting problems with your data!
Everything I said above may sound… well, awful, but here’s the good news: once you get into it, data cleaning begins to feel less like grunt work and more like being a detective, and it can actually be a lot of fun. In some of the examples below, I’ll be showing you problems in small toy datasets where you can easily see the problems with the data just by looking at the dataset. In real data, however, things aren’t so easy – when your data has hundreds of thousands of rows, or where data errors are subtle, learning to find and isolate problems can actually be quite fun and challenging!
Here are a few helpful functions we can demonstrate with our old friend the American Community Survey:
import pandas as pd acs = pd.read_stata("https://github.com/nickeubank/MIDS_Data/blob/master/US_AmericanCommunitySurvey/US_ACS_2017_10pct_sample.dta?raw=true") acs = acs[['year', 'sex', 'age', 'inctot', 'empstat']]
.sample(): while it’s tempting to use
.head() to look at the top of your data, it’s usually more valuable to look at a random sample of rows, since data is usually sorted so the first rows are unlikely to be representative:
import numpy.random as npr npr.seed(42) # Setting a seed so we get the same results # every time I run this notebook so # I can write about what what we see # and that won't change. acs.sample(5)
|214500||2017||male||18||0||not in labor force|
|28863||2017||female||less than 1 year old||9999999||n/a|
See? Already we can see some weird values in
inctot (9999999), and that ages cover quite a wide range, and that there are some
n/a values in
We can also use
.describe(). When using
describe(), look in particular at the maximum and minimum values:
Here we can see that the
max value is INSANELY large, and so is probably not a real value, and we can see there are crazy negative values too, which are probably also some kind of code, not a real value. Those negative values are not something we way in our sample above, so we need to check on those!
We also see that
age is not plotted by
describe(), which tells us it’s not being read as a number but rather a string, something else we need to address.
And lastly, we can use
.value_counts() to see all the possible values of certain variables, like
empstat. Note that you can pass
dropna=False as an option to get counts of missing values too!
employed 148758 not in labor force 104676 n/a 57843 unemployed 7727 Name: empstat, dtype: int64
Interesting – so now we know that the variable has a
n/a category, but no values
pandas recognizes as missing (e.g.
np.nan), so we’ll have to clean those up.
And lastly, it’s really good to plot your data. As we discussed before in our plotting exercises, many distributions of data look the same when we calculate summary statistics, but plotting can be super helpful in showing us when things are wrong. Remember how both of these distributions have the same correlations, mean values, and standard devaiations?
And of course, if we want to be more formal about it, we can also see the famous Anscombe’s Quartet of distributions with the same linear regression fits: