Fixing Data Cleanliness Problems

Once you find a problem in your data, the next step is to correct it, and there are (broadly speaking) two ways to do so in pandas: either by using a general cleaning function, or by directly editing specific entries. In this section, I’ll provide an overview of how to do both!

Cleaning Functions

Because we so often have to fix “bad” values in our data, pandas comes with a range of built-in tools for data cleaning. I won’t try to cover all of them, but here are a few of the most important to know.

.replace()

Probably the first go-to tool for fixing data issues is the .replace() method, which, as the name suggests, finds specific values in a Series and replaces them with other values. For example, let’s begin with this little dataset we’ve used a few times before with countries, their regions, GDP per capita, and Polity IV scores (a measure of how close countries are to being liberal democracies):

[1]:
import pandas as pd
smallworld = pd.read_csv('https://raw.githubusercontent.com/nickeubank/practicaldatascience/master/Example_Data/world-very-small.csv')
smallworld
[1]:
country region gdppcap08 polityIV
0 Brazil S. America 10296 18
1 Germany W. Europe 35613 20
2 Mexico N. America 14495 18
3 Mozambique Africa 855 16
4 Russia C&E Europe 16139 17
5 Ukraine C&E Europe 7271 16

Now suppose that we want to change country names from their anglicized names to what their own citizens would call them. Here we start by replacing “Germany” with “Deutschland”:

[2]:
smallworld['country'] = smallworld.country.replace('Germany', 'Deutschland')
smallworld
[2]:
country region gdppcap08 polityIV
0 Brazil S. America 10296 18
1 Deutschland W. Europe 35613 20
2 Mexico N. America 14495 18
3 Mozambique Africa 855 16
4 Russia C&E Europe 16139 17
5 Ukraine C&E Europe 7271 16

Voila! And this works for numbers too – so suppose we learned that Polity had mis-coded all scores of 15 into 16, so we want to change them back:

[3]:
smallworld['polityIV'] = smallworld.polityIV.replace(16, 15)
smallworld
[3]:
country region gdppcap08 polityIV
0 Brazil S. America 10296 18
1 Deutschland W. Europe 35613 20
2 Mexico N. America 14495 18
3 Mozambique Africa 855 15
4 Russia C&E Europe 16139 17
5 Ukraine C&E Europe 7271 15

Note here that .replace() replaces all entries with a given value (here, both Mozambique and Ukraine). If you wanted to, say, just change Mozambique’s score because of a recent coup, you’d need a different tool, which we’ll discuss below.

Note that .replace() will also accept a dictionary instead of two distinct values. In that case, it will replace each occurrence of the dictionary’s keys with their associated values. For example, we can continue changing our country names to their native spellings this way:

[4]:
changes = {'Mexico': 'México',
           'Russia': 'Российская Федерация',
           'Ukraine': 'Україна'}
smallworld['country'] = smallworld.country.replace(changes)
smallworld
[4]:
country region gdppcap08 polityIV
0 Brazil S. America 10296 18
1 Deutschland W. Europe 35613 20
2 México N. America 14495 18
3 Mozambique Africa 855 15
4 Российская Федерация C&E Europe 16139 17
5 Україна C&E Europe 7271 15

.str. Methods

Of all the formats of data you’ll come across, none are more prone to problems than strings. There are just SO MANY ways to get problems in strings – capitalization issues, differences in spellings, differences in accents, etc. So pandas has a special set of string methods that you access through the .str accessor, including:

  • .str.lower() / .str.upper(): Change the case of strings (there are lots of these formatting methods…)

  • .str.contains(): Look for a substring, return True if found

  • .str.isnumeric(): Returns True if the string is made up of numeric characters (e.g. “10”), and False if not (e.g. “Nick”)

  • .str.strip(): Removes whitespace from the start and end of strings (a common cleanliness problem)

For a full list, head over to the official pandas docs here!
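
To get a feel for a couple of these, here’s a quick illustrative sketch (my own example, not part of the original workflow) run against the region column of our data, plus a throwaway Series with some stray spaces:

smallworld.region.str.lower()                    # 's. america', 'w. europe', 'n. america', ...
smallworld.region.str.contains('Europe')         # True for the European rows, False for the rest
pd.Series(['  Brazil ', 'Mexico']).str.strip()   # '  Brazil ' becomes 'Brazil'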

In addition, there is one special .str method to know about: .str.replace(). This method is similar to the regular .replace() method, except that it works on the text within each string – matching substrings and regular expression patterns – rather than only exact, full-value matches.

So suppose, for example, I wanted to remove all the . after the compass directions in the regions column of our data. I could use replace and write out a dictionary like:

[5]:
changes = {'S. America': 'S America',
           'W. Europe': 'W Europe'}

But that’s really inefficient, as there may be dozens of different regions with periods in a big dataset! So instead I can write:

[6]:
smallworld['region'] = smallworld.region.str.replace(r"\.", "", regex=True)
smallworld
[6]:
country region gdppcap08 polityIV
0 Brazil S America 10296 18
1 Deutschland W Europe 35613 20
2 México N America 14495 18
3 Mozambique Africa 855 15
4 Российская Федерация C&E Europe 16139 17
5 Україна C&E Europe 7271 15

Note I had to put a backslash before the . and pass regex=True (in newer versions of pandas, .str.replace() only treats the pattern as a regular expression if you ask for it explicitly). Because the pattern is a regular expression, some characters (like periods, parentheses, asterisks, etc.) have special meanings, and writing the pattern as a raw string (r"\.") ensures Python passes the backslash through to the regular expression unchanged. Review this reading on strings if that isn’t familiar.
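
To see why that backslash matters, here’s a tiny sketch (my own example, not from the original) comparing the escaped and unescaped patterns on a throwaway Series:

s = pd.Series(['S. America', 'W. Europe'])
s.str.replace('.', '', regex=True)    # '.' matches ANY character, so this deletes everything
s.str.replace(r'\.', '', regex=True)  # '\.' matches only literal periods: 'S America', 'W Europe'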

Setting Individual Values

These methods are useful for a lot of changes, but sometimes we need more precision than these generalized methods provide. For example, suppose Mozambique experienced a coup, so we want to set its Polity score to 5. We can’t use replace(15, 5), because that would also change the value for Ukraine, which also has a score of 15.

In these circumstances, we need to directly access certain cells.

Before diving into how we do this in pandas, though, it may be helpful to recall how we do this with other data structures in Python. For example, here’s how to edit an entry in a list with []:

[7]:
my_list = [1, 2, 3]
my_list[2]
[7]:
3
[8]:
my_list[2] = -42
my_list
[8]:
[1, 2, -42]

As we can see, when we write my_list[2] on the left side of the assignment operator (a single equals sign), whatever we put on the right side of the assignment operator is assigned into the entry at index 2 of the list.

As you may recall, this same logic can also be extended to two dimensions in numpy arrays. Consider the following:

[9]:
import numpy as np
my_array = np.array([[1, 2], [3, 4]])
my_array
[9]:
array([[1, 2],
       [3, 4]])
[10]:
my_array[1,1] = -42
my_array
[10]:
array([[  1,   2],
       [  3, -42]])

Now that we’ve had that refresher, we can extend this logic to our pandas DataFrames. For example, using .iloc, we can make the same kinds of manipulations we just made with a numpy array:

[11]:
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
df
[11]:
a b
0 1 5
1 2 6
2 3 7
3 4 8
[12]:
df.iloc[1,1] = -42
df
[12]:
a b
0 1 5
1 2 -42
2 3 7
3 4 8

But this alone is only kinda useful. After all, our datasets are usually very large, and we rarely want to make modifications to cells whose indices we already know. But thankfully, in pandas we can pass boolean vectors to .loc to identify all rows that meet certain conditions and assign values to those specific cells. For example, suppose we wanted to set b to 0 for all rows where a is even. We could do:

[13]:
# Recall that x % 2 gives the remainder after
# dividing x by 2

df.loc[df.a % 2 == 0, 'b'] = 0
df
[13]:
a b
0 1 5
1 2 0
2 3 7
3 4 0

See how the boolean vector in the first position selected the rows where a was even (i.e. where a % 2 is zero), the second entry ('b') selected the column b, and then we assigned 0 into those cells? It’s just a generalization of the assignments we did above with lists and numpy arrays, using boolean vectors and column labels instead of integer indices!

Great! But now suppose we don’t just want to set certain values to a constant, but instead want to, say, double the values of b in the rows where a is odd. We can do that too, by assigning values that “fit” into the cells on the left of the assignment operator (i.e. by making sure the values we assign have the same dimensions as the cells into which we’re trying to assign them):

[14]:
df.loc[df.a % 2 == 1, 'b'] = df.loc[df.a % 2 == 1, 'b'] * 2
df
[14]:
a b
0 1 10
1 2 0
2 3 14
3 4 0

Our Mozambique Edit

OK, so let’s circle back to our desire to edit our Polity IV value for Mozambique. How would we use this technique here?

[15]:
smallworld
[15]:
country region gdppcap08 polityIV
0 Brazil S America 10296 18
1 Deutschland W Europe 35613 20
2 México N America 14495 18
3 Mozambique Africa 855 15
4 Российская Федерация C&E Europe 16139 17
5 Україна C&E Europe 7271 15

Well, we want to make our edit in the row for Mozambique (which we can select with a boolean vector, just as above) and in the column polityIV, so:

[16]:
smallworld.loc[smallworld.country == "Mozambique", "polityIV"] = 5
smallworld
[16]:
country region gdppcap08 polityIV
0 Brazil S America 10296 18
1 Deutschland W Europe 35613 20
2 México N America 14495 18
3 Mozambique Africa 855 5
4 Российская Федерация C&E Europe 16139 17
5 Україна C&E Europe 7271 15

Voila!

And that’s how you make precise edits in pandas.

Missing Data

A last note on editing missing values. Let’s suppose we have no polity score for Mozambique:

[17]:
smallworld.loc[smallworld.country == "Mozambique", "polityIV"] = np.nan
smallworld
[17]:
country region gdppcap08 polityIV
0 Brazil S America 10296 18.0
1 Deutschland W Europe 35613 20.0
2 México N America 14495 18.0
3 Mozambique Africa 855 NaN
4 Российская Федерация C&E Europe 16139 17.0
5 Україна C&E Europe 7271 15.0

Now if we wanted to subset the rows with no polity scores, you might think we’d do:

[18]:
smallworld.loc[smallworld.polityIV == np.nan]
[18]:
country region gdppcap08 polityIV

Unfortunately this returns nothing, and that’s because of a quirk of missing values: in most programming languages, missing values don’t equal themselves:

[19]:
np.nan == np.nan
[19]:
False

Note this is true of np.nan, but not of None (which is a regular Python object, and so does equal itself):

[20]:
None == None
[20]:
True

So we need an entirely different set of tools for testing for the existence of missing values: pd.isnull() and pd.notnull():

[21]:
pd.isnull(np.nan)
[21]:
True
[22]:
pd.isnull(None)
[22]:
True
[23]:
pd.notnull(np.nan)
[23]:
False

Yay! So now we can subset like this:

[24]:
smallworld.loc[pd.isnull(smallworld.polityIV)]
[24]:
country region gdppcap08 polityIV
3 Mozambique Africa 855 NaN

Or by calling them as methods on the Series (at the end) rather than as functions (up front):

[25]:
smallworld.loc[smallworld.polityIV.isnull()]
[25]:
country region gdppcap08 polityIV
3 Mozambique Africa 855 NaN

In addition, pandas has a few other missing-data helpers you may find useful:

  • .fillna(x): Replace missing values with the value x

  • .dropna(): Drop rows with missing values.
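
For example, here’s a quick sketch (again my own example, not from the original) of how these behave on our data as it stands, with Mozambique’s polity score missing. Both return a new object rather than modifying smallworld in place, so you’d assign the result back if you wanted to keep it:

smallworld['polityIV'].fillna(-99)   # Mozambique's NaN becomes -99; all other rows are unchanged
smallworld.dropna()                  # drops the Mozambique row, the only one with a missing value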

And that’s how, once we’ve found problems, we can fix problems in our data!