Editing Specific Locations¶
In our previous reading, we learned about tools for making global edits on a DataFrame. Those methods are useful for a lot of changes, but sometimes we need more precision than we get from these generalized methods. For example, suppose Mozambique experienced a coup, so we want to set its Polity score to 5. We can’t use replace(15, 5)
, because that would also change the value for Russia, which also has a value of 15 in our data.
In these circumstances, we need to directly edit specific locations in our DataFrame.
Review: Editing Locations in Python and Numpy¶
Before diving into how we do this in pandas, though, it may be helpful to review how we do this with other data structures in Python. For example, let’s review how to edit an entry in a list with []
:
[1]:
my_list = [1, 2, 3]
my_list[2]
[1]:
3
[2]:
my_list[2] = -42
my_list
[2]:
[1, 2, -42]
As we can see, when we write my_list[2]
on the left side of the assignment operator (a single equals sign), then whatever we put on the right side of the assignment operator is being assigned into the entry with index 2 of the list.
As you may recall, this same logic can also be extended to two dimensions in numpy
arrays. Consider the following:
[3]:
import numpy as np
my_array = np.array([[1, 2], [3, 4]])
my_array
[3]:
array([[1, 2],
[3, 4]])
[4]:
# Edit row 1, column 1
# (recall pandas uses 0-based
# indexing, so `1, 1` is, in
# normal parlance, the second row
# and second column.
my_array[1, 1] = -42
my_array
[4]:
array([[ 1, 2],
[ 3, -42]])
Editing Locations in Pandas¶
Now that we’ve had that refresher, we can extend this logic to our pandas
DataFrames. For example, using .iloc
, we can make the same kinds of manipulations we just made with a numpy array
:
[5]:
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
df
[5]:
a | b | |
---|---|---|
0 | 1 | 5 |
1 | 2 | 6 |
2 | 3 | 7 |
3 | 4 | 8 |
[6]:
# Edit row 1, column 1.
df.iloc[1, 1] = -42
df
[6]:
a | b | |
---|---|---|
0 | 1 | 5 |
1 | 2 | -42 |
2 | 3 | 7 |
3 | 4 | 8 |
But this alone is only kinda useful. After all, our datasets are usually very large, and we rarely want to make modifications to cells whose row numbers we already know. But thankfully, in pandas
we can pass boolean vectors to .loc
to identify all rows that meet certain conditions and assign values to those specific cells. For example, suppose we wanted to set column b
to 0 for all rows where column a
is even. We could do:
[7]:
# Recall that x % 2 gives the remainder after
# dividing x by 2
df.loc[df.a % 2 == 0, "b"] = 0
df
[7]:
a | b | |
---|---|---|
0 | 1 | 5 |
1 | 2 | 0 |
2 | 3 | 7 |
3 | 4 | 0 |
See how the boolean vector on the left subset for rows where a
was even (the value of a % 2
is zero), and the second entry (b
) subset for the column b
, then we assigned 0 into those cells? It’s just a generalization of the kinds of assignments we did above with lists and numpy arrays, just using boolean vectors and column labels instead of indices!
Great! But now suppose we don’t just want to set certain values to a constant, but instead we wanted to, say, double all the values in odd rows. We can do that to by assigning values that “fit” into the cells on the left of the assignment operator (i.e. by making sure the values we assign have the same dimensions as the cells into which we’re trying to assign them):
[8]:
df.loc[df.a % 2 == 1, "b"] = df.loc[df.a % 2 == 1, "b"] * 2
df
[8]:
a | b | |
---|---|---|
0 | 1 | 10 |
1 | 2 | 0 |
2 | 3 | 14 |
3 | 4 | 0 |
Our Mozambique Edit¶
OK, so let’s circle back to our desire to edit our Polity IV value for Mozambique. How would we use this technique here?
[9]:
smallworld = pd.read_csv(
"https://raw.githubusercontent.com/nickeubank/"
"practicaldatascience/master/Example_Data/world-very-small.csv"
)
smallworld
[9]:
country | region | gdppcap08 | polityIV | |
---|---|---|---|---|
0 | Brazil | S. America | 10296 | 18 |
1 | Germany | W. Europe | 35613 | 20 |
2 | Mexico | N. America | 14495 | 18 |
3 | Mozambique | Africa | 855 | 16 |
4 | Russia | C&E Europe | 16139 | 17 |
5 | Ukraine | C&E Europe | 7271 | 16 |
[10]:
smallworld
[10]:
country | region | gdppcap08 | polityIV | |
---|---|---|---|---|
0 | Brazil | S. America | 10296 | 18 |
1 | Germany | W. Europe | 35613 | 20 |
2 | Mexico | N. America | 14495 | 18 |
3 | Mozambique | Africa | 855 | 16 |
4 | Russia | C&E Europe | 16139 | 17 |
5 | Ukraine | C&E Europe | 7271 | 16 |
Well, we want to put our edit in the row for Mozambique (that’s the row index), and put our edit in the column polityIV
, so:
[11]:
smallworld.loc[smallworld.country == "Mozambique", "polityIV"] = 5
smallworld
[11]:
country | region | gdppcap08 | polityIV | |
---|---|---|---|---|
0 | Brazil | S. America | 10296 | 18 |
1 | Germany | W. Europe | 35613 | 20 |
2 | Mexico | N. America | 14495 | 18 |
3 | Mozambique | Africa | 855 | 5 |
4 | Russia | C&E Europe | 16139 | 17 |
5 | Ukraine | C&E Europe | 7271 | 16 |
Voila!
And that’s how you make precise edits in pandas.
Warning: Chained Assignment¶
Note that we’ve made these edits with .loc
to specify BOTH the subset of rows we want AND the column we want to edit. It is critically important that when doing these types of edits you use .loc
to specify both your rows and columns at once. If instead you do these as two separate operations:
[12]:
smallworld[smallworld.country == "Mozambique"]["polityIV"] = 5
/var/folders/tj/s8f2_ks15h315z5thvtnhz8r0000gp/T/ipykernel_88045/2079543889.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
smallworld[smallworld.country == "Mozambique"]["polityIV"] = 5
You will get the SettingWithCopyWarning
we discussed in our reading on views and copies. That’s because it’s possible that when you run smallworld[smallworld.country == "Mozambique"]
, pandas may return an entirely new DataFrame, and the next part of the operation (changing the values of polityIV
) will run against a completely new DataFrame, not smallworld
, and in the end your original smallworld
DataFrame won’t end up being modified at all. This kind of chained assignment will SOMETIMES work, but not ALWAYS, which is why you get that warning.
Categoricals¶
A special note is needed to address one oddity of cleaning variables that are of type Categorical
: Categorical
variables have a lot of advantages, but they have one annoying feature: you can’t insert an arbitrary value into a Categorical. By definition, a Categorical variable can only take on a set number of values.
To illustrate, let’s make our region
variable into a Categorical
(in a bigger dataset, it’d be a good candidate for a Categorical, since it only takes on a few string unique values that, in a big dataset, would be repeated a lot):
[13]:
smallworld["region"] = pd.Categorical(smallworld["region"])
Now suppose that Mexico has gotten tired of being in the same region as the United States, and would much prefer to be considered a part of Central America than North America.
Normally we could just do smallworld.loc[smallworld["country"] == "Mexico", "region"] = "C America"
. But if I try that here, because C America
isn’t one of the established values for the Categorical variable, we get this error (with the middle section omitted for brevity):
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/Users/Nick/github/mids_coursera/class_3/week_3/32_cleaning_editing_specific_locations.ipynb Cell 29 in <cell line: 1>()
----> 1 smallworld.loc[smallworld["country"] == "Mexico", "region"] = "C America"
File ~/opt/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py:716, in _LocationIndexer.__setitem__(self, key, value)
713 self._has_valid_setitem_indexer(key)
715 iloc = self if self.name == "iloc" else self.obj.iloc
--> 716 iloc._setitem_with_indexer(indexer, value, self.name)
[...]
File ~/opt/miniconda3/lib/python3.9/site-packages/pandas/core/arrays/categorical.py:1484, in Categorical._validate_scalar(self, fill_value)
1482 fill_value = self._unbox_scalar(fill_value)
1483 else:
-> 1484 raise TypeError(
1485 "Cannot setitem on a Categorical with a new "
1486 f"category ({fill_value}), set the categories first"
1487 )
1488 return fill_value
TypeError: Cannot setitem on a Categorical with a new category (C America), set the categories first
So what do we do?
One option is to modify the allowable categories for the variable with .cat.add_categories()
, then making the edit again:
[ ]:
smallworld["region"] = smallworld["region"].cat.add_categories(["C America"])
smallworld.loc[smallworld["country"] == "Mexico", "region"] = "C America"
But if you aren’t in a situation where you are REALLY short on RAM or where you REALLY care about the read speed of the variable in question, it’s just easier to re-cast the variable to object
dtype and then work with it normally:
[15]:
smallworld["region"] = smallworld["region"].astype("object")
smallworld.loc[smallworld["country"] == "Mexico", "region"] = "C America"
Review¶
Editing individual values can be done using the same type of indexing used for lists and numpy arrays.
.loc
allows for logical tests to enable editing of rows with specific properties.Beware “chained assignment” of the form
df[df["col1"] == "value"]["col2"] = 5
, and heed the warnings it generates.Categoricals
can be annoying when it comes to editing individual values—to set a value to something that’s not currently a category, use.cat.add_categories
or just re-cast the variable to anobject
dtype.