Editing Specific Locations

In our previous reading, we learned about tools for making global edits on a DataFrame. Those methods are useful for a lot of changes, but sometimes we need more precision than these generalized methods offer. For example, suppose Mozambique experienced a coup, so we want to set its Polity score to 5. We can’t use replace(16, 5), because that would also change the value for Ukraine, which also has a value of 16 in our data.

In these circumstances, we need to directly edit specific locations in our DataFrame.

Review: Editing Locations in Python and Numpy

Before diving into how we do this in pandas, though, it may be helpful to review how we do this with other data structures in Python. For example, let’s review how to edit an entry in a list with []:

[1]:

my_list = [1, 2, 3]
my_list[2]

[1]:

3

[2]:

my_list[2] = -42
my_list

[2]:

[1, 2, -42]

As we can see, when we write my_list[2] on the left side of the assignment operator (a single equals sign), then whatever we put on the right side of the assignment operator is being assigned into the entry with index 2 of the list.

As you may recall, this same logic can also be extended to two dimensions in numpy arrays. Consider the following:

[3]:

import numpy as np

my_array = np.array([[1, 2], [3, 4]])
my_array

[3]:

array([[1, 2],
       [3, 4]])

[4]:

# Edit row 1, column 1
# (recall numpy uses 0-based
# indexing, so `1, 1` is, in
# normal parlance, the second row
# and second column).

my_array[1, 1] = -42
my_array

[4]:

array([[  1,   2],
       [  3, -42]])
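
The same assignment syntax also accepts a boolean array in place of integer positions, which is the idea the pandas examples later in this reading build on. A minimal sketch, using an illustrative array (`demo_array`, a name that is not part of the examples above):

```python
import numpy as np

# Illustrative array with the same starting values as my_array above
demo_array = np.array([[1, 2], [3, 4]])

# Boolean array marking the entries that are greater than 2
mask = demo_array > 2

# Assign 0 into every entry where the mask is True
demo_array[mask] = 0
print(demo_array)
```

Here the mask picks out the cells to edit by their values, so we never have to write down their positions.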

Editing Locations in Pandas

Now that we’ve had that refresher, we can extend this logic to our pandas DataFrames. For example, using .iloc, we can make the same kinds of manipulations we just made with a numpy array:

[5]:

import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
df

[5]:

	a	b
0	1	5
1	2	6
2	3	7
3	4	8

[6]:

# Edit row 1, column 1.

df.iloc[1, 1] = -42
df

[6]:

	a	b
0	1	5
1	2	-42
2	3	7
3	4	8
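
As a reminder, .iloc selects by integer position, while .loc selects by label. In the DataFrame above the two coincide because the row labels happen to be 0, 1, 2, 3. The sketch below uses a hypothetical DataFrame (`demo`) with letter row labels, chosen only to make the difference visible:

```python
import pandas as pd

# Hypothetical example data: same values as df above, but with letter row labels
demo = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]},
    index=["w", "x", "y", "z"],
)

# Position-based: second row, second column
demo.iloc[1, 1] = -42

# Label-based: row labeled "y", column labeled "b"
demo.loc["y", "b"] = -99

print(demo)
```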

But this alone is only kinda useful. After all, our datasets are usually very large, and we rarely know in advance the row numbers of the cells we want to modify. Thankfully, in pandas we can pass a boolean vector to .loc to identify all rows that meet a certain condition and assign values to those specific cells. For example, suppose we wanted to set column b to 0 for all rows where column a is even. We could do:

[7]:

# Recall that x % 2 gives the remainder after
# dividing x by 2

df.loc[df.a % 2 == 0, "b"] = 0
df

[7]:

	a	b
0	1	5
1	2	0
2	3	7
3	4	0

See how the boolean vector in the first position selected the rows where a was even (where a % 2 is zero), the second entry ("b") selected the column b, and then we assigned 0 into those cells? It’s just a generalization of the kinds of assignments we did above with lists and numpy arrays, using boolean vectors and column labels instead of integer indices!
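
It can help to build the boolean vector as its own variable and look at it before assigning. The following is a small sketch on a fresh copy of the example data; the names `demo` and `is_even` are illustrative and not part of the reading's own code:

```python
import pandas as pd

# Fresh copy of the example data, so this block runs on its own
demo = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})

# One True/False entry per row: True where a is even
is_even = demo["a"] % 2 == 0
print(is_even)

# Summing a boolean Series counts the True values,
# i.e. the number of rows the assignment will touch
print(is_even.sum())

# Same edit as above, using the named mask
demo.loc[is_even, "b"] = 0
print(demo)
```

Checking the mask first is a cheap way to confirm that an edit will hit the rows you expect.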

Great! But now suppose we don’t just want to set certain values to a constant, but instead want to, say, double the values of b in all rows where a is odd. We can do that too by assigning values that “fit” into the cells on the left of the assignment operator (i.e., by making sure the values we assign have the same dimensions as the cells into which we’re trying to assign them):

[8]:

df.loc[df.a % 2 == 1, "b"] = df.loc[df.a % 2 == 1, "b"] * 2
df

[8]:

	a	b
0	1	10
1	2	0
2	3	14
3	4	0
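
Because the same row selection appears on both sides of that assignment, one possible shorthand is to store the mask in a variable and use an augmented assignment. This is a sketch of an equivalent form, not the reading's own code; `demo` and `is_odd` are illustrative names, and the starting values match the state of df before the doubling step:

```python
import pandas as pd

# Same values df held before the doubling step above
demo = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 0, 7, 0]})

# Store the row selection once so it cannot drift between the two sides
is_odd = demo["a"] % 2 == 1

# Read the selected cells, double them, and write them back
demo.loc[is_odd, "b"] *= 2
print(demo)
```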

Our Mozambique Edit

OK, so let’s circle back to our desire to edit our Polity IV value for Mozambique. How would we use this technique here?

[9]:

smallworld = pd.read_csv(
    "https://raw.githubusercontent.com/nickeubank/"
    "practicaldatascience/master/Example_Data/world-very-small.csv"
)
smallworld

[9]:

	country	region	gdppcap08	polityIV
0	Brazil	S. America	10296	18
1	Germany	W. Europe	35613	20
2	Mexico	N. America	14495	18
3	Mozambique	Africa	855	16
4	Russia	C&E Europe	16139	17
5	Ukraine	C&E Europe	7271	16

Well, we want to put our edit in the row for Mozambique (that’s the row selector, the first entry we pass to .loc) and in the column polityIV (the second entry), so:

[11]:

smallworld.loc[smallworld.country == "Mozambique", "polityIV"] = 5
smallworld

[11]:

	country	region	gdppcap08	polityIV
0	Brazil	S. America	10296	18
1	Germany	W. Europe	35613	20
2	Mexico	N. America	14495	18
3	Mozambique	Africa	855	5
4	Russia	C&E Europe	16139	17
5	Ukraine	C&E Europe	7271	16

Voila!

And that’s how you make precise edits in pandas.
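
One caution with this pattern: if the condition matches no rows (say, because of a misspelled country name), the assignment runs without error and changes nothing. A minimal sketch of a sanity check follows. It assumes a small stand-in DataFrame (`world`) typed in from the table above rather than the downloaded file, and the mask name `is_moz` is illustrative:

```python
import pandas as pd

# Stand-in for smallworld, with two columns copied from the table above
world = pd.DataFrame(
    {
        "country": ["Brazil", "Germany", "Mexico", "Mozambique", "Russia", "Ukraine"],
        "polityIV": [18, 20, 18, 16, 17, 16],
    }
)

is_moz = world["country"] == "Mozambique"

# Confirm exactly one row matches before editing
assert is_moz.sum() == 1

world.loc[is_moz, "polityIV"] = 5
print(world)
```

Note that Ukraine keeps its score of 16, which is exactly the precision a value-based replace could not give us.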

Warning: Chained Assignment

Note that we’ve made these edits with .loc to specify BOTH the subset of rows we want AND the column we want to edit. It is critically important that when doing these types of edits you use .loc to specify both your rows and columns at once. If instead you do these as two separate operations:

[12]:

smallworld[smallworld.country == "Mozambique"]["polityIV"] = 5

/var/folders/tj/s8f2_ks15h315z5thvtnhz8r0000gp/T/ipykernel_84190/2079543889.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  smallworld[smallworld.country == "Mozambique"]["polityIV"] = 5

You will get the SettingWithCopyWarning we discussed in our reading on views and copies. That’s because when you run smallworld[smallworld.country == "Mozambique"], pandas may return an entirely new DataFrame. The next part of the operation (changing the values of polityIV) then runs against that new DataFrame, not smallworld, and in the end your original smallworld DataFrame isn’t modified at all. This kind of chained assignment will SOMETIMES work, but not ALWAYS, which is why you get that warning.
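
To see the difference concretely, the sketch below runs both forms on a stand-in DataFrame (`world`, typed in from the table above; the other variable names are also illustrative). Selecting rows with a boolean vector returns a copy, so in this particular case the chained version leaves the original untouched. The warning is silenced here only to keep the output clean:

```python
import warnings

import pandas as pd

# Stand-in for smallworld, with two columns copied from the table above
world = pd.DataFrame(
    {
        "country": ["Brazil", "Germany", "Mexico", "Mozambique", "Russia", "Ukraine"],
        "polityIV": [18, 20, 18, 16, 17, 16],
    }
)

# Chained assignment: the edit lands on a temporary copy of the selected rows
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    world[world["country"] == "Mozambique"]["polityIV"] = 5
after_chained = world["polityIV"].tolist()
print(after_chained)  # Mozambique is still 16

# Single .loc call: the edit lands on world itself
world.loc[world["country"] == "Mozambique", "polityIV"] = 5
after_loc = world["polityIV"].tolist()
print(after_loc)  # Mozambique is now 5
```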