Views and Copies in pandas

As we reviewed in our last reading, when we subset a numpy array, the result is not always a new array; sometimes what numpy returns is a view of the data in the original array.

Since pandas Series and DataFrames are backed by numpy arrays, it will probably come as no surprise that something similar sometimes happens in pandas. Unfortunately, while this behavior is relatively straightforward in numpy, in pandas there’s just no getting around the fact that it’s a hot mess.

The View/Copy Headache in pandas

In numpy, the rules for when you get views and when you don’t are a little complicated, but they are consistent: certain behaviors (like simple indexing) will always return a view, and others (fancy indexing) will never return a view.
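As a quick refresher on that numpy rule, here is a minimal sketch. The variable names are only illustrative, and `np.shares_memory` is used to check whether two arrays are backed by the same data:

```python
import numpy as np

arr = np.arange(6)

# Simple (slice) indexing: the result is a view onto arr's memory
simple = arr[1:4]
# Fancy (integer-list) indexing: the result is a fresh copy
fancy = arr[[1, 2, 3]]

print(np.shares_memory(arr, simple))  # True
print(np.shares_memory(arr, fancy))   # False

# A change to the original shows up in the view, but not in the copy
arr[1] = 100
print(simple[0])  # 100
print(fancy[0])   # 1
```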

But in pandas, whether you get a view or not—and whether changes made to a view will propagate back to the original DataFrame—depends on the structure and data types in the original DataFrame.

An Illustration of the Problem

To illustrate, here is an example where a slice returns a view, such that changes in the original DataFrame df propagate to my_slice:

[1]:
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": np.arange(4), "b": np.arange(4)})
df

[1]:
a b
0 0 0
1 1 1
2 2 2
3 3 3
[2]:
my_slice = df.iloc[1:3]
my_slice

[2]:
a b
1 1 1
2 2 2
[3]:
df.iloc[1, 1] = -1
df

[3]:
a b
0 0 0
1 1 -1
2 2 2
3 3 3
[4]:
my_slice

[4]:
a b
1 1 -1
2 2 2

Now observe what happens when we do a similar operation, but this time the changes we make to df no longer propagate to my_slice:

[5]:
df.iloc[1, 0] = 3.14
df

[5]:
a b
0 0.00 0
1 3.14 -1
2 2.00 2
3 3.00 3
[6]:
my_slice

[6]:
a b
1 1 -1
2 2 2

(Why this happens isn’t actually important to understand, but for those who are interested: in the first modification, I replaced one integer with another, so the operation could be done in the existing integer array. In the second, I tried to put a floating point number into an integer array. This can’t be done, so a new floating point array was created, and that new array replaced the old one as column a in the original DataFrame, breaking the “view” connection.)
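The same mechanism can be sketched in plain numpy. This is only an analogy for what is described above, not the code pandas actually runs internally, and the variable names are made up for illustration:

```python
import numpy as np

ints = np.arange(4)   # an integer array, like the original column
view = ints[1:3]      # a view into that array

# Replacing an integer with an integer happens in place, so the view sees it
ints[1] = -1
print(view[0])  # -1

# An integer array cannot store 3.14, so holding it requires a new float array
floats = ints.astype(float)
floats[1] = 3.14

# The new array has its own memory; the old view still points at the old data
print(np.shares_memory(ints, floats))  # False
print(view[0])  # -1
```

Once the float array takes the place of the integer one, anything still looking at the old integer data no longer sees updates.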

Note that this behavior applies not just to row slices, but also column slices:

[7]:
df

[7]:
a b
0 0.00 0
1 3.14 -1
2 2.00 2
3 3.00 3
[8]:
# This initial change propagates
column_a = df["a"]
df.iloc[0, 0] = -42
column_a

[8]:
0   -42.00
1     3.14
2     2.00
3     3.00
Name: a, dtype: float64
[9]:
# But this does not
df.iloc[0, 0] = "a"
df

[9]:
a b
0 a 0
1 3.14 -1
2 2.0 2
3 3.0 3
[10]:
column_a

[10]:
0   -42.00
1     3.14
2     2.00
3     3.00
Name: a, dtype: float64

How to deal with views in pandas

I won’t mince words: I think this behavior is deeply problematic, and I’ve long advocated for it to be changed. And indeed, there is a push to fix this behavior, but that plan has been on the shelf for years now, so who knows when it might arrive.

The Good News

To help address this issue, pandas has a built-in alert system, called the SettingWithCopyWarning, that will sometimes warn you when you’re in a situation that may cause problems. You can see it here:

[11]:
df = pd.DataFrame({"a": np.arange(4), "b": ["w", "x", "y", "z"]})
my_slice = df["a"]
my_slice


[11]:
0    0
1    1
2    2
3    3
Name: a, dtype: int64
[12]:
my_slice.iloc[1] = 2

/var/folders/fs/h_8_rwsn5hvg9mhp0txgc_s9v6191b/T/ipykernel_41268/1176285234.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  my_slice.iloc[1] = 2

Any time you see a SettingWithCopyWarning, go up to where the possible view was created (in this case, my_slice = df["a"]) and add a .copy():

[13]:
my_slice = df["a"].copy()
my_slice.iloc[1] = 2

The Bad News

The bad news is that the SettingWithCopyWarning will only flag one pattern where the copy-view problem crops up. Indeed, if you follow the link provided in the warning, you’ll see it wasn’t designed to address the copy-view problem writ large, but rather a narrower behavior where the user tries to change a subset of a DataFrame incorrectly (we’ll talk more about that in our coming readings). You’ll also notice we didn’t get a single SettingWithCopyWarning until the section where we started talking about that warning in particular (and I created an example designed to set it off).

So: if you see a SettingWithCopyWarning, do not ignore it. Find where you created the subset that may be a view or may be a copy, and add a .copy() so the warning goes away. But just because you don’t see that warning doesn’t mean you’re in the clear!

Which leads me to what I will admit is an infuriating piece of advice to have to offer: if you take a subset for any purpose other than immediately analyzing it, you should add .copy() to that subsetting. Seriously. When in doubt, .copy().
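As a minimal sketch of that habit, here is the earlier row-slice example with an explicit copy. The name safe_slice is just illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(4), "b": np.arange(4)})

# Explicit copy: safe_slice owns its data from this point on
safe_slice = df.iloc[1:3].copy()

# A change to the original no longer reaches the subset...
df.iloc[1, 1] = -1
print(safe_slice.loc[1, "b"])  # 1

# ...and a change to the subset does not reach the original
safe_slice.loc[1, "a"] = 500
print(df.loc[1, "a"])  # 1
```

The cost is some extra memory for the duplicated data, but in exchange you always know that the subset and the original are independent.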

An Aside: No, the problem doesn’t only emerge when you change the data type of a column

Some readers may have noticed a pattern in the illustrations I’ve presented, and from them developed an intuition that a column will only lose its “view-ness” when one changes the data type of that column. Though this will always cause problems, it is not the only place problems can arise. What follows isn’t something you need to know, but may be useful if you’re deeply interested.

In the examples above, each column was its own object, and so behaved independently. But this is not always the case in pandas. If a DataFrame is created from a single numpy matrix with multiple columns, pandas will try to be efficient by just keeping that matrix intact.

But as a result, if you do something to one of the columns tied to that matrix (like change its type), pandas will create new arrays to back all the columns that were once tied to the matrix. Consequently, a view of a single column can stop being a view due to changes to a different column. For example:

[14]:
my_matrix = np.arange(6).reshape(3, 2)
my_matrix

[14]:
array([[0, 1],
       [2, 3],
       [4, 5]])
[15]:
df = pd.DataFrame(my_matrix, columns=["a", "b"])
df

[15]:
a b
0 0 1
1 2 3
2 4 5
[16]:
# column_a starts off its life as a view
column_a = df["a"]
df.iloc[0, 0] = -42
column_a

[16]:
0   -42
1     2
2     4
Name: a, dtype: int64
[17]:
# But if I make a change to column b...
df.loc[0, "b"] = "new entry"
df

[17]:
a b
0 -42 new entry
1 2 3
2 4 5
[18]:
# Then the same type of change to column a of `df` will no longer
# be shared

df.iloc[0, 0] = 999999
column_a

[18]:
0   -42
1     2
2     4
Name: a, dtype: int64

So, as noted before: it is best never to try to infer whether a subset of a DataFrame is a view or a copy. Instead, make an explicit copy with .copy() so you know exactly what you have.