Views and Copies in pandas

In our numpy exercises, we discussed in detail how taking a slice of an array gives you not an entirely new array, but rather a view of the original array. Views share the underlying data of the array from which they were spawned, meaning changes to one impact the other. pandas often exhibits this behavior too, but in much more nuanced and often deeply problematic ways.

Subsetting a Series or DataFrame in pandas will sometimes generate a view, and sometimes not. This differs from how views work in numpy: in numpy, the rules for when you get views and when you don’t are a little complicated, but they are consistent: certain operations (like a basic slice) will always return a view, and others (fancy indexing) will never return a view.
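As a quick reminder of the numpy rules, here is a minimal sketch (the array and variable names are made up for illustration). It uses np.shares_memory to check whether two arrays are backed by the same data:

```python
import numpy as np

arr = np.arange(6)

basic = arr[1:4]         # basic slice: always a view
fancy = arr[[1, 2, 3]]   # fancy (integer-list) indexing: always a copy

print(np.shares_memory(arr, basic))   # True
print(np.shares_memory(arr, fancy))   # False

# A change to the original shows up in the view, but not in the copy
arr[1] = -1
print(basic[0])   # -1
print(fancy[0])   # 1
```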

But in pandas, whether you get a view or not depends on the structure of the DataFrame and, if you are trying to modify a slice, the nature of the modification. To illustrate, here is an example where a slice returns a view, such that changes in the original dataframe df propagate to my_slice:

[1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':np.arange(4), 'b':np.arange(4)})
df
[1]:
a b
0 0 0
1 1 1
2 2 2
3 3 3
[2]:
my_slice = df.iloc[1:3,]
my_slice
[2]:
a b
1 1 1
2 2 2
[3]:
df.iloc[1,1] = -1
df
[3]:
a b
0 0 0
1 1 -1
2 2 2
3 3 3
[4]:
my_slice
[4]:
a b
1 1 -1
2 2 2

But in the next example, even though I’m doing the same kind of operation, the changes I make to df no longer propagate to my_slice.

(Why this happens isn’t actually important to understand, but for those who are interested: this is because in the first modification, I replaced one integer with another, so that operation could be done in the existing integer array; in the second, I try to put a floating point number into an integer array. This can’t be done, so a new floating point array was created, and that new array replaced the old one as column a in the original DataFrame, breaking the “view” connection.)

[5]:
df.iloc[1,0] = 3.14
df
[5]:
a b
0 0.00 0
1 3.14 -1
2 2.00 2
3 3.00 3
[6]:
my_slice
[6]:
a b
1 1 -1
2 2 2
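The mechanism described in the parenthetical above can be sketched with plain numpy arrays. This is only an analogy for what happens to a column's backing array, not pandas' actual internals, and the variable names are made up for illustration:

```python
import numpy as np

ints = np.arange(4)    # integer array, standing in for column a
view = ints[1:3]       # basic slice: shares memory with ints

# Replacing an integer with an integer happens in place,
# so the view sees the change.
ints[1] = -1
print(view[0])         # -1

# An integer array cannot hold 3.14. Left to itself, numpy
# silently truncates the value on assignment.
ints[2] = 3.14
print(ints[2])         # 3

# Storing the float faithfully requires a brand new float array.
floats = ints.astype(float)
floats[2] = 3.14
print(np.shares_memory(ints, floats))   # False

# The old view is still tied to the old integer array,
# so it knows nothing about the new float array.
print(view[1])         # 3
```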

Note that this behavior applies not just to row slices, but also column slices:

[7]:
df
[7]:
a b
0 0.00 0
1 3.14 -1
2 2.00 2
3 3.00 3
[8]:
# This initial change propagates
column_a = df['a']
df.iloc[0,0] = -42
column_a
[8]:
0   -42.00
1     3.14
2     2.00
3     3.00
Name: a, dtype: float64
[9]:
# But this does not
df.iloc[0,0] = "a"
df
[9]:
a b
0 a 0
1 3.14 -1
2 2 2
3 3 3
[10]:
column_a
[10]:
0   -42.00
1     3.14
2     2.00
3     3.00
Name: a, dtype: float64

How to deal with views in pandas

I won’t mince words: I think this behavior is deeply problematic, and I’ve long advocated for it to be changed. There is a plan to eventually fix this behavior, but that plan has been on the shelf for years now, so who knows when it might arrive.

The Good News

To help address this issue, pandas has a built-in alert system to inform you if you try to modify something that might be a view. For example:

[11]:
df = pd.DataFrame({'a':np.arange(4), 'b':['w', 'x', 'y', 'z']})
my_slice = df.iloc[1:3,]
my_slice
[11]:
a b
1 1 x
2 2 y
[12]:
my_slice.iloc[0,1] = 2
/Users/Nick/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py:543: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s

This alert works really well: it fires whenever you’re making a modification to something that might (or might not) be a view. Generally speaking, whenever you see this warning, the solution is simple: make a copy of the thing that might be a view so that you know that it is not:

[13]:
my_slice = my_slice.copy()
my_slice.iloc[0,1] = 2

The Bad News

But this is only a partial fix, because while this warns you when you are modifying something that may be a view, there is no system for alerting you to the possibility that changes to your original DataFrame may impact slices you may have saved.

[14]:
df = pd.DataFrame({'a':np.arange(4), 'b':['w', 'x', 'y', 'z']})
my_slice = df.iloc[1:3,]
my_slice
[14]:
a b
1 1 x
2 2 y
[15]:
df.iloc[1,1] = -1
my_slice
[15]:
a b
1 1 -1
2 2 y

So, what can you do about this? Honestly, I think the only reasonable answer is “if you make a slice for any purpose other than immediately analyzing it, you should add .copy() to that slice.” It’s ugly, it shouldn’t be necessary, but it’s the only way to always be safe.

[17]:
df = pd.DataFrame({'a':np.arange(4), 'b':['w', 'x', 'y', 'z']})
my_slice = df.iloc[1:3,].copy()
my_slice
[17]:
a b
1 1 x
2 2 y
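To see what that .copy() buys you, here is a minimal sketch that uses the same toy DataFrame. The replacement values ('changed' and 100) are made up for illustration. Because the slice was explicitly copied, changes made on either side stay on that side:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(4), 'b': ['w', 'x', 'y', 'z']})
my_slice = df.iloc[1:3,].copy()

# Modify the original: the copy is unaffected
df.iloc[1, 1] = 'changed'
print(my_slice.iloc[0, 1])   # x

# Modify the copy: the original is unaffected
my_slice.iloc[0, 0] = 100
print(df.iloc[1, 0])         # 1
```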

No, the problem doesn’t only emerge when you change a column’s type

Some readers may have noticed a pattern in the illustrations I’ve presented, and from them developed an intuition that a column will only lose its “view-ness” when one changes the datatype of that column. Though this will always cause problems, it is not the only place problems can arise. What follows isn’t something you need to know, but may be useful if you’re deeply interested.

In the examples above, each column was its own object, and so behaved independently. But this is not always the case in pandas. If a DataFrame is created from a single numpy matrix with multiple columns, pandas will try to be efficient by just keeping that matrix intact.

But as a result, if you do something to one of the columns tied to that matrix (like change its type), pandas will create new arrays to back all the columns that were once tied to the matrix. Consequently, a view of a single column can stop being a view because of changes to a different column. For example:

[19]:
my_matrix = np.arange(6).reshape(3,2)
my_matrix
[19]:
array([[0, 1],
       [2, 3],
       [4, 5]])
[20]:
df = pd.DataFrame(my_matrix, columns=['a', 'b'])
df
[20]:
a b
0 0 1
1 2 3
2 4 5
[21]:
# column_a starts off its life as a view
column_a = df['a']
df.iloc[0, 0] = -42
column_a
[21]:
0   -42
1     2
2     4
Name: a, dtype: int64
[22]:
# But if I make a change to column b...
df.loc[0, 'b'] = "new entry"
df
[22]:
a b
0 -42 new entry
1 2 3
2 4 5
[24]:
# Then the same type of change to column a of `df` will no longer
# be shared

df.iloc[0, 0] = 42
column_a
[24]:
0   -42
1     2
2     4
Name: a, dtype: int64
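The idea of several columns living inside one shared matrix can be sketched with plain numpy. This is only an analogy under the assumptions described in the text above, not a reproduction of pandas' internals, and names like block, new_a, and new_b are hypothetical:

```python
import numpy as np

# One 2-D array holds both "columns"
block = np.arange(6).reshape(3, 2)
col_a = block[:, 0]    # view into the shared block

print(np.shares_memory(block, col_a))   # True

block[0, 0] = -42
print(col_a[0])        # -42

# Stand-in for the split described above: once column b needs a
# different type, every column gets its own new backing array.
new_a = block[:, 0].copy()
new_b = block[:, 1].astype(object)
new_b[0] = "new entry"

# Later changes to the new column a never reach the old view
new_a[0] = 42
print(np.shares_memory(new_a, col_a))   # False
print(col_a[0])        # -42
```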

So, as noted before: it is best never to try to infer whether a subset of a DataFrame is a view or a copy. Treat it as uncertain until you have explicitly made a copy with .copy().
