{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Views and Copies in pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we reviewed in [our last reading](views_and_copies_numpy_review.ipynb), when we subset a numpy array, the result is not always a new array; sometimes what numpy returns is a *view* of the data in the original array. \n", "\n", "Since pandas Series and DataFrames are backed by numpy arrays, it will probably come as no surprise that something similar sometimes happens in pandas. Unfortunately, while this behavior is relatively straightforward in numpy, in pandas there's just no getting around the fact that it's a hot mess. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The View/Copy Headache in pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In `numpy`, the rules for when you get views and when you don't are a little complicated, but they are consistent: certain behaviors (like simple indexing) will *always* return a view, and others (fancy indexing) will *never* return a view.\n", "\n", "But in `pandas`, whether you get a view or not—and whether changes made to a view will propagate back to the original DataFrame—depends on the structure and data types in the original DataFrame.\n", "\n", "\n", "### An Illustration of The Problem\n", "\n", "To illustrate, here is an example where a slice returns a view, such that changes in the original DataFrame `df` propagate to `my_slice`:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
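<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>a</th><th>b</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>0</td><td>0</td></tr>\n", "    <tr><th>1</th><td>1</td><td>1</td></tr>\n", "    <tr><th>2</th><td>2</td><td>2</td></tr>\n", "    <tr><th>3</th><td>3</td><td>3</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>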
" ], "text/plain": [ " a b\n", "0 0 0\n", "1 1 1\n", "2 2 2\n", "3 3 3" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "df = pd.DataFrame({\"a\": np.arange(4), \"b\": np.arange(4)})\n", "df\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
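<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>a</th><th>b</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>1</th><td>1</td><td>1</td></tr>\n", "    <tr><th>2</th><td>2</td><td>2</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>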
" ], "text/plain": [ " a b\n", "1 1 1\n", "2 2 2" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_slice = df.iloc[\n", " 1:3,\n", "]\n", "my_slice\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
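<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>a</th><th>b</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>0</td><td>0</td></tr>\n", "    <tr><th>1</th><td>1</td><td>-1</td></tr>\n", "    <tr><th>2</th><td>2</td><td>2</td></tr>\n", "    <tr><th>3</th><td>3</td><td>3</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>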
" ], "text/plain": [ " a b\n", "0 0 0\n", "1 1 -1\n", "2 2 2\n", "3 3 3" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[1, 1] = -1\n", "df\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
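<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>a</th><th>b</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>1</th><td>1</td><td>-1</td></tr>\n", "    <tr><th>2</th><td>2</td><td>2</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>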
" ], "text/plain": [ " a b\n", "1 1 -1\n", "2 2 2" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_slice\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now watch what happens when we modify `df` a second time; this change no longer propagates to `my_slice`:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
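<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>a</th><th>b</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>0.00</td><td>0</td></tr>\n", "    <tr><th>1</th><td>3.14</td><td>-1</td></tr>\n", "    <tr><th>2</th><td>2.00</td><td>2</td></tr>\n", "    <tr><th>3</th><td>3.00</td><td>3</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>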
" ], "text/plain": [ " a b\n", "0 0.00 0\n", "1 3.14 -1\n", "2 2.00 2\n", "3 3.00 3" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[1, 0] = 3.14\n", "df\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
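<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>a</th><th>b</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>1</th><td>1</td><td>-1</td></tr>\n", "    <tr><th>2</th><td>2</td><td>2</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>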
" ], "text/plain": [ " a b\n", "1 1 -1\n", "2 2 2" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_slice\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(Why this happens isn't important to understand, but for those who are interested: in the first modification, I replaced one integer with another, so the operation could be done in place in the existing integer array. In the second, I tried to put a floating point number into an integer array. That can't be done, so a new floating point array was created, and that new array replaced the old one as column `a` in the original DataFrame, breaking the \"view\" connection.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that this behavior applies not just to row slices, but also to column slices:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
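<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>a</th><th>b</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>0.00</td><td>0</td></tr>\n", "    <tr><th>1</th><td>3.14</td><td>-1</td></tr>\n", "    <tr><th>2</th><td>2.00</td><td>2</td></tr>\n", "    <tr><th>3</th><td>3.00</td><td>3</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>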
" ], "text/plain": [ " a b\n", "0 0.00 0\n", "1 3.14 -1\n", "2 2.00 2\n", "3 3.00 3" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 -42.00\n", "1 3.14\n", "2 2.00\n", "3 3.00\n", "Name: a, dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This initial change propagates\n", "column_a = df[\"a\"]\n", "df.iloc[0, 0] = -42\n", "column_a\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
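<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>a</th><th>b</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>a</td><td>0</td></tr>\n", "    <tr><th>1</th><td>3.14</td><td>-1</td></tr>\n", "    <tr><th>2</th><td>2.0</td><td>2</td></tr>\n", "    <tr><th>3</th><td>3.0</td><td>3</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>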
" ], "text/plain": [ " a b\n", "0 a 0\n", "1 3.14 -1\n", "2 2.0 2\n", "3 3.0 3" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# But this does not\n", "df.iloc[0, 0] = \"a\"\n", "df\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 -42.00\n", "1 3.14\n", "2 2.00\n", "3 3.00\n", "Name: a, dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "column_a\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to deal with views in pandas\n", "\n", "I won't mince words: I think this behavior is deeply problematic, and I've long advocated for it to be changed. And indeed, there *is* a push to fix this behavior, but that plan has been on the shelf for years now, [so who knows when it might arrive](https://github.com/pandas-dev/pandas/issues/36195#issuecomment-1137706149).\n", "\n", "### The Good News\n", "\n", "To help address this issue, `pandas` has a built-in alert system, called the `SettingWithCopyWarning`, that will **sometimes** warn you when you're in a situation that may cause problems. You can see it here:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 2\n", "3 3\n", "Name: a, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({\"a\": np.arange(4), \"b\": [\"w\", \"x\", \"y\", \"z\"]})\n", "my_slice = df[\"a\"]\n", "my_slice\n", "\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/fs/h_8_rwsn5hvg9mhp0txgc_s9v6191b/T/ipykernel_41268/1176285234.py:1: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame\n", "\n", "See the caveats in the documentation: 
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " my_slice.iloc[1] = 2\n" ] } ], "source": [ "my_slice.iloc[1] = 2\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Any time you see a `SettingWithCopyWarning`, go up to where the possible view was created (in this case, `my_slice = df[\"a\"]`) and add a `.copy()`:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "my_slice = df[\"a\"].copy()\n", "my_slice.iloc[1] = 2\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Bad News\n", "\n", "The bad news is that the `SettingWithCopyWarning` will only flag one pattern where the copy-view problem crops up. Indeed, if you follow the link provided in the warning, you'll see it wasn't designed to address the copy-view problem *writ large*, but rather a narrower problem where the user tries to change a subset of a DataFrame incorrectly (we'll talk more about that in our coming readings). In fact, you'll notice we didn't get a single `SettingWithCopyWarning` until the section where we started talking about that warning in particular (and I created an example designed to set it off). \n", "\n", "So: if you see a `SettingWithCopyWarning`, do **not** ignore it. Find where you may have created a view or may have created a copy and add a `.copy()` so the warning goes away. **But just because you don't see that warning doesn't mean you're in the clear!** \n", "\n", "Which leads me to what I will admit is an infuriating piece of advice to have to offer: **if you take a subset for any purpose other than immediately analyzing it, you should add `.copy()` to that subsetting.** Seriously. When in doubt, just `.copy()`." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## An Aside: No, the problem doesn't *only* emerge when you change the data type of a column\n", "\n", "Some readers may have noticed a pattern in the illustrations I've presented, and from them developed an intuition that a column will only lose its \"view-ness\" when one changes the data type of that column. Though this will always cause problems, it is not the only place problems can arise. What follows isn't something you *need* to know, but may be useful if you're deeply interested. \n", "\n", "In the examples above, each column was its own object, and so behaved independently. But this is not always the case in `pandas`. If a DataFrame is created from a single numpy matrix with multiple columns, `pandas` will try to be efficient by just keeping that matrix intact. \n", "\n", "But as a result, if you do something (like change the type) to *one* of the columns that is tied to that matrix, `pandas` will create new arrays to back *all* the columns that were once tied to the matrix. This means a view of a single column can stop being a view due to changes to a different column. For example:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0, 1],\n", " [2, 3],\n", " [4, 5]])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_matrix = np.arange(6).reshape(3, 2)\n", "my_matrix\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
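<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>a</th><th>b</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>0</td><td>1</td></tr>\n", "    <tr><th>1</th><td>2</td><td>3</td></tr>\n", "    <tr><th>2</th><td>4</td><td>5</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>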
" ], "text/plain": [ " a b\n", "0 0 1\n", "1 2 3\n", "2 4 5" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(my_matrix, columns=[\"a\", \"b\"])\n", "df\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 -42\n", "1 2\n", "2 4\n", "Name: a, dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# column_a starts off its life as a view\n", "column_a = df[\"a\"]\n", "df.iloc[0, 0] = -42\n", "column_a\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
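<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>a</th><th>b</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>-42</td><td>new entry</td></tr>\n", "    <tr><th>1</th><td>2</td><td>3</td></tr>\n", "    <tr><th>2</th><td>4</td><td>5</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>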
" ], "text/plain": [ " a b\n", "0 -42 new entry\n", "1 2 3\n", "2 4 5" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# But if I make a change to column b...\n", "df.loc[0, \"b\"] = \"new entry\"\n", "df\n" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 -42\n", "1 2\n", "2 4\n", "Name: a, dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Then the same type of change to column a of `df` will no longer\n", "# be shared\n", "\n", "df.iloc[0, 0] = 999999\n", "column_a\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, as noted before: it is best never to try to infer whether a subset of a DataFrame is a view or a copy. Assume nothing until you have *explicitly* made a copy with `.copy()`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.10.6 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "vscode": { "interpreter": { "hash": "718fed28bf9f8c7851519acf2fb923cd655120b36de3b67253eeb0428bd33d2d" } } }, "nbformat": 4, "nbformat_minor": 4 }