{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Index Alignment Exercises, Discussion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you didn't see the problem coming and change your code, you probably did the following:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 20\n", "1 10\n", "2 0\n", "dtype: int64" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "attendees = pd.DataFrame(\n", " {\n", " \"names\": [\"Jill\", \"Kumar\", \"Zaira\"],\n", " \"prizes\": [0, 0, 0],\n", " \"arrival_order\": [2, 1, 3],\n", " }\n", ")\n", "arrival_prizes = pd.Series([20, 10, 0])\n", "arrival_prizes" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namesprizesarrival_order
1Kumar01
0Jill02
2Zaira03
\n", "
" ], "text/plain": [ " names prizes arrival_order\n", "1 Kumar 0 1\n", "0 Jill 0 2\n", "2 Zaira 0 3" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "attendees = attendees.sort_values(\"arrival_order\")\n", "attendees" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namesprizesarrival_order
1Kumar101
0Jill202
2Zaira03
\n", "
" ], "text/plain": [ " names prizes arrival_order\n", "1 Kumar 10 1\n", "0 Jill 20 2\n", "2 Zaira 0 3" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "attendees[\"prizes\"] = attendees[\"prizes\"] + arrival_prizes\n", "attendees" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Problem \n", "\n", "Uh oh... as you can see, 20 dollars went to the person who arrived second, and 10 dollars go to the person who arrived first... Why did that happen?\n", "\n", "The answer is index alignment. \n", "\n", "In `numpy` or `R`, when you try and add two arrays of the same length, the first entry of the first array is added to the first entry of the second array to create the first entry of the result; the second entry is added to the second entry, etc. For example:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2, 4, 6])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "np.array([1, 2, 3]) + np.array([1, 2, 3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But that is NOT how `pandas` operates. Instead, `pandas` will always align data based on **index values**. And when you sort data, the index value associated with each row doesn't change. Take a look at `attendees`: when we sorted the data, the data was re-ordered, but so were the index values: Jill is now the second row, but her index value is still 0, Kumar is now the first row, but his index value is still 1. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namesprizesarrival_order
1Kumar101
0Jill202
2Zaira03
\n", "
" ], "text/plain": [ " names prizes arrival_order\n", "1 Kumar 10 1\n", "0 Jill 20 2\n", "2 Zaira 0 3" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "attendees" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is that when you combine `attendees['arrival_order']` with `prizes`, the entry of `prizes` with index value `0` (20 dollars) is added to Jill's row, and the entry with index value `1` (10 dollars) is added to Kumar's row. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Forcing row-by-row alignment\n", "\n", "Thankfully, it's not too hard to avoid index alignment. When you reset an index, by default your data gets a new index where each row's index value is it's row number. To see this in action, let's start our exercise over by creating our original data structures again: " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "attendees = pd.DataFrame(\n", " {\n", " \"names\": [\"Jill\", \"Kumar\", \"Zaira\"],\n", " \"prizes\": [0, 0, 0],\n", " \"arrival_order\": [2, 1, 3],\n", " }\n", ")\n", "arrival_prizes = pd.Series([20, 10, 0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's sort `attendees` by `arrival_order` just like last time: " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namesprizesarrival_order
1Kumar01
0Jill02
2Zaira03
\n", "
" ], "text/plain": [ " names prizes arrival_order\n", "1 Kumar 0 1\n", "0 Jill 0 2\n", "2 Zaira 0 3" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "attendees = attendees.sort_values(\"arrival_order\")\n", "attendees" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But now *before* we add `arrival_prizes` to `attendees`, let's reset the index of `arrival_prizes`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
indexnamesprizesarrival_order
01Kumar01
10Jill02
22Zaira03
\n", "
" ], "text/plain": [ " index names prizes arrival_order\n", "0 1 Kumar 0 1\n", "1 0 Jill 0 2\n", "2 2 Zaira 0 3" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "attendees = attendees.reset_index()\n", "attendees" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the new index (numbers in bold on left side) is now just the row numbers. \n", "\n", "However, as you'll see the old index has now been moved over to create a new column. Confusingly, `pandas` likes to call that new column... `index`. Yeah, I know. It's not *the* index, it's just a column *named* index. 😫.\n", "\n", "(To avoid this problem, you can use the `drop=True` option (`reset_index(drop=True)`). But I wanted to show you the behavior if you don't specify that so you aren't confused when you see this for the first time. )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, so now that the index for `attendees` is just row numbers, and that's also the organization we have for `arrival_prizes`: " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 20\n", "1 10\n", "2 0\n", "dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "arrival_prizes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NOW we can add them together and they will add up row-by-row: " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
indexnamesprizesarrival_order
01Kumar201
10Jill102
22Zaira03
\n", "
" ], "text/plain": [ " index names prizes arrival_order\n", "0 1 Kumar 20 1\n", "1 0 Jill 10 2\n", "2 2 Zaira 0 3" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "attendees[\"prizes\"] = attendees[\"prizes\"] + arrival_prizes\n", "attendees" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we get the result we expected!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## When Index Alignment Comes Up\n", "\n", "The other thing about index alignment is that it thankfully doesn't come up all that often. Indeed, that's why it's often not emphasized in intro exercises. **That's because different columns in the same DataFrame always share the same index, so when you execute operations using columns from the same DataFrame, index alignment looks like order alignment.** This issue only comes up with you are doing an operation on `Series` that are *not* from the same `DataFrame`. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 4 }