Index Alignment Exercises, Discussion

If you didn’t see the problem coming and change your code, you probably did the following:

[1]:
import pandas as pd
attendees = pd.DataFrame({'names': ["Jill", "Kumar", "Zaira"],
                          'prizes': [0, 0, 0],
                          'arrival_order': [2, 1, 3]})
arrival_prizes = pd.Series([20, 10, 0])
arrival_prizes
[1]:
0    20
1    10
2     0
dtype: int64
[2]:
attendees = attendees.sort_values('arrival_order')
attendees
[2]:
   names  prizes  arrival_order
1  Kumar       0              1
0   Jill       0              2
2  Zaira       0              3
[3]:
attendees['prizes'] = attendees['prizes'] + arrival_prizes
attendees
[3]:
   names  prizes  arrival_order
1  Kumar      10              1
0   Jill      20              2
2  Zaira       0              3

The Problem

Uh oh… as you can see, 20 dollars went to the person who arrived second, and 10 dollars went to the person who arrived first… Why did that happen?

The answer is index alignment.

In numpy or R, when you try to add two arrays of the same length, the first entry of the first array is added to the first entry of the second array to create the first entry of the result; the second entry is added to the second entry, etc. For example:

[4]:
import numpy as np
np.array([1, 2, 3]) + np.array([1, 2, 3])
[4]:
array([2, 4, 6])

But that is NOT how pandas operates. Instead, pandas will always align data based on index values. And when you sort data, the index value associated with each row doesn’t change. Take a look at attendees: when we sorted the data, the rows were re-ordered, and each row’s index value moved along with it. Jill is now the second row, but her index value is still 0; Kumar is now the first row, but his index value is still 1.

[5]:
attendees
[5]:
   names  prizes  arrival_order
1  Kumar      10              1
0   Jill      20              2
2  Zaira       0              3

The result is that when you add arrival_prizes to attendees['prizes'], the entry of arrival_prizes with index value 0 (20 dollars) is added to Jill’s row, because her index value is 0, and the entry with index value 1 (10 dollars) is added to Kumar’s row, because his index value is 1.
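To see label-based alignment in isolation, here is a minimal sketch using two made-up Series (the names `a` and `b` and their values are just for illustration, not part of the exercise). Both hold three values, but their index labels are in opposite orders:

```python
import pandas as pd

# Two illustrative Series: same labels, but stored in opposite orders.
a = pd.Series([1, 2, 3], index=[0, 1, 2])
b = pd.Series([10, 20, 30], index=[2, 1, 0])

# pandas matches entries by index label, not by position:
# label 0 -> 1 + 30, label 1 -> 2 + 20, label 2 -> 3 + 10
total = a + b
print(total)
```

If pandas added position-by-position the way numpy does, the first entry would be 1 + 10; instead the entry labeled 0 comes out as 1 + 30.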

Forcing row-by-row alignment

Thankfully, it’s not too hard to avoid index alignment. When you reset an index, by default your data gets a new index where each row’s index value is its row number. To see this in action, let’s start our exercise over by creating our original data structures again:

[6]:
import pandas as pd
attendees = pd.DataFrame({'names': ["Jill", "Kumar", "Zaira"],
                          'prizes': [0, 0, 0],
                          'arrival_order': [2, 1, 3]})
arrival_prizes = pd.Series([20, 10, 0])

Now let’s sort attendees by arrival_order just like last time:

[7]:
attendees = attendees.sort_values('arrival_order')
attendees
[7]:
   names  prizes  arrival_order
1  Kumar       0              1
0   Jill       0              2
2  Zaira       0              3

But now, before we add arrival_prizes to attendees, let’s reset the index of attendees:

[8]:
attendees = attendees.reset_index()
attendees
[8]:
   index  names  prizes  arrival_order
0      1  Kumar       0              1
1      0   Jill       0              2
2      2  Zaira       0              3

As you can see, the new index (numbers in bold on left side) is now just the row numbers.

However, as you can see, the old index has now been moved over to create a new column. Confusingly, pandas likes to call that new column… index. Yeah, I know. It’s not the index, it’s just a column named index. 😫.

(To avoid this problem, you can use the drop=True option (reset_index(drop=True)), which throws away the old index instead of turning it into a column. But I wanted to show you the behavior if you don’t specify that so you aren’t confused when you see this for the first time.)
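For comparison, here is a small standalone sketch of the drop=True variant. It rebuilds the same data under a different name (`attendees_clean`, chosen here only so it doesn’t disturb the attendees DataFrame used in the cells of this walkthrough):

```python
import pandas as pd

# Same starting data as the exercise, under a separate illustrative name.
attendees_clean = pd.DataFrame({'names': ["Jill", "Kumar", "Zaira"],
                                'prizes': [0, 0, 0],
                                'arrival_order': [2, 1, 3]})
arrival_prizes = pd.Series([20, 10, 0])

# Sort, then reset the index and discard the old one (no "index" column).
attendees_clean = attendees_clean.sort_values('arrival_order')
attendees_clean = attendees_clean.reset_index(drop=True)

# The index is now 0, 1, 2 in arrival order, matching arrival_prizes.
attendees_clean['prizes'] = attendees_clean['prizes'] + arrival_prizes
print(attendees_clean)
```

With drop=True the result has only the three original columns, and the prizes land on rows in arrival order.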

OK, so now the index for attendees is just row numbers, and that’s also the organization we have for arrival_prizes:

[9]:
arrival_prizes
[9]:
0    20
1    10
2     0
dtype: int64

NOW we can add them together and they will add up row-by-row:

[10]:
attendees['prizes'] = attendees['prizes'] + arrival_prizes
attendees
[10]:
   index  names  prizes  arrival_order
0      1  Kumar      20              1
1      0   Jill      10              2
2      2  Zaira       0              3

And we get the result we expected!
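Resetting the index isn’t the only way to get row-by-row behavior. As an alternative that is not part of the original exercise solution, you can strip the index off one side by converting it to a numpy array with to_numpy(); a numpy array has no index, so pandas has nothing to align on and falls back to position. A minimal sketch, using the separate name `attendees_np` so it stands on its own:

```python
import pandas as pd

# Same starting data as the exercise, under a separate illustrative name.
attendees_np = pd.DataFrame({'names': ["Jill", "Kumar", "Zaira"],
                             'prizes': [0, 0, 0],
                             'arrival_order': [2, 1, 3]})
arrival_prizes = pd.Series([20, 10, 0])

attendees_np = attendees_np.sort_values('arrival_order')

# to_numpy() returns a plain array with no index, so the addition
# is done position-by-position rather than label-by-label.
attendees_np['prizes'] = attendees_np['prizes'] + arrival_prizes.to_numpy()
print(attendees_np)
```

Note that this approach leaves the original index values (1, 0, 2) on attendees_np, so any later operation against another Series would still align on those labels.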

When Index Alignment Comes Up

The other thing about index alignment is that it thankfully doesn’t come up all that often, which is why it’s often not emphasized in intro exercises. Different columns in the same DataFrame always share the same index, so when you execute operations using columns from the same DataFrame, index alignment looks like order alignment. This issue only comes up when you are doing an operation on Series that are not from the same DataFrame.
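Here is a small sketch of that point. The `raffle` column and its values are made up for illustration; everything else mirrors the exercise data. Because both columns live in the same DataFrame, sorting moves them together and the sum still follows each person:

```python
import pandas as pd

# 'raffle' is a hypothetical extra column added only for this illustration.
party = pd.DataFrame({'names': ["Jill", "Kumar", "Zaira"],
                      'prizes': [10, 20, 0],
                      'raffle': [5, 0, 15],
                      'arrival_order': [2, 1, 3]})

# Sorting re-orders the rows, but both columns keep the same index,
# so adding them still pairs each person's own values.
party = party.sort_values('arrival_order')
party['total'] = party['prizes'] + party['raffle']
print(party)
```

Jill’s total is still her own 10 + 5, no matter where her row ends up after the sort.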