Grouping in multiple dimensions with pivot_tables

Grouping in multiple dimensions with pivot_tables#

Using the groupby method, we were able to group data in the above examples according to year, and calculate some aggregate measure of the examples in each group. In our example, we calculated the total sales in each year. But what if we wanted to break it down further? Let’s consider if our sales data had an extra column, called ‘expertise’ as shown below. The new column represents the level of expertise of a sales representative as either “high” or “low” according to the judgement of their colleagues to see if sales were random or if their skill was correlated with their sales.

import pandas as pd

pd.set_option("mode.copy_on_write", True)

sales = pd.DataFrame(
    data={
        "employee": [
            "Katrina",
            "Guanyu",
            "Jan",
            "Roman",
            "Jacqueline",
            "Paola",
            "Esperanza",
            "Alaina",
            "Egweyn",
        ],
        "sales": [14, 17, 6, 12, 8, 3, 7, 15, 5],
        "expertise": ["high", "high", "low", "low", "low", "low", "low", "high", "low"],
        "year": [2018, 2019, 2020, 2018, 2020, 2019, 2019, 2020, 2020],
    }
)
sales
employee sales expertise year
0 Katrina 14 high 2018
1 Guanyu 17 high 2019
2 Jan 6 low 2020
3 Roman 12 low 2018
4 Jacqueline 8 low 2020
5 Paola 3 low 2019
6 Esperanza 7 low 2019
7 Alaina 15 high 2020
8 Egweyn 5 low 2020

This presents a new question: what were average sales per employee by year AND by the level of expertise. In this case, we don’t just want to know what the average sales were in 2018 across the employees, we want to know what the average sales in 2018 were for the “high” expertise employees and the “low” expertise employees, respectively. To accomplish this, we can use pivot_tables, as shown in the figure below:

Pivot Table

Now lets try that for ourselves:

avg_sales = sales.pivot_table(
    values="sales", index="year", columns="expertise", aggfunc="mean"
)
avg_sales
expertise high low
year
2018 14.0 12.000000
2019 17.0 5.000000
2020 15.0 6.333333

Summary#

Grouping data along either one characteristic (groupby) or two (or more) characteristics (pivot_table) can allow us to quickly aggregate our data into the format we need to be able to answer key data science questions. Understanding how to use these tools programmatically is a significant advantage for data preparation and analysis. We’ve talked quite a bit about preparing data up to this point. In the next lesson, you’ll explore how to query your data quickly and efficiently.