Grouping in multiple dimensions with pivot_tables

Grouping in multiple dimensions with `pivot_tables`#

Using the groupby method, we were able to group data in the above examples according to year, and calculate some aggregate measure of the examples in each group. In our example, we calculated the total sales in each year. But what if we wanted to break it down further? Let’s consider if our sales data had an extra column, called ‘expertise’ as shown below. The new column represents the level of expertise of a sales representative as either “high” or “low” according to the judgement of their colleagues to see if sales were random or if their skill was correlated with their sales.

import pandas as pd

pd.set_option("mode.copy_on_write", True)

sales = pd.DataFrame(
    data={
        "employee": [
            "Katrina",
            "Guanyu",
            "Jan",
            "Roman",
            "Jacqueline",
            "Paola",
            "Esperanza",
            "Alaina",
            "Egweyn",
        ],
        "sales": [14, 17, 6, 12, 8, 3, 7, 15, 5],
        "expertise": ["high", "high", "low", "low", "low", "low", "low", "high", "low"],
        "year": [2018, 2019, 2020, 2018, 2020, 2019, 2019, 2020, 2020],
    }
)
sales

	employee	sales	expertise	year
0	Katrina	14	high	2018
1	Guanyu	17	high	2019
2	Jan	6	low	2020
3	Roman	12	low	2018
4	Jacqueline	8	low	2020
5	Paola	3	low	2019
6	Esperanza	7	low	2019
7	Alaina	15	high	2020
8	Egweyn	5	low	2020

This presents a new question: what were average sales per employee by year AND by the level of expertise. In this case, we don’t just want to know what the average sales were in 2018 across the employees, we want to know what the average sales in 2018 were for the “high” expertise employees and the “low” expertise employees, respectively. To accomplish this, we can use pivot_tables, as shown in the figure below:

Pivot Table

Now lets try that for ourselves:

avg_sales = sales.pivot_table(
    values="sales", index="year", columns="expertise", aggfunc="mean"
)
avg_sales

expertise	high	low
year
2018	14.0	12.000000
2019	17.0	5.000000
2020	15.0	6.333333

Summary#

Grouping data along either one characteristic (groupby) or two (or more) characteristics (pivot_table) can allow us to quickly aggregate our data into the format we need to be able to answer key data science questions. Understanding how to use these tools programmatically is a significant advantage for data preparation and analysis. We’ve talked quite a bit about preparing data up to this point. In the next lesson, you’ll explore how to query your data quickly and efficiently.

Grouping in multiple dimensions with pivot_tables

Contents

Grouping in multiple dimensions with pivot_tables#

Summary#

Grouping in multiple dimensions with `pivot_tables`#