Reshaping Data#

One theme that we hope is emerging is that your data do not always appear in the form you wish them to be in for analysis. We’ve discussed concatenating, merging, grouping, and pivot_tables. Often we want to modify how data are presented and pandas offers a series of tools for doing just that. In this section, we’ll discuss a few key tools and

Tidy data#

Tidy data is a way of creating an organized, consistent structure for data. A dataset is tidy (and therefore not messy) if it conforms to three principles (from Hadley Wickham):

  1. Each variable forms a column.

  2. Each observation forms a row.

  3. Each type of observational unit forms a table.

The first two of these are the easier ones to understand. An observation could represent a patient and the variables could be that patient’s vital sign readings. An observation could be a product and the variables are the characteristics or measurements of that product including price, weight, or type, for example.

The third component of a tidy dataset, observational units, may take a moment’s discussion. Consider the example of a table of songs. Each song (and observation) may have multiple characteristics (variables) such as track length, title, music genre, artist, etc. What if we also had data about one particular song’s rank on the Billboard Top 100 list, every week for a year? It likely would not make sense to put these in the same table since it would lead to a lot of redundancy since we would need to list the song multiple times for each date to note its ranking.

There are certainly reasons for which Tidy data may not fit your needs perfectly, including if you need data in an alternative format for processing the data using tools that require alternative formats. However, when in doubt, starting with a Tidy data format is a good place to begin since it allows you to quickly launch your data analysis.

Tools for reshaping: melt, pivot, stack, and unstack#

But how do we change the shape of data in pandas tables? We have four main functions: pivot, melt (which is the opposite of pivot), stack and unstack (which is the opposite of stack). We’ll discuss each of these functions below.

Let’s start with a dataset that is not Tidy since there are variables in the columns (temp, heart_rate, and systolic_bp):

import pandas as pd

pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame(
    {
        "patient_name": ["Lekha", "Lekha", "Zhenzhen", "Zhenzhen", "Zhenzhen"],
        "reading": ["temp", "heart_rate", "temp", "heart_rate", "systolic_bp"],
        "reading_values": [101.5, 75, 98.3, 54, 115],
    }
)
df
patient_name reading reading_values
0 Lekha temp 101.5
1 Lekha heart_rate 75.0
2 Zhenzhen temp 98.3
3 Zhenzhen heart_rate 54.0
4 Zhenzhen systolic_bp 115.0

Let’s start with the pivot method. First off, this is NOT the same as pivot_table and has distinct uses which we previously discussed. The pivot method, on the other hand, allows us to reshape our data by making each variable into an individual column:

df_pivoted = df.pivot(index="patient_name", columns="reading", values="reading_values")
df_pivoted
reading heart_rate systolic_bp temp
patient_name
Lekha 75.0 NaN 101.5
Zhenzhen 54.0 115.0 98.3

Before “undoing” this application of pivot to demonstrate melt, we need to make sure the index is one of the columns, so we can use the reset_index() method to accomplish this:

df_pivoted = df_pivoted.reset_index()
df_pivoted
reading patient_name heart_rate systolic_bp temp
0 Lekha 75.0 NaN 101.5
1 Zhenzhen 54.0 115.0 98.3

Now, we can undo this process with the melt method as follows:

df_melted = pd.melt(
    df_pivoted,
    id_vars=["patient_name"],
    value_vars=["heart_rate", "temp", "systolic_bp"],
)
df_melted
patient_name reading value
0 Lekha heart_rate 75.0
1 Zhenzhen heart_rate 54.0
2 Lekha temp 101.5
3 Zhenzhen temp 98.3
4 Lekha systolic_bp NaN
5 Zhenzhen systolic_bp 115.0

This effectively reproduces the original dataset. You’ll notice the addition of a row with a NaN value. This is because the original data didn’t contain an entry for systolic_bp for Lekha. Because of the shape of the data, pivot is often described as going from long data to wide data and melt from wide data to long. Long here refers to the number of rows and wide to the number of columns.

Stack is another way of moving data from a wide to a long format, but by moving data from columns into an index as a multi-index. Let’s take a look at a related example to see what we mean:

df
patient_name reading reading_values
0 Lekha temp 101.5
1 Lekha heart_rate 75.0
2 Zhenzhen temp 98.3
3 Zhenzhen heart_rate 54.0
4 Zhenzhen systolic_bp 115.0
df_stacked = df.stack()
df_stacked
0  patient_name            Lekha
   reading                  temp
   reading_values          101.5
1  patient_name            Lekha
   reading            heart_rate
   reading_values           75.0
2  patient_name         Zhenzhen
   reading                  temp
   reading_values           98.3
3  patient_name         Zhenzhen
   reading            heart_rate
   reading_values           54.0
4  patient_name         Zhenzhen
   reading           systolic_bp
   reading_values          115.0
dtype: object

We can verify that the DataFrame has been transformed into a Series and that series as a multi-index:

type(df_stacked)
pandas.core.series.Series
df_stacked.index
MultiIndex([(0,   'patient_name'),
            (0,        'reading'),
            (0, 'reading_values'),
            (1,   'patient_name'),
            (1,        'reading'),
            (1, 'reading_values'),
            (2,   'patient_name'),
            (2,        'reading'),
            (2, 'reading_values'),
            (3,   'patient_name'),
            (3,        'reading'),
            (3, 'reading_values'),
            (4,   'patient_name'),
            (4,        'reading'),
            (4, 'reading_values')],
           )

We can then recover our original DataFrame with the unstack method:

df_unstacked = df_stacked.unstack()
df_unstacked
patient_name reading reading_values
0 Lekha temp 101.5
1 Lekha heart_rate 75.0
2 Zhenzhen temp 98.3
3 Zhenzhen heart_rate 54.0
4 Zhenzhen systolic_bp 115.0

It’s important to be aware that these reshaping operations exist because this is essentially what’s going on under the hood when more complex operations like pivot_table are applied. In fact, the operations of pivot_table are more or less a combination of groupby and unstack.

Producing Tidy data may require some effort, but the benefits will help you to stardardize your default data format allowing you to spend more time on data analysis and less time fussing with the data to prepare it for analysis.