PyArrow: An Alternative to Numpy as Pandas Backend


As discussed in our previous readings, by default pandas data structures are basically numpy arrays wrapped up with some additional functionality. Indeed, we can see this by accessing the .values property on our pandas data structures:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [3.14, 2.718, 1.41421]})

# .values hands back the underlying data as a numpy.ndarray
df.values

Starting in pandas 2.0, however, it is possible to change how pandas data is stored in the background — instead of storing data in numpy arrays, pandas can now also store data in Arrow arrays using the pyarrow library.

Why A New Backend?

Arrow arrays are functionally very similar to numpy arrays, but with a few differences behind the scenes.

The first (and, as I understand it, the motivation for the creation of Arrow) is that it’s designed to be more interoperable with other programming languages. Basically, the Arrow array format was invented because different programming languages organize the 1s and 0s underlying arrays differently in computer memory. As a result, moving arrays between programming languages like Python, R, or Julia requires reformatting data. With Arrow arrays, by contrast, every language reads the same memory layout, so data can move between them without that conversion step. In other words, it’s a way of storing arrays in memory that is programming language agnostic.

And the dream is that by reducing the overhead to moving between programming languages, fights over which language is “best” would come to an end, and people would move between R, Python, Julia, and who knows what else effortlessly (I’ll admit I’m deeply dubious that will ever happen, but one can dream, right?!).

Arrow is developed as a project of the Apache Software Foundation (which is why it’s sometimes called Apache Arrow), and is the in-memory analogue of the Parquet file format, another Apache project, for storing arrays to disk, something we’ll talk about in a future reading.

String Performance

But honestly, that’s not the main reason that pandas users are fond of Arrow — the main reason is that Arrow arrays have a string datatype that is far faster and more memory efficient than storing strings in object dtypes. In other words, the Arrow backend has become another hack, of sorts, that sacrifices some of the flexibility of object dtypes for better performance (like the category dtype) in a specific setting — here, data that contains strings.

Indeed, this performance difference is so great that starting in pandas 3.0, strings will be silently stored in an Arrow-backed string dtype by default whenever pyarrow is installed (though they will keep behaving the way you’ve come to know and love from numpy-backed pandas).

Using PyArrow

The uses of PyArrow for pandas are evolving quickly, so if you’re really interested you may simply wish to go read the pandas docs on the topic.

But to give you a quick sense of things, here’s an example of how you can convert a numpy-backed pandas DataFrame into an Arrow-backed DataFrame:

df = pd.DataFrame({"a": [1, 2, 3], "b": [3.14, 2.718, 1.41421]})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       3 non-null      int64  
 1   b       3 non-null      float64
dtypes: float64(1), int64(1)
memory usage: 180.0 bytes

df.dtypes

a      int64
b    float64
dtype: object

df_w_pyarrow_backend = df.convert_dtypes(dtype_backend="pyarrow")
df_w_pyarrow_backend

   a        b
0  1  3.14000
1  2  2.71800
2  3  1.41421

df_w_pyarrow_backend.dtypes

a     int64[pyarrow]
b    double[pyarrow]
dtype: object

And if you want to have pandas read data directly into Arrow arrays when using a pd.read_... function, simply add the keyword argument dtype_backend="pyarrow" (e.g., df = pd.read_csv("path_to_file.csv", dtype_backend="pyarrow")).

Is This Something To Worry About Now?

Meh, probably not.

The one situation where I might use pyarrow now is if I were working with LOTS of text data. In that case, there are substantial performance returns to using pyarrow.

But I probably wouldn’t convert all my data into Arrow arrays. As noted above, starting in pandas 3.0, string data will automagically use Arrow string dtypes in the background in a way that doesn’t change how anything works from the perspective of the user. But while this doesn’t become the default behavior till pandas 3.0, it can be enabled early starting with pandas 2.1 using the option:

pd.options.future.infer_string = True
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [3.14, 2.718, 1.41421], "c": ["hello", "world", "hamburger"]}
)
df.dtypes

a                    int64
b                  float64
c    string[pyarrow_numpy]
dtype: object

But we did want to make sure to tell you about this so you won’t be surprised if you read about it on the internet!