Parquet¶
When it comes to storing tabular data in Python, there are a lot of choices, many of which we’ve talked about before (HDF5, CSV, dta, etc.). However, the storage format I think is best today (October 2020) is parquet.

parquet has a number of strengths:
- It preserves type information: Unlike a CSV, parquet files remember which columns are numeric, which are categorical, and so on, so when you re-load your data you can be assured it will look the same as it did when you saved it.
- It’s fast: That type information means that when loading, Python doesn’t have to try to infer the types of each column.
- It’s small: parquet compresses your data automatically (and no, that doesn’t slow it down; in fact it makes it faster. Moving data from disk to RAM is such a comparatively slow operation that it’s faster to load compressed data into RAM and then decompress it than to transfer larger uncompressed files).
- It’s portable: parquet is not a Python-specific format; it’s an Apache Software Foundation standard.
- It’s built for distributed computing: parquet was actually invented to support Hadoop distributed computing.
To use it, install fastparquet with conda install -c conda-forge fastparquet. (Note there’s a second engine out there, pyarrow, but I’ve found people have fewer problems with fastparquet.)
Using parquet¶
Here are some code examples for saving and loading parquet files:
[1]:
# Make a toy dataset
import pandas as pd
import numpy.random as npr
import random
import string

size = 100000

df = pd.DataFrame({'a': npr.randint(0, 10000, size=size),
                   'b': [''.join(random.choices(string.ascii_uppercase, k=10))
                         for i in range(size)]})
[2]:
df.head()
[2]:
|   | a    | b          |
|---|------|------------|
| 0 | 6740 | EWCSKOLHUV |
| 1 | 7301 | VFGZSHNCKL |
| 2 | 7981 | TAOQCCNKHH |
| 3 | 739  | QWFDEASRUM |
| 4 | 3250 | DQCFYWCFMX |
[3]:
# We can store "b" as a categorical (analogous to Factors in R).
# This is helpful, for reasons we'll discuss on Tuesday,
# but categoricals are only preserved by fastparquet.
df['b'] = pd.Categorical(df['b'])
df.dtypes
[3]:
a int64
b category
dtype: object
[4]:
# Save
df.to_parquet('/users/nick/desktop/test.parquet', engine='fastparquet')
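As an aside, pandas’ to_parquet also accepts a compression argument (the default is snappy, which requires the python-snappy package; gzip is usually available everywhere). A minimal sketch, if you want to compare the two on disk (file names here are just illustrative):

import os

# Save a second copy with gzip compression instead of the default
# (snappy), then compare the two files' sizes on disk
df.to_parquet('/users/nick/desktop/test_gzip.parquet',
              engine='fastparquet', compression='gzip')

for f in ['test.parquet', 'test_gzip.parquet']:
    path = os.path.join('/users/nick/desktop', f)
    print(f, os.path.getsize(path))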
[5]:
# Reload
df2 = pd.read_parquet('/users/nick/desktop/test.parquet', engine='fastparquet')
df2.head()
[5]:
|   | a    | b          |
|---|------|------------|
| 0 | 6740 | EWCSKOLHUV |
| 1 | 7301 | VFGZSHNCKL |
| 2 | 7981 | TAOQCCNKHH |
| 3 | 739  | QWFDEASRUM |
| 4 | 3250 | DQCFYWCFMX |
[6]:
# Note that column b is still a categorical.
df2.dtypes
[6]:
a int64
b category
dtype: object
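By contrast, a CSV round-trip throws that type information away: pandas has to re-infer every column’s type, and the categorical comes back as a plain object column. A quick sketch to illustrate (the path is again just illustrative):

# Round-trip through a CSV: type information is NOT preserved
df.to_csv('/users/nick/desktop/test.csv', index=False)
df3 = pd.read_csv('/users/nick/desktop/test.csv')
df3.dtypes  # 'b' is now a plain object column, not a categorical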
You can also easily load subsets of columns; because parquet stores data column-by-column, reading one column doesn’t require scanning the rest of the file:
[7]:
df2 = pd.read_parquet('/users/nick/desktop/test.parquet',
                      engine='fastparquet',
                      columns=['b'])
df2.head()
[7]:
|   | b          |
|---|------------|
| 0 | EWCSKOLHUV |
| 1 | VFGZSHNCKL |
| 2 | TAOQCCNKHH |
| 3 | QWFDEASRUM |
| 4 | DQCFYWCFMX |
Chunking with parquet¶
Parquet allows chunking, but not quite as easily as you can chunk a CSV.
First, you have to save your file in chunks of the size you want. Parquet always saves rows in chunks (called row groups), but by default each chunk holds 50,000,000 rows, which (if you’re chunking) may be too many:
[8]:
# Save into 50,000-row chunks,
# so we should get the file saved into two chunks.
df.to_parquet('/users/nick/desktop/test.parquet',
              engine='fastparquet',
              row_group_offsets=50000)
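If memory serves from the fastparquet docs, row_group_offsets can also be a list of starting row indices rather than a single chunk size, in case you want unevenly sized chunks. A sketch, under that assumption (written to a separate illustrative file so it doesn’t clobber the example above):

# Assumes row_group_offsets accepts a list of row-group start indices,
# which here would give chunks of 10,000, 50,000, and 40,000 rows
df.to_parquet('/users/nick/desktop/test_uneven.parquet',
              engine='fastparquet',
              row_group_offsets=[0, 10000, 60000])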
[9]:
# Then we have to read it in using the `fastparquet`
# library itself (there's no way to do this directly from
# pandas I'm afraid):
from fastparquet import ParquetFile

pf = ParquetFile('/users/nick/desktop/test.parquet')

# Iterates over row groups
for rg in pf.iter_row_groups():
    print(rg)
a b
0 6740 EWCSKOLHUV
1 7301 VFGZSHNCKL
2 7981 TAOQCCNKHH
3 739 QWFDEASRUM
4 3250 DQCFYWCFMX
... ... ...
49995 9232 FDRSPGYWQK
49996 3512 ZTVPEYODWM
49997 3604 JNCMXHQTDB
49998 3119 LDHOQVNKZP
49999 6682 PDQAPAIHLV
[50000 rows x 2 columns]
a b
0 1922 ZOGCWFRWHZ
1 8675 NRZMUNYKZO
2 4979 MCWILQYNYX
3 2481 FPEEHRFGKA
4 2829 AQYRGYKOOC
... ... ...
49995 4896 LWCZOQLANI
49996 7124 ARRWJCYPCO
49997 459 YELIRSTLGV
49998 5169 PBRGOPUJXJ
49999 8696 GAWEGKTKXI
[50000 rows x 2 columns]
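Before iterating, the ParquetFile object can also tell you what’s in the file. These attributes exist in the fastparquet versions I’ve used, though double-check against your version’s docs:

# Inspect the file's structure before deciding how to process it
print(len(pf.row_groups))  # number of row groups (chunks) in the file
print(pf.columns)          # column names
print(pf.dtypes)           # column types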
[10]:
# So you can filter and work with each group if you want:
all_rows_w_even_a = []
for rg in pf.iter_row_groups():
    rg = rg.loc[(rg['a'] % 2) == 0]
    all_rows_w_even_a.append(rg)

even_rows = pd.concat(all_rows_w_even_a)
even_rows.sample(10)
[10]:
|       | a    | b          |
|-------|------|------------|
| 16717 | 1704 | OBREANGVDA |
| 10232 | 3898 | FTLOBLTJAA |
| 5049  | 6750 | SFVMHGQEYN |
| 24924 | 6396 | JKLIAJOLZZ |
| 24333 | 4622 | PEVOFCIOXP |
| 45816 | 2446 | MXYZROVFIV |
| 12347 | 7040 | XVQUNBQDEK |
| 42041 | 6730 | HMAOJWGSTH |
| 13623 | 8652 | USBEMPWYAH |
| 44521 | 250  | WTWFZMRKRJ |
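Finally, if you only need some columns while chunking, I believe iter_row_groups accepts the same columns argument as the rest of the fastparquet API (worth confirming against your version’s docs), which keeps each chunk’s memory footprint down:

# Assumes iter_row_groups accepts a columns argument, as in the
# fastparquet versions I've used; reads only column 'a' per chunk
for rg in pf.iter_row_groups(columns=['a']):
    print(rg.shape)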