Parquet¶
When it comes to storing tabular data in Python, there are a lot of choices, many of which we’ve talked about before (HDF5, CSV, dta, etc.). However, the storage format I think is best today (October 2020) is parquet.

parquet has a number of strengths:
- It preserves type information: Unlike a CSV, parquet files remember which columns are numeric, which are categorical, and so on, so when you re-load your data you can be assured it will look the same as it did when you saved it.
- It’s fast: That type information means that when loading, Python doesn’t have to try to infer the types of each column.
- It’s small: parquet compresses your data automatically (and no, that doesn’t slow it down; in fact it makes it faster. Moving data from disk to RAM is such a comparatively slow operation that it’s faster to load compressed data into RAM and then decompress it than to transfer larger uncompressed files).
- It’s portable: parquet is not a Python-specific format; it’s an Apache Software Foundation standard.
- It’s built for distributed computing: parquet was actually invented to support Hadoop distributed computing.
To use it, install fastparquet with conda install -c conda-forge fastparquet. (Note there’s a second engine out there, pyarrow, but I’ve found people have fewer problems with fastparquet.)
Using parquet¶
Here are some code examples for saving and loading parquet files:
[1]:
# Make a toy dataset
import pandas as pd
import numpy.random as npr
import random
import string

size = 100000

df = pd.DataFrame({'a': npr.randint(0, 10000, size=size),
                   'b': [''.join(random.choices(string.ascii_uppercase, k=10))
                         for i in range(size)]})
[2]:
df.head()
[2]:
|   | a    | b          |
|---|------|------------|
| 0 | 6740 | EWCSKOLHUV |
| 1 | 7301 | VFGZSHNCKL |
| 2 | 7981 | TAOQCCNKHH |
| 3 | 739  | QWFDEASRUM |
| 4 | 3250 | DQCFYWCFMX |
[3]:
# We can store "b" as a categorical (analogous to Factors in R).
# This is helpful, for reasons we'll discuss on Tuesday,
# but categoricals are only preserved by fastparquet.
df['b'] = pd.Categorical(df['b'])
df.dtypes
[3]:
a int64
b category
dtype: object
[4]:
# Save
df.to_parquet('/users/nick/desktop/test.parquet', engine='fastparquet')
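As an aside, pandas’ to_parquet also accepts a compression argument (the default is snappy, which requires the python-snappy package; gzip is usually available everywhere). A minimal sketch, if you want to compare the two on disk (file names here are just illustrative):

import os

# Save a second copy with gzip compression instead of the default
# (snappy), then compare the two files' sizes on disk
df.to_parquet('/users/nick/desktop/test_gzip.parquet',
              engine='fastparquet', compression='gzip')

for f in ['test.parquet', 'test_gzip.parquet']:
    path = os.path.join('/users/nick/desktop', f)
    print(f, os.path.getsize(path))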
[5]:
# Reload
df2 = pd.read_parquet('/users/nick/desktop/test.parquet', engine='fastparquet')
df2.head()
[5]:
|   | a    | b          |
|---|------|------------|
| 0 | 6740 | EWCSKOLHUV |
| 1 | 7301 | VFGZSHNCKL |
| 2 | 7981 | TAOQCCNKHH |
| 3 | 739  | QWFDEASRUM |
| 4 | 3250 | DQCFYWCFMX |
[6]:
# Note that column b is still a categorical.
df2.dtypes
[6]:
a int64
b category
dtype: object
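By contrast, a CSV round-trip throws that type information away: pandas has to re-infer every column’s type, and the categorical comes back as a plain object column. A quick sketch to illustrate (the path is again just illustrative):

# Round-trip through a CSV: type information is NOT preserved
df.to_csv('/users/nick/desktop/test.csv', index=False)
df3 = pd.read_csv('/users/nick/desktop/test.csv')
df3.dtypes  # 'b' is now a plain object column, not a categorical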
You can also easily load subsets of columns; because parquet stores data column-by-column, reading one column doesn’t require scanning the rest of the file:
[7]:
df2 = pd.read_parquet('/users/nick/desktop/test.parquet',
                      engine='fastparquet',
                      columns=['b'])
df2.head()
[7]:
|   | b          |
|---|------------|
| 0 | EWCSKOLHUV |
| 1 | VFGZSHNCKL |
| 2 | TAOQCCNKHH |
| 3 | QWFDEASRUM |
| 4 | DQCFYWCFMX |
Chunking with parquet¶
Parquet allows chunking, but not quite as easily as you can chunk a CSV.
First, you have to save your file in chunks of the size you want. Parquet always saves rows in chunks (called row groups), but by default each chunk holds 50,000,000 rows, which (if you’re chunking) may be too many:
[8]:
# Save into 50,000-row chunks,
# so we should get the file saved into two chunks.
df.to_parquet('/users/nick/desktop/test.parquet',
              engine='fastparquet',
              row_group_offsets=50000)
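If memory serves from the fastparquet docs, row_group_offsets can also be a list of starting row indices rather than a single chunk size, in case you want unevenly sized chunks. A sketch, under that assumption (written to a separate illustrative file so it doesn’t clobber the example above):

# Assumes row_group_offsets accepts a list of row-group start indices,
# which here would give chunks of 10,000, 50,000, and 40,000 rows
df.to_parquet('/users/nick/desktop/test_uneven.parquet',
              engine='fastparquet',
              row_group_offsets=[0, 10000, 60000])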
[9]:
# Then we have to read it in using the `fastparquet`
# library itself (there's no way to do this directly from
# pandas I'm afraid):
from fastparquet import ParquetFile

pf = ParquetFile('/users/nick/desktop/test.parquet')

# Iterates over row groups
for rg in pf.iter_row_groups():
    print(rg)
a b
0 6740 EWCSKOLHUV
1 7301 VFGZSHNCKL
2 7981 TAOQCCNKHH
3 739 QWFDEASRUM
4 3250 DQCFYWCFMX
... ... ...
49995 9232 FDRSPGYWQK
49996 3512 ZTVPEYODWM
49997 3604 JNCMXHQTDB
49998 3119 LDHOQVNKZP
49999 6682 PDQAPAIHLV
[50000 rows x 2 columns]
a b
0 1922 ZOGCWFRWHZ
1 8675 NRZMUNYKZO
2 4979 MCWILQYNYX
3 2481 FPEEHRFGKA
4 2829 AQYRGYKOOC
... ... ...
49995 4896 LWCZOQLANI
49996 7124 ARRWJCYPCO
49997 459 YELIRSTLGV
49998 5169 PBRGOPUJXJ
49999 8696 GAWEGKTKXI
[50000 rows x 2 columns]
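Before iterating, the ParquetFile object can also tell you what’s in the file. These attributes exist in the fastparquet versions I’ve used, though double-check against your version’s docs:

# Inspect the file's structure before deciding how to process it
print(len(pf.row_groups))  # number of row groups (chunks) in the file
print(pf.columns)          # column names
print(pf.dtypes)           # column types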
[10]:
# So you can filter and work with each group if you want:
all_rows_w_even_a = []
for rg in pf.iter_row_groups():
    rg = rg.loc[(rg['a'] % 2) == 0]
    all_rows_w_even_a.append(rg)

even_rows = pd.concat(all_rows_w_even_a)
even_rows.sample(10)
[10]:
|       | a    | b          |
|-------|------|------------|
| 16717 | 1704 | OBREANGVDA |
| 10232 | 3898 | FTLOBLTJAA |
| 5049  | 6750 | SFVMHGQEYN |
| 24924 | 6396 | JKLIAJOLZZ |
| 24333 | 4622 | PEVOFCIOXP |
| 45816 | 2446 | MXYZROVFIV |
| 12347 | 7040 | XVQUNBQDEK |
| 42041 | 6730 | HMAOJWGSTH |
| 13623 | 8652 | USBEMPWYAH |
| 44521 | 250  | WTWFZMRKRJ |
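Finally, if you only need some columns while chunking, I believe iter_row_groups accepts the same columns argument as the rest of the fastparquet API (worth confirming against your version’s docs), which keeps each chunk’s memory footprint down:

# Assumes iter_row_groups accepts a columns argument, as in the
# fastparquet versions I've used; reads only column 'a' per chunk
for rg in pf.iter_row_groups(columns=['a']):
    print(rg.shape)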