{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Parquet\n", "\n", "When it comes to storing tabular data in Python, there are a lot of choices, many of which we've talked about before (HDF5, CSV, dta, etc.). However, the strorage format I think it best today (October 2020) is `parquet`.\n", "\n", "`parquet` has a number of strengths:\n", "\n", "- **It preserves type information**: Unlike a CSV, `parquet` files remember what columns are numeric, which are categorical, etc. etc., so when you re-load your data you can be assured it will look the same as it did when you saved it. \n", "- **It's fast:** That type information means when loading, Python doesn't have to try and figure out the types of each column.\n", "- **It's small:** `parquet` compresses your data automatically (and no, that doesn't slow it down -- it fact it makes it faster. The reason is that getting data from memory is such a comparatively slow operation, it's faster to load compressed data to RAM and then decompress it than to transfer larger uncompressed files). \n", "- **It's portable:** `parquet` is not a Python-specific format -- it's an Apache Software Foundation standard.\n", "- **It's built for distributed computing:** `parquet` was actually invented to support Hadoop distributed computing. \n", "\n", "To use it, install `fastparquet` with `conda install -c conda-forge fastparquet`. (Note there's a second engine out there, `pyarrow`, but I've found people have fewer problems with `fastparquet`). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using parquet\n", "\n", "Here's some code examples for saving and loading `parquet` files:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Make a toy dataset\n", "\n", "import pandas as pd\n", "import numpy.random as npr\n", "import random\n", "import string\n", "\n", "size = 100000\n", "\n", "df = pd.DataFrame({'a': npr.randint(0, 10000, size=size), \n", " 'b': [''.join(random.choices(string.ascii_uppercase, k=10))\n", " for i in range(size)]} )" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
06740EWCSKOLHUV
17301VFGZSHNCKL
27981TAOQCCNKHH
3739QWFDEASRUM
43250DQCFYWCFMX
\n", "
" ], "text/plain": [ " a b\n", "0 6740 EWCSKOLHUV\n", "1 7301 VFGZSHNCKL\n", "2 7981 TAOQCCNKHH\n", "3 739 QWFDEASRUM\n", "4 3250 DQCFYWCFMX" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a int64\n", "b category\n", "dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We can store \"b\" as a categorical (analogous to Factors in R). \n", "# This is helpful, for reasons we'll discuss on Tuesday,\n", "# but categoricals are only preserved by fastparquet. \n", "df['b'] = pd.Categorical(df['b'])\n", "df.dtypes" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Save\n", "df.to_parquet('/users/nick/desktop/test.parquet', engine='fastparquet')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
06740EWCSKOLHUV
17301VFGZSHNCKL
27981TAOQCCNKHH
3739QWFDEASRUM
43250DQCFYWCFMX
\n", "
" ], "text/plain": [ " a b\n", "0 6740 EWCSKOLHUV\n", "1 7301 VFGZSHNCKL\n", "2 7981 TAOQCCNKHH\n", "3 739 QWFDEASRUM\n", "4 3250 DQCFYWCFMX" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Reload\n", "df2 = pd.read_parquet('/users/nick/desktop/test.parquet', engine='fastparquet')\n", "df2.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a int64\n", "b category\n", "dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Note that column b is still a categorical. \n", "df2.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also easily load subsets of columns:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
b
0EWCSKOLHUV
1VFGZSHNCKL
2TAOQCCNKHH
3QWFDEASRUM
4DQCFYWCFMX
\n", "
" ], "text/plain": [ " b\n", "0 EWCSKOLHUV\n", "1 VFGZSHNCKL\n", "2 TAOQCCNKHH\n", "3 QWFDEASRUM\n", "4 DQCFYWCFMX" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = pd.read_parquet('/users/nick/desktop/test.parquet', \n", " engine='fastparquet',\n", " columns=['b'])\n", "df2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chunking with parquet\n", "\n", "Parquet allows chunking, but not quite as easily as you can chunk a csv. \n", "\n", "First, you have to save your file into chunks of a size you want. Parquet always saves rows as chunks, but by default each chunk has 50,000,000 rows, which (if you're chunking) may be too many:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Save into 50,000 row chunks, \n", "# so we should get file saved into two chunks. \n", "\n", "df.to_parquet('/users/nick/desktop/test.parquet', \n", " engine='fastparquet', \n", " row_group_offsets=50000)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b\n", "0 6740 EWCSKOLHUV\n", "1 7301 VFGZSHNCKL\n", "2 7981 TAOQCCNKHH\n", "3 739 QWFDEASRUM\n", "4 3250 DQCFYWCFMX\n", "... ... ...\n", "49995 9232 FDRSPGYWQK\n", "49996 3512 ZTVPEYODWM\n", "49997 3604 JNCMXHQTDB\n", "49998 3119 LDHOQVNKZP\n", "49999 6682 PDQAPAIHLV\n", "\n", "[50000 rows x 2 columns]\n", " a b\n", "0 1922 ZOGCWFRWHZ\n", "1 8675 NRZMUNYKZO\n", "2 4979 MCWILQYNYX\n", "3 2481 FPEEHRFGKA\n", "4 2829 AQYRGYKOOC\n", "... ... ...\n", "49995 4896 LWCZOQLANI\n", "49996 7124 ARRWJCYPCO\n", "49997 459 YELIRSTLGV\n", "49998 5169 PBRGOPUJXJ\n", "49999 8696 GAWEGKTKXI\n", "\n", "[50000 rows x 2 columns]\n" ] } ], "source": [ "# Then we have to read it in using the `fastparquet` \n", "# library itself (there's no way to do this directly from \n", "# pandas I'm afraid):\n", "\n", "from fastparquet import ParquetFile\n", "pf = ParquetFile('/users/nick/desktop/test.parquet')\n", "\n", "# Iterates over row groups\n", "for rg in pf.iter_row_groups():\n", " print(rg)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
167171704OBREANGVDA
102323898FTLOBLTJAA
50496750SFVMHGQEYN
249246396JKLIAJOLZZ
243334622PEVOFCIOXP
458162446MXYZROVFIV
123477040XVQUNBQDEK
420416730HMAOJWGSTH
136238652USBEMPWYAH
44521250WTWFZMRKRJ
\n", "
" ], "text/plain": [ " a b\n", "16717 1704 OBREANGVDA\n", "10232 3898 FTLOBLTJAA\n", "5049 6750 SFVMHGQEYN\n", "24924 6396 JKLIAJOLZZ\n", "24333 4622 PEVOFCIOXP\n", "45816 2446 MXYZROVFIV\n", "12347 7040 XVQUNBQDEK\n", "42041 6730 HMAOJWGSTH\n", "13623 8652 USBEMPWYAH\n", "44521 250 WTWFZMRKRJ" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# So you can filter and work with each group if you want:\n", "all_rows_w_even_a = []\n", "for rg in pf.iter_row_groups():\n", " rg = rg.loc[(rg['a'] % 2) == 0]\n", " all_rows_w_even_a.append(rg)\n", "\n", "even_rows = pd.concat(all_rows_w_even_a)\n", "even_rows.sample(10)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.8" } }, "nbformat": 4, "nbformat_minor": 4 }