{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Parquet\n",
    "\n",
    "When it comes to storing tabular data in Python, there are a lot of choices, many of which we've talked about before (HDF5, CSV, dta, etc.). However, the strorage format I think it best today (October 2020) is `parquet`.\n",
    "\n",
    "`parquet` has a number of strengths:\n",
    "\n",
    "- **It preserves type information**: Unlike a CSV, `parquet` files remember what columns are numeric, which are categorical, etc. etc., so when you re-load your data you can be assured it will look the same as it did when you saved it. \n",
    "- **It's fast:** That type information means when loading, Python doesn't have to try and figure out the types of each column.\n",
    "- **It's small:** `parquet` compresses your data automatically (and no, that doesn't slow it down -- it fact it makes it faster. The reason is that getting data from memory is such a comparatively slow operation, it's faster to load compressed data to RAM and then decompress it than to transfer larger uncompressed files). \n",
    "- **It's portable:** `parquet` is not a Python-specific format -- it's an Apache Software Foundation standard.\n",
    "- **It's built for distributed computing:** `parquet` was actually invented to support Hadoop distributed computing. \n",
    "\n",
    "To use it, install `fastparquet` with `conda install -c conda-forge fastparquet`. (Note there's a second engine out there, `pyarrow`, but I've found people have fewer problems with `fastparquet`). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using parquet\n",
    "\n",
    "Here's some code examples for saving and loading `parquet` files:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make a toy dataset\n",
    "\n",
    "import pandas as pd\n",
    "import numpy.random as npr\n",
    "import random\n",
    "import string\n",
    "\n",
    "size = 100000\n",
    "\n",
    "df = pd.DataFrame({'a': npr.randint(0, 10000, size=size), \n",
    "                   'b': [''.join(random.choices(string.ascii_uppercase, k=10))\n",
    "                         for i in range(size)]} )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>6740</td>\n",
       "      <td>EWCSKOLHUV</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>7301</td>\n",
       "      <td>VFGZSHNCKL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>7981</td>\n",
       "      <td>TAOQCCNKHH</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>739</td>\n",
       "      <td>QWFDEASRUM</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3250</td>\n",
       "      <td>DQCFYWCFMX</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      a           b\n",
       "0  6740  EWCSKOLHUV\n",
       "1  7301  VFGZSHNCKL\n",
       "2  7981  TAOQCCNKHH\n",
       "3   739  QWFDEASRUM\n",
       "4  3250  DQCFYWCFMX"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a       int64\n",
       "b    category\n",
       "dtype: object"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# We can store \"b\" as a categorical (analogous to Factors in R). \n",
    "# This is helpful, for reasons we'll discuss on Tuesday,\n",
    "# but categoricals are only preserved by fastparquet. \n",
    "df['b'] = pd.Categorical(df['b'])\n",
    "df.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save\n",
    "df.to_parquet('/users/nick/desktop/test.parquet', engine='fastparquet')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>6740</td>\n",
       "      <td>EWCSKOLHUV</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>7301</td>\n",
       "      <td>VFGZSHNCKL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>7981</td>\n",
       "      <td>TAOQCCNKHH</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>739</td>\n",
       "      <td>QWFDEASRUM</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3250</td>\n",
       "      <td>DQCFYWCFMX</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      a           b\n",
       "0  6740  EWCSKOLHUV\n",
       "1  7301  VFGZSHNCKL\n",
       "2  7981  TAOQCCNKHH\n",
       "3   739  QWFDEASRUM\n",
       "4  3250  DQCFYWCFMX"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Reload\n",
    "df2 = pd.read_parquet('/users/nick/desktop/test.parquet', engine='fastparquet')\n",
    "df2.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a       int64\n",
       "b    category\n",
       "dtype: object"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Note that column b is still a categorical. \n",
    "df2.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also easily load subsets of columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>b</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>EWCSKOLHUV</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>VFGZSHNCKL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>TAOQCCNKHH</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>QWFDEASRUM</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>DQCFYWCFMX</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            b\n",
       "0  EWCSKOLHUV\n",
       "1  VFGZSHNCKL\n",
       "2  TAOQCCNKHH\n",
       "3  QWFDEASRUM\n",
       "4  DQCFYWCFMX"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2 = pd.read_parquet('/users/nick/desktop/test.parquet', \n",
    "                      engine='fastparquet',\n",
    "                      columns=['b'])\n",
    "df2.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chunking with parquet\n",
    "\n",
    "Parquet allows chunking, but not quite as easily as you can chunk a csv. \n",
    "\n",
    "First, you have to save your file into chunks of a size you want. Parquet always saves rows as chunks, but by default each chunk has 50,000,000 rows, which (if you're chunking) may be too many:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save into 50,000 row chunks, \n",
    "# so we should get file saved into two chunks. \n",
    "\n",
    "df.to_parquet('/users/nick/desktop/test.parquet', \n",
    "              engine='fastparquet', \n",
    "              row_group_offsets=50000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "          a           b\n",
      "0      6740  EWCSKOLHUV\n",
      "1      7301  VFGZSHNCKL\n",
      "2      7981  TAOQCCNKHH\n",
      "3       739  QWFDEASRUM\n",
      "4      3250  DQCFYWCFMX\n",
      "...     ...         ...\n",
      "49995  9232  FDRSPGYWQK\n",
      "49996  3512  ZTVPEYODWM\n",
      "49997  3604  JNCMXHQTDB\n",
      "49998  3119  LDHOQVNKZP\n",
      "49999  6682  PDQAPAIHLV\n",
      "\n",
      "[50000 rows x 2 columns]\n",
      "          a           b\n",
      "0      1922  ZOGCWFRWHZ\n",
      "1      8675  NRZMUNYKZO\n",
      "2      4979  MCWILQYNYX\n",
      "3      2481  FPEEHRFGKA\n",
      "4      2829  AQYRGYKOOC\n",
      "...     ...         ...\n",
      "49995  4896  LWCZOQLANI\n",
      "49996  7124  ARRWJCYPCO\n",
      "49997   459  YELIRSTLGV\n",
      "49998  5169  PBRGOPUJXJ\n",
      "49999  8696  GAWEGKTKXI\n",
      "\n",
      "[50000 rows x 2 columns]\n"
     ]
    }
   ],
   "source": [
    "# Then we have to read it in using the `fastparquet` \n",
    "# library itself (there's no way to do this directly from \n",
    "# pandas I'm afraid):\n",
    "\n",
    "from fastparquet import ParquetFile\n",
    "pf = ParquetFile('/users/nick/desktop/test.parquet')\n",
    "\n",
    "# Iterates over row groups\n",
    "for rg in pf.iter_row_groups():\n",
    "    print(rg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>16717</th>\n",
       "      <td>1704</td>\n",
       "      <td>OBREANGVDA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10232</th>\n",
       "      <td>3898</td>\n",
       "      <td>FTLOBLTJAA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5049</th>\n",
       "      <td>6750</td>\n",
       "      <td>SFVMHGQEYN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24924</th>\n",
       "      <td>6396</td>\n",
       "      <td>JKLIAJOLZZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24333</th>\n",
       "      <td>4622</td>\n",
       "      <td>PEVOFCIOXP</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>45816</th>\n",
       "      <td>2446</td>\n",
       "      <td>MXYZROVFIV</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12347</th>\n",
       "      <td>7040</td>\n",
       "      <td>XVQUNBQDEK</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>42041</th>\n",
       "      <td>6730</td>\n",
       "      <td>HMAOJWGSTH</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13623</th>\n",
       "      <td>8652</td>\n",
       "      <td>USBEMPWYAH</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44521</th>\n",
       "      <td>250</td>\n",
       "      <td>WTWFZMRKRJ</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          a           b\n",
       "16717  1704  OBREANGVDA\n",
       "10232  3898  FTLOBLTJAA\n",
       "5049   6750  SFVMHGQEYN\n",
       "24924  6396  JKLIAJOLZZ\n",
       "24333  4622  PEVOFCIOXP\n",
       "45816  2446  MXYZROVFIV\n",
       "12347  7040  XVQUNBQDEK\n",
       "42041  6730  HMAOJWGSTH\n",
       "13623  8652  USBEMPWYAH\n",
       "44521   250  WTWFZMRKRJ"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# So you can filter and work with each group if you want:\n",
    "all_rows_w_even_a = []\n",
    "for rg in pf.iter_row_groups():\n",
    "    rg = rg.loc[(rg['a'] % 2) == 0]\n",
    "    all_rows_w_even_a.append(rg)\n",
    "\n",
    "even_rows = pd.concat(all_rows_w_even_a)\n",
    "even_rows.sample(10)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}