{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Missing Data\n", "\n", "In our previous reading, we encountered a situation in which observations of total income (`inctot`) in our US American Community Survey data were set to `9999999` for children as a way of indicating that there was no meaningful data available for that group. And while using special numbers—Sentinel Values—to denote a lack of data / missing data is increasingly uncommon, missing data itself is not. There are endless situations in which we want to deliberately communicate that data is unavailable for specific observations. For example, someone taking a survey may have refused to answer certain questions, or a sensor on an industrial machine may have failed to record data for a specific period due to a wiring issue. \n", "\n", "In those situations, we have two main options for indicating an observation is missing in Python: \n", "\n", "- `np.nan`: The most common is the \"Not a Number\" object in the numpy library: `np.nan`. `np.nan` is actually a specific value that can be taking by floating point numbers, so a `np.nan` value can appear in a numeric Series without causing problems. \n", "- `None`: The other way for denoting a missing value is the Python `None` object. As `None` is a Python object, it can only be used in a Series with an `object` datatype. \n", "\n", "While `np.nan` and `None` look like normal entries in a pandas Series or DataFrame, they do exhibit some odd behavior to be aware of. 
\n", "\n", "First, as happens with missing values in many languages, `np.nan == np.nan` will always return false:\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "np.nan == np.nan" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This isn't true for `None`, as `None == None` does return `True`, but as a Data Scientist you'll see `np.nan` more often than `None` since it can represent missing data in a numeric Series." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can cause problems if, for example, you wanted to get all the rows of a DataFrame with missing values. To illustrate, suppose our small world dataset didn't have data for Mozambique (the value was `np.nan`):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryregiongdppcap08polityIV
0BrazilS. America1029618.0
1GermanyW. Europe3561320.0
2MexicoN. America1449518.0
3MozambiqueAfrica855NaN
4RussiaC&E Europe1613917.0
5UkraineC&E Europe727116.0
\n", "
" ], "text/plain": [ " country region gdppcap08 polityIV\n", "0 Brazil S. America 10296 18.0\n", "1 Germany W. Europe 35613 20.0\n", "2 Mexico N. America 14495 18.0\n", "3 Mozambique Africa 855 NaN\n", "4 Russia C&E Europe 16139 17.0\n", "5 Ukraine C&E Europe 7271 16.0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "smallworld = smallworld = pd.read_csv(\n", " \"https://raw.githubusercontent.com/nickeubank/practicaldatascience_book\"\n", " \"/main/notebooks/class_3/week_3/data/world-very-small.csv\"\n", ")\n", "smallworld.loc[smallworld.country == \"Mozambique\", \"polityIV\"] = np.nan\n", "smallworld" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now suppose we wanted to get all the rows with missing polity scores. The natural way to do this would be:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryregiongdppcap08polityIV
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [country, region, gdppcap08, polityIV]\n", "Index: []" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "smallworld.loc[smallworld.polityIV == np.nan]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But as we can see, this doesn't work! Why? Because even when the actual value in the DataFrame *is* `np.nan`, `np.nan == np.nan` doesn't return `True`, and so we don't get any rows. \n", "\n", "To deal with this, pandas offers two utility functions: `pd.isnull(np.nan)` and `pd.notnull(np.nan)` that you can use. So for example, to get rows with no polity scores, we'd run:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryregiongdppcap08polityIV
3MozambiqueAfrica855NaN
\n", "
" ], "text/plain": [ " country region gdppcap08 polityIV\n", "3 Mozambique Africa 855 NaN" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "smallworld.loc[pd.isnull(smallworld.polityIV)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or we can use it as a method too:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryregiongdppcap08polityIV
3MozambiqueAfrica855NaN
\n", "
" ], "text/plain": [ " country region gdppcap08 polityIV\n", "3 Mozambique Africa 855 NaN" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "smallworld.loc[smallworld.polityIV.isnull()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pd.isnull()` also has the advantage of working the same with both `None` and `np.nan`, as:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.isnull(None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So if, for example, our data didn't have a polity score for Mozambique but we just learned what it should be, we could edit that specific value with:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryregiongdppcap08polityIV
0BrazilS. America1029618.0
1GermanyW. Europe3561320.0
2MexicoN. America1449518.0
3MozambiqueAfrica8555.0
4RussiaC&E Europe1613917.0
5UkraineC&E Europe727116.0
\n", "
" ], "text/plain": [ " country region gdppcap08 polityIV\n", "0 Brazil S. America 10296 18.0\n", "1 Germany W. Europe 35613 20.0\n", "2 Mexico N. America 14495 18.0\n", "3 Mozambique Africa 855 5.0\n", "4 Russia C&E Europe 16139 17.0\n", "5 Ukraine C&E Europe 7271 16.0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "smallworld.loc[smallworld.polityIV.isnull(), \"polityIV\"] = 5\n", "smallworld" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Missing Data and .value_counts()\n", "\n", "One important thing to flag is that the `.value_counts()` tool we introduced in our data cleaning exercise will, by default, *ignore* missing values when it provides a list of all unique values and their frequencies. So if, for example, our American Community Survey data used `np.nan` instead of `n/a` for missing values of the employment status, `.value_counts()` would do the following:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "empstat\n", "employed 148758\n", "not in labor force 104676\n", "unemployed 7727\n", "n/a 0\n", "Name: count, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "acs = pd.read_stata(\n", " \"https://github.com/nickeubank/MIDS_Data/blob/master\"\n", " \"/US_AmericanCommunitySurvey/US_ACS_2017_10pct_sample.dta?raw=true\"\n", ")\n", "\n", "acs.loc[acs[\"empstat\"] == \"n/a\", \"empstat\"] = np.nan\n", "acs[\"empstat\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While this can be a useful behavior, sometimes we want to know about observations with missing values. 
In those situations, you have to pass the `dropna=False` keyword argument to `.value_counts()`:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "empstat\n", "employed 148758\n", "not in labor force 104676\n", "NaN 57843\n", "unemployed 7727\n", "n/a 0\n", "Name: count, dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "acs[\"empstat\"].value_counts(dropna=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other Missing Data Tools" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to `pd.isnull()` and `pd.notnull()`, pandas also has some additional missing-data tools that you may find useful:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- `.fillna(x)`: Replaces missing values with the value `x`.\n", "- `.dropna()`: Drops rows with missing values.\n", "- `.count()`: Returns the number of *non-missing* observations. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And that's how, once we've found problems in our data, we can fix them!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Review:\n", "\n", "- Long ago, people represented missing data using \"Sentinel Values\"—numeric values unlikely to occur naturally, like `9999999`.\n", "- Today, Python has two representations of missing data: `np.nan`, which is a special value of floating point numbers, and `None`, which is a Python object. \n", "- A special characteristic of `np.nan` is that `np.nan == np.nan` will always return `False`, so to subset for missing observations, one must use `pd.isnull()` or `pd.notnull()`, not `== np.nan` or `!= np.nan`. 
\n", "- This is not true of `None`, as `None == None` returns `True`, but as a data scientist you're likely to encounter `np.nan` more than `None`.\n", "- Moreover, `pd.isnull()` and `pd.notnull()` treat `np.nan` and `None` identically.\n", "- `.value_counts()` will only report the number of observations that are missing if you use `.value_counts(dropna=False)`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.7 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "vscode": { "interpreter": { "hash": "3e0a5228cb9726a24d36227c69ed0d3aac98cecda769d1c9adb080711d57f90d" } } }, "nbformat": 4, "nbformat_minor": 4 }