{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Identifying Data Problems" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a perfect world, our datasets would all arrive containing only accurate data formatted in exactly the way we need. Sadly, the world we live in is *not* perfect. Whether because of clumsy data entry, bad translation between data formats, survey participants trying to be jerks, or the coding errors of other data scientists, messy data is a fact of life. As a result, a key part of being a data scientist is learning to (a) seek out and identify data cleanliness problems, and (b) correct them. \n", "\n", "These parts of data science are not nearly as cool or widely discussed as new machine learning packages or the latest learning algorithm, but as anyone who has done data science in the real world will tell you, data cleaning and wrangling will take up a very large portion of your working life. Indeed, it is often said that \"data scientists spend 80% of their time cleaning data and only 20% doing analyses\" (or, in the version I prefer, \"data scientists spend 80% of their time cleaning data and the other 20% of their time complaining about cleaning data.\") That's probably a little high on average—a [recent survey by Anaconda](https://www.anaconda.com/state-of-data-science-2020) found that, on average, data scientists reported data wrangling took up about 45% of their time—but what is undisputed is its importance to being a successful data scientist.\n", "\n", "So in this reading we will discuss tools in pandas for identifying problems before we turn to tools for correcting problems in our next reading!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Identifying Data Problems in Pandas \n", "\n", "Everything I said above may sound... 
well, awful, but here's the good news: once you get into it, data cleaning begins to feel less like grunt work and more like being a detective, and it can actually be a lot of fun.\n", "\n", "\n", "### Your Challenge\n", "\n", "Suppose you have been asked to calculate the overall US employment rate and average incomes for men and women, as well as the employment rate and average income of people in their 30s and 40s. To accomplish this task, you have been handed a 10% sample of the US American Community Survey—a representative survey of residents of the United States published by the US Census Bureau. This is real data, and so like all real data it comes with lots of quirks and oddities we'll need to navigate.\n", "\n", "Moreover, you've even been told that the variable for gender in the dataset is `sex`, the variable for age is `age`, the variable for income is `inctot`, and the variable for whether someone is employed is `empstat`. Should be pretty easy, right? Let's do it!\n", "\n", "So the first thing, of course, is to load the data and subset for the variables we want:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Here we load the American Community Survey data.\n", "# Note that here we're using a data loading trick\n", "# we didn't discuss in detail in our previous readings:\n", "# pandas will read data directly from a URL!\n", "\n", "acs = pd.read_stata(\n", " \"https://github.com/nickeubank/MIDS_Data/blob/master\"\n", " \"/US_AmericanCommunitySurvey/US_ACS_2017_10pct_sample.dta?raw=true\"\n", ")\n", "\n", "# For this exercise, we'll focus on just these five variables,\n", "# which from the official documentation provided by the US Census\n", "# Bureau we know consist of the year the survey was conducted,\n", "# the gender of the respondent, their age, their total income, \n", "# and their employment status.\n", "#\n", "# You can find the documentation for this data here: \n", 
"\n", "acs = acs[[\"year\", \"sex\", \"age\", \"inctot\", \"empstat\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's start to analyze our data. For starters, let's get warmed up by just getting the average income in our data:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1723646.2703978634" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "acs[\"inctot\"].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tada! Well that was easy. Except... that result says the average income in the US is... 1,723,646 dollars. And that's *clearly* not right—the United States is a very economically fortunate country, but the average American is definitely not a millionaire. \n", "\n", "OK, let's try something different—how about the overall employment rate? The person who gave you the data told you that people who are employed have `empstat` values of `\"employed\"`, so this should be easy too!" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.46632017153389926" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(acs[\"empstat\"] == \"employed\").mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Uh.... ok well the US employment rate is definitely not 47%. That'd imply an unemployment rate of 53%!\n", "\n", "So what's going wrong? At this point, there's no way to know, because we jumped to analyzing our data without ever getting to know it. So let's stop, take a step back, and get to know our ACS data before we start trying to analyze it. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting to Know Your Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whenever you get a new dataset, the first thing you probably want to do is just look at a few rows of the data! Seriously. 
Just look at it—it's amazing how often people skip this step. \n", "\n", "While there are many ways to do this, I personally recommend using the `.sample()` method, which will provide a random sample of rows from your data. By default, it only returns 1 row, but you can pass in the number of rows you want to get more (e.g., `.sample(5)` for five rows).\n", "\n", "Note that many people also like to use `.head()` (to get the first 5 rows of the data) or `.tail()` (to get the last 5). These are also fine tools to use, but be aware that because most data has been sorted in some way, the first or last 5 rows are unlikely to be *representative* of the data as a whole. Because `.sample()` gives you a random sample of rows, by contrast, it will give you a better sense of what the average row in your dataset looks like. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>year</th>\n", "      <th>sex</th>\n", "      <th>age</th>\n", "      <th>inctot</th>\n", "      <th>empstat</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>166590</th>\n", "      <td>2017</td>\n", "      <td>male</td>\n", "      <td>62</td>\n", "      <td>170000</td>\n", "      <td>employed</td>\n", "    </tr>\n", "    <tr>\n", "      <th>207895</th>\n", "      <td>2017</td>\n", "      <td>female</td>\n", "      <td>6</td>\n", "      <td>9999999</td>\n", "      <td>n/a</td>\n", "    </tr>\n", "    <tr>\n", "      <th>214500</th>\n", "      <td>2017</td>\n", "      <td>male</td>\n", "      <td>18</td>\n", "      <td>0</td>\n", "      <td>not in labor force</td>\n", "    </tr>\n", "    <tr>\n", "      <th>28863</th>\n", "      <td>2017</td>\n", "      <td>female</td>\n", "      <td>less than 1 year old</td>\n", "      <td>9999999</td>\n", "      <td>n/a</td>\n", "    </tr>\n", "    <tr>\n", "      <th>18280</th>\n", "      <td>2017</td>\n", "      <td>female</td>\n", "      <td>11</td>\n", "      <td>9999999</td>\n", "      <td>n/a</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "
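Even this small random sample already hints at our problems: `inctot` values of 9999999, `age` values like "less than 1 year old", and `empstat` values of "n/a". Beyond eyeballing rows, a quick detective step is to look at the *distribution* of each suspect column. The sketch below uses a small hypothetical stand-in DataFrame that mirrors the quirks visible in the sample above; with the real data you would call the same methods on `acs` directly.

```python
import pandas as pd

# A tiny stand-in frame echoing the oddities in the sampled rows above.
# (Hypothetical data for illustration -- with the real ACS data you
# would run these inspections on the full `acs` DataFrame.)
acs = pd.DataFrame(
    {
        "inctot": [170000, 9999999, 0, 9999999, 9999999],
        "empstat": ["employed", "n/a", "not in labor force", "n/a", "n/a"],
    }
)

# value_counts() surfaces suspicious repeated codes like 9999999
# (a common sentinel for "missing") at a glance.
print(acs["inctot"].value_counts())

# unique() shows every category empstat actually takes -- including
# the "n/a" entries that dragged down our employment rate.
print(acs["empstat"].unique())
```

Spotting that 9999999 dominates `inctot`, and that `empstat` includes people for whom employment status simply doesn't apply (like small children), is exactly the kind of clue we'll need to explain the impossible averages we computed above.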