{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Groupby and Arrest Data\n", "\n", "In our merging exercises, we examined the relationship between county-level violent arrest totals and county-level drug arrest totals. In those exercises, you were given a dataset that provided you with county-level arrest totals. But that's not actually how the data is provided by the state of California. This week we will work with the *raw* California arrest data, which is not organized by county or even county-year. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Gradescope Autograding\n", "\n", "Please follow [all standard guidance](https://www.practicaldatascience.org/html/autograder_guidelines.html) for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called `results` and ensuring your notebook runs from the start to completion without any errors.\n", "\n", "For this assignment, please name your file `exercise_groupby.ipynb` before uploading.\n", "\n", "You can check that you have answers for all questions in your `results` dictionary with this code:\n", "\n", "```python\n", "assert set(results.keys()) == {\n", " \"ex4_num_rows\",\n", " \"ex5_collapsed_vars\",\n", " \"ex7_alameda_1980_share_violent_arrestees_black\",\n", " \"ex11_white_drug_share\",\n", " \"ex11_black_drug_share\",\n", " \"ex12_proportionate\",\n", "}\n", "```\n", "\n", "\n", "### Submission Limits\n", "\n", "Please remember that you are **only allowed three submissions to the autograder.** Your last submission (if you submit 3 or fewer times), or your third submission (if you submit more than 3 times) will determine your grade Submissions that error out will **not** count against this total." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 1\n", "\n", "Import the raw California arrest data from the State Attorney General's office. Please use [this link](https://github.com/nickeubank/MIDS_Data/blob/master/OnlineArrestData1980-2021.csv) (the original is here [here](https://openjustice.doj.ca.gov/data), but they keep updating it and I get tired of updating solutions, so... please use my copy!)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning the Group Structure of Your Data\n", "\n", "### Exercise 2\n", "\n", "What is the unit of observation for this dataset? In other words, when row zero says that there were 505 arrests for `VIOLENT` crimes, what exactly is that telling you—505 arrests in 1980? 505 arrests in Alameda County?\n", "\n", "(Please answer in Markdown)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Testing Your Assumptions\n", "\n", "It's important to be able to test whether the data you are working with really is organized the way you think it is, especially when working with groupby. Let's discuss how to check your answer to Exercise 2 with the `.duplicated()` method. \n", "\n", "Consider the following toy data:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
social_security_numberssecond_column
0111111111a
1222222222a
2222222222a
3333333333a
4333333333b
\n", "
" ], "text/plain": [ " social_security_numbers second_column\n", "0 111111111 a\n", "1 222222222 a\n", "2 222222222 a\n", "3 333333333 a\n", "4 333333333 b" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.DataFrame(\n", " {\n", " \"social_security_numbers\": [\n", " 111111111,\n", " 222222222,\n", " 222222222,\n", " 333333333,\n", " 333333333,\n", " ],\n", " \"second_column\": [\"a\", \"a\", \"a\", \"a\", \"b\"],\n", " }\n", ")\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to see if there are any duplicate rows in the dataset, we can use the `.duplicated()` method:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 True\n", "3 False\n", "4 False\n", "dtype: bool" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.duplicated()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, `.duplicated()` looks at each row and returns `True` if it has seen the row it is looking at before. \n", "\n", "Notice that `.duplicted()` doesn't return `True` for *all* rows that look the same as another row in the data—it treats the first instance of a row as unique, and only tags *subsequent* repetitions as \"duplicates\" (You can change this behavior with the `keep` keyword argument if you want).\n", "\n", "`.duplicated()` can also be used to test for rows that are duplicates in terms of a sub-set of columns. For example, `df.duplicated([\"social_security_numbers\"])` identifies rows with duplicate values of the variable `social_security_numbers`:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 True\n", "3 False\n", "4 True\n", "dtype: bool" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.duplicated([\"social_security_numbers\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since `duplicated` is now only looking at `social_security_numbers` (not all values in a row), the last row is now a duplicate (because `333333333` is duplicated), where when we considered all columns, it was not a duplicate. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Testing with .duplicated()\n", "\n", "`.duplicated()` can be paired with the `.any()` method to write `assert` tests to detect the presence of duplicates in our dataset and to validate our assumptions about whether certain variables uniquely identify observations in our data.\n", "\n", "When you run `.any()` on an array of Booleans, it returns a single value of `True` if *any* entries are `True`, and a single value of `False` if *no* entries are `True`. (You can also use `.all()` to test if all entries are `False`).\n", "\n", "Thus the command `df.duplicated(['social_security_numbers'])` will return `False` if `social_security_numbers` uniquely identifies every row in our dataset (since there are no duplicates)! But if there *are* any duplicates, it will return `True`:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.duplicated([\"social_security_numbers\"]).any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So if I wanted a test that would raise an `Error` if it turned out that `social_security_numbers` *didn't* uniquely identify observations, I would combine that code with `assert not`:\n", "\n", "```python\n", "assert not df.duplicated([\"social_security_numbers\"]).any()\n", "\n", "---------------------------------------------------------------------------\n", "AssertionError Traceback (most recent call last)\n", "Cell In [20], line 1\n", "----> 1 assert not df.duplicated([\"social_security_numbers\"]).any()\n", "\n", "AssertionError: \n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(which in this case raises an Error because `social_security_numbers` *isn't* unique!)\n", "\n", "BTW, if this feels oddly familiar, that's probably because we talked about this same idea when discussing merging. A `1:1` merge is one in which the merging variables uniquely identify observations in both datasets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3\n", "\n", "Use `duplicated` to test if the variables *you* think uniquely identify rows in your data really do uniquely identify rows. If you were wrong, update your beliefs (and your code) until you have an `assert` statement that passes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 4\n", "\n", "Once you have a handle on how the data looks now, please **collapse the data** to be one observation per county-year-racial group. Store the number of rows in the resulting dataframe in your `results` dictionary under the key `\"ex4_num_rows\"`.\n", "\n", "**Hint:** Think carefully about the most appropriate *aggregation function* (e.g., mean, median, sum, count) given the data we're working with!\n", "\n", "**Hint 2:** When using `groupby`, I am a HUGE fan of the `as_index=False` keyword argument (e.g., `df.groupby(\"col1\", as_index=False).mean()`). By default when you use groupby, pandas likes to make a new index out of the grouping columns. As you know, I HATE indices, and this keyword argument prevents pandas from this behavior. To illustrate:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
social_security_numberssecond_columnthird_column
0111111111a100
1222222222a200
2222222222b300
3222222222a400
4333333333b500
\n", "
" ], "text/plain": [ " social_security_numbers second_column third_column\n", "0 111111111 a 100\n", "1 222222222 a 200\n", "2 222222222 b 300\n", "3 222222222 a 400\n", "4 333333333 b 500" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(\n", " {\n", " \"social_security_numbers\": [\n", " 111111111,\n", " 222222222,\n", " 222222222,\n", " 222222222,\n", " 333333333,\n", " ],\n", " \"second_column\": [\"a\", \"a\", \"b\", \"a\", \"b\"],\n", " \"third_column\": [100, 200, 300, 400, 500],\n", " }\n", ")\n", "df" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
third_column
social_security_numberssecond_column
111111111a100.0
222222222a300.0
b300.0
333333333b500.0
\n", "
" ], "text/plain": [ " third_column\n", "social_security_numbers second_column \n", "111111111 a 100.0\n", "222222222 a 300.0\n", " b 300.0\n", "333333333 b 500.0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Without as_index=False\n", "df.groupby([\"social_security_numbers\", \"second_column\"]).mean()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
social_security_numberssecond_columnthird_column
0111111111a100.0
1222222222a300.0
2222222222b300.0
3333333333b500.0
\n", "
" ], "text/plain": [ " social_security_numbers second_column third_column\n", "0 111111111 a 100.0\n", "1 222222222 a 300.0\n", "2 222222222 b 300.0\n", "3 333333333 b 500.0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# With as_index=False\n", "df.groupby([\"social_security_numbers\", \"second_column\"], as_index=False).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 5\n", "\n", "You should notice that there are significantly fewer rows in the data now. Given your answer from 3, what groups were you *collapsing* in question 4 (in other words, which different groups were previously split out into separate rows but have now been aggregated into the same rows)?\n", "\n", "Store the names of those variables **in a `list` in alphabetical order** in `results` as `\"ex5_collapsed_vars\"`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Racial Arrest Composition\n", "\n", "The next question we want to ask is: does the racial composition of arrests in each county vary by arrest type? In other words, do Blacks, Hispanics, Whites, and Other make up substantially different portions of the people arrested for drug offenses and violent offenses?\n", "\n", "To answer this question, we need to be able to say what proportion of all drug arrestees were Black/White/Hispanic/Other and what proportion of all violent arrestees were Black/White/Hispanic/Other *for each county and for each year*.\n", "\n", "In trying to do this, we'll need to break the problem down into pieces, starting from our goal (these quantities) and working backwards.\n", "\n", "### Exercise 6\n", "\n", "To calculate the share of all drug and violent arrestees who come from each racial group, what quantities do we need? \n", "\n", "Expressed differently, what two variables (one for drug arrests, one for violent arrests) do we need to add to our data to it allow us to calculate the share of arrestees in each county-year that come from a given racial group?\n", "\n", "Calculate these two variables (one variable for drug arrests, one for violent arrests). \n", "\n", "**Hint:** `transform` should probably make an appearance here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Hint:** This types of groupby manipulations are an easy place to do things wrong, so it's important to eye-ball your data to be sure you did things right! \n", "\n", "Take a look at the data for, say, Alameda County in 1980, and compare the values that you calculate above with what you get if you try to calculate those same numbers by hand using the raw data from Exercise 5?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 7\n", "\n", "Now calculate the share of violent and drug arrestees in each county and year from each racial group.\n", "\n", "To ensure you're on the right track, store the share (between zero and one) of violent arrestees in Alameda County in 1980 who were Black as `ex7_alameda_1980_share_violent_arrestees_black`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 8\n", "\n", "We're about to start studying this data by plotting the share of violent arrestees that are Black against the share of felony drug arrestees that are Black. But the moment where you finish your data manipulations and are about to start you data analysis is a *great* time to just make sure everything in your data looks good. Let's run a few checks:\n", "\n", "- Are your values of the share of felony arrestees who were arrested for violent crimes sensible?\n", "- You're about to analyze the data using only the rows for the Black racial group. How many unique counties are there with data for Black arrestees? How many for White arrestees? Do you remember how many counties there are in CA (google is your friend if not!)?\n", "\n", "Include assert statements to check all these features of your data.\n", "\n", "(Reminder: the autograder will only work if you are entire notebook runs without errors.)\n", "\n", "If you encounter any problems, make sure you understand why they are happening, and fix them (if a fix is appropriate!)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 9\n", "\n", "Now check to see if you have the same number of counties *in each year for each racial group*.\n", "\n", "Can you figure out why there aren't? (Hint: this isn't trivial to figure out! You'll probably have to do some detective work, and even then you'll have to make a bit of a \"best guess\" at what's going on!)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 10\n", "\n", "Once you think you have a theory of the case, fix the problem. Again, this will NOT be trivial." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 11\n", "\n", "Now that you have corrected your problem, averaging over the values for each county-year, calculate the average share of drug arrestees who are Black and the average share of drug arrestees who are White. Store the results as `\"ex11_white_drug_share\"` and `\"ex11_black_drug_share\"`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 12\n", "\n", "Plot the share of violent arrestees that are Black against the share of felony drug arrestees that are Black. Do they look proportionate? Answer in text here.\n", "\n", "Do Black arrestees make up a greater share of violent arrestees, or a greater share of drug arrestees? Store you answer as `\"greater violent\"`, or `\"greater drug\"` in `\"ex12_proportionate\"`.\n", "\n", "**Hint:** You can add a 45 degree reference line by making a dummy dataset with one point at 0,0 and one point at a location near the top of your data (e.g. 0.7, 0.7). Without a reference line it's VERY hard to be sure if how your points relate to proportionality." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(A quick note of warning on interpretation: these results can tell you whether Black Californians make up a larger proportion of *arrests* for certain types of crimes, not whether they make up a larger proportion of people who *commit* a give type of crime! Those *might* be the same, but they might not... this data just can't answer that question.)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.10.6 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "vscode": { "interpreter": { "hash": "718fed28bf9f8c7851519acf2fb923cd655120b36de3b67253eeb0428bd33d2d" } } }, "nbformat": 4, "nbformat_minor": 4 }