{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Merging Data to Understand the Relationship between Drug Legalization and Violent Crime\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In recent years, many US states have decided to legalize the use of marijuana. \n", "\n", "When these ideas were first proposed, there were many theories about the relationship between crime and the \"War on Drugs\" (the term given to US efforts to arrest drug users and dealers over the past several decades). \n", "\n", "In this exercise, we're going to test a few of those theories using drug arrest data from the state of California. \n", "\n", "Though California has passed a number of laws lessening penalties for marijuana possession over the years, arguably the biggest changes were in 2010, when the state changed the penalty for possessing a small amount of marijuana from a criminal offense to a \"civil\" penalty (meaning those found guilty only had to pay a fine, not go to jail), though possessing, selling, or producing larger quantities remained illegal. Then in 2016, the state fully legalized marijuana for recreational use, not only making possession of small amounts legal, but also creating a regulatory system for producing marijuana for sale. \n", "\n", "Proponents of drug legalization have long argued that the war on drugs contributes to violent crime by creating an opportunity for drug dealers and organized crime to sell and distribute drugs, a business which tends to generate violence when gangs battle over territory. According to this theory, we should see violent crime decrease after legalization in places where drug arrests had previously been common. \n", "\n", "**To be clear,** this is far from the only argument for drug legalization! It is simply the argument we are well positioned to analyze today. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Gradescope Autograding\n", "\n", "Please follow [all standard guidance](https://www.practicaldatascience.org/html/autograder_guidelines.html) for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called `results` and ensuring your notebook runs from the start to completion without any errors.\n", "\n", "For this assignment, please name your file `exercise_merging.ipynb` before uploading.\n", "\n", "You can check that you have answers for all questions in your `results` dictionary with this code:\n", "\n", "```python\n", "assert set(results.keys()) == {\n", " \"ex6_validate_keyword\",\n", " \"ex10_merged_successfully\",\n", " \"ex16_num_obs\",\n", " \"ex17_drug_change\",\n", " \"ex18_violent_change\",\n", " \"ex21_diffindiff\",\n", " \"ex23_diffindiff_proportionate\",\n", "}\n", "```\n", "\n", "\n", "### Submission Limits\n", "\n", "Please remember that you are **only allowed three submissions to the autograder.** Your last submission (if you submit 3 or fewer times), or your third submission (if you submit more than 3 times) will determine your grade. Submissions that error out will **not** count against this total." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pre-Legalization Analysis\n", "\n", "### Exercise 1\n", "We will begin by examining [county-level data on arrests from California in 2009](https://github.com/nickeubank/practicaldatascience/tree/master/Example_Data/ca). This data is derived directly from data hosted by the [Office of the California State Attorney General](https://openjustice.doj.ca.gov/data), but please follow the github link above and download and import the file `ca_arrests_2009.csv` (don't try to get it directly from the State Attorney General's office). 
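
If you're not sure how to import a csv file, here is a minimal sketch of the pattern using a toy in-memory file (the column names below are invented for illustration — for the exercise itself, pass the path to your downloaded `ca_arrests_2009.csv` to `pd.read_csv` instead):

```python
import pandas as pd
from io import StringIO

# Toy stand-in for a downloaded csv file. In the exercise you would pass
# the path to ca_arrests_2009.csv instead of this StringIO object, and the
# real column names come from the codebook, not this made-up example.
toy_csv = StringIO("COUNTY,VIOLENT,F_DRUGOFF\nAlameda,4000,5000\nButte,300,400\n")
arrests = pd.read_csv(toy_csv)

# Quick sanity checks worth running on whatever you load:
print(arrests.shape)
print(arrests.head())
```

`pd.read_csv` also accepts a URL, so you can pass the *raw* github file address directly if you prefer — just make sure it points at the raw csv, not the github preview page.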
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2\n", "Use your data exploration skills to get a feel for this data. If you need to, you can find the [original codebook here](https://github.com/nickeubank/practicaldatascience/blob/master/Example_Data/ca/arrests_codebook.pdf). (Our data is a version of that data, collapsed to one observation per county.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3\n", "Figuring out what county has the most violent arrests isn't very meaningful if we don't normalize for size. A county with 10 people and 10 arrests for violent crimes is obviously worse than a county with 1,000,000 people and 11 arrests for violent crime. \n", "\n", "To address this, also import `nhgis_county_populations.csv` from [the directory we're working from](https://github.com/nickeubank/practicaldatascience/tree/master/Example_Data/ca)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 4\n", "Use your data exploration skills to get a feel for this data, too, and figure out how it relates to your 2009 arrest data. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 5\n", "\n", "Once you feel like you have a good sense of the relationship between our arrest and population data, merge the two datasets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 6\n", "\n", "When merging data, the result will only be meaningful if your understanding of how the data sets you are merging relate to one another is correct. In some ways, this is obvious — for example, if the variable(s) that you are using to merge observations in the two datasets fail to actually identify observations that should be linked, then obviously merging on those variables will create a meaningless new dataset.\n", "\n", "But other properties that matter are often more subtle. 
For example, it's important to figure out whether your merge is a `1-to-1` merge (meaning there is only one observation per value of the variable you're merging on in both datasets), a `1-to-many` merge (meaning there is only one observation per value of the variable you're merging on in the first dataset, but multiple observations in the second), or a `many-to-many` merge (something you almost never do). \n", "\n", "Being correct in your assumptions about these things is *very* important. If you think there's only one observation per value of your merging variable in each dataset, but there are in fact 2, you'll end up with two observations for each value after the merge. Moreover, not only is the structure of your data now a mess, but the fact you were wrong means you didn't understand something about your data. \n", "\n", "So before running a merge, it is critical to answer the following questions:\n", "\n", "a) What variable(s) do you think will be consistent across these two datasets that you can use for merging? \n", "\n", "b) Do you think there will be exactly 1 observation for each value of this variable(s) in your arrest data?\n", "\n", "c) Do you think there will be exactly 1 observation for each value of this variable(s) in your population data?\n", "\n", "So in markdown, answer these three questions for this data.\n", "\n", "\n", "Then also specify the type of merge you were hoping to accomplish as one of the following strings — `\"one-to-one\"`, `\"one-to-many\"`, `\"many-to-one\"`, or `\"many-to-many\"` — in your `results` dictionary under the key `\"ex6_validate_keyword\"`. Assume that the first dataset we are talking about (e.g., the `one` in `one-to-many`, if that were your selection) is your arrests data and the second dataset (e.g., the `many` in `one-to-many`, if that were your selection) is your population data." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Merge Validation\n", "\n", "Because of the importance of answering these questions accurately, `pandas` provides a utility for *validating* these assumptions when you do a merge: the `validate` keyword! Validate will accept `\"1:1\"`, `\"1:m\"`, `\"m:1\"`, and `\"m:m\"`. It will then check to make sure your merge matches the type of merge you think it is. I *highly* recommend always using this option (...and not just because I'm the one who added `validate` to pandas).\n", "\n", "*Note:* `validate` only actually tests if observations are unique when a `1` is specified; if you do a `1:1` merge but pass `validate=\"1:m\"`, `validate=\"m:1\"`, or `validate=\"m:m\"`, you won't get an error — a one-to-many merge that turns out to be a one-to-one isn't nearly as dangerous as a one-to-one merge that turns out to be one-to-many.\n", "\n", "### Exercise 7\n", "\n", "Repeat the merge you conducted above, but this time use the `validate` keyword to make sure your assumptions about the data were correct. If you find that you made a mistake, revise your data until the merge you think is correct actually takes place.\n", "\n", "To aid the autograder, please make sure to comment out any code that generates an error.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 8\n", "\n", "Were your assumptions about the data correct? If not, what had you (implicitly) assumed when you did your merge in Exercise 5 that turned out not to be correct?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Merge Diagnostics\n", "\n", "### Exercise 9\n", "\n", "Checking whether you are doing a 1-to-1, many-to-1, 1-to-many, or many-to-many merge is only the first type of diagnostic test you should run on *every* merge you conduct. The second test is to see if your data merged successfully!\n", "\n", "To help with this, the `merge` function in pandas offers a keyword option called `indicator`. 
If you set `indicator` to `True`, then pandas will add a column to the result of your merge called `_merge`. This variable will tell you, for each observation in your merged data, whether: \n", "\n", "- that observation came from a successful merge of both datasets, \n", "- that observation was in the left dataset (the first one you passed) but not the right dataset (the second one you passed), or \n", "- that observation was in the right dataset but not the left. \n", "\n", "This allows you to quickly identify failed merges!\n", "\n", "For example, suppose you had the following data:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " key df1_var\n", "0 key1 1\n", "1 key2 2" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "pd.set_option(\"mode.copy_on_write\", True)\n", "\n", "df1 = pd.DataFrame({\"key\": [\"key1\", \"key2\"], \"df1_var\": [1, 2]})\n", "df1" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " key df2_var\n", "0 key1 a\n", "1 Key2 b" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = pd.DataFrame({\"key\": [\"key1\", \"Key2\"], \"df2_var\": [\"a\", \"b\"]})\n", "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now suppose you *expected* that all observations should merge when you merge these datasets (because you hadn't noticed the typo in `df2` where `key2` has a capital `Key2`). If you just run a merge, it works without any problems:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "new_data = pd.merge(df1, df2, on=\"key\", how=\"outer\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And so you might carry on in life unaware your data is now corrupted: instead of two merged rows, you now have 3, only 1 of which merged correctly!" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " key df1_var df2_var\n", "0 key1 1.0 a\n", "1 key2 2.0 NaN\n", "2 Key2 NaN b" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When what you really wanted was: " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " key df1_var df2_var\n", "0 key1 1 a\n", "1 key2 2 b" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2_correct = df2.copy()\n", "df2_correct.loc[df2.key == \"Key2\", \"key\"] = \"key2\"\n", "pd.merge(df1, df2_correct, on=\"key\", how=\"outer\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(in a small dataset, you'd quickly see you have 3 rows instead of 2, but if you have millions of rows, a couple of bad rows won't be evident). \n", "\n", "But now suppose we use the `indicator` keyword:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "_merge\n", "left_only 1\n", "right_only 1\n", "both 1\n", "Name: count, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_data = pd.merge(df1, df2, on=\"key\", how=\"outer\", indicator=True)\n", "new_data._merge.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could immediately see that only one observation merged correctly, and that one row from each dataset failed to merge!\n", "\n", "Moreover, we can look at the failed merges:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " key df1_var df2_var _merge\n", "1 key2 2.0 NaN left_only\n", "2 Key2 NaN b right_only" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_data[new_data._merge != \"both\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Allowing us to easily diagnose the problem. \n", "\n", "**Note:** The `pandas` merge function allows users to decide whether to keep only observations that merge (`how='inner'`), all the observations from the first dataset passed to merge (`how='left'`), all the observations from the second dataset passed to merge (`how='right'`), or all observations (`how='outer'`):\n", "\n", "![join_types](https://nickeubank.github.io/practicaldatascience_book/_images/3.4.15_merge_types.png)\n", "\n", "But one danger of using the more restrictive options (like the default, `how='inner'`) is that the merge throws away all the observations that fail to merge, and while this may be the *eventual* goal of your analysis, it means that you don't get to see all the observations that failed to merge that maybe you thought *would* merge. In other words, it throws away the errors so you can't look at them! \n", "\n", "So to use `indicator` effectively, you have to:\n", "\n", "- Not use `how=\"inner\"`, and\n", "- Check the values of `_merge` after your merge. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 10\n", "\n", "Now repeat your previous merge using *both* the `validate` keyword *and* the `indicator` keyword with `how='outer'`. \n", "\n", "How many observations successfully merged (were in both datasets)? Store the result in `results` under the key `\"ex10_merged_successfully\"`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 11\n", "\n", "You *should* be able to get to the point that all counties in our arrest data merge with population data. If that did not happen, can you figure out why? 
Can you fix the data so that all arrest data merges with population data?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing Arrest Rates\n", "\n", "### Exercise 12\n", "\n", "Now that we have arrest counts and population data, we can calculate arrest *rates*. For each county, create a new variable called `violent_arrest_rate_2009` that is the number of violent arrests for 2009 divided by the population of the county from 2005-2009, and a similar new variable called `drug_arrest_rate_2009` for drug arrests divided by population." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 13\n", "\n", "Make a scatter plot that shows the relationship between each county's violent arrest rate and its drug arrest rate. Since we haven't done a lot with plotting yet, feel free to plot in whatever manner feels most comfortable. The easiest, if you're unsure, is just to use the `pandas` built-in `.plot()` method. Just specify the `x` keyword with your x-axis variable, the `y` keyword with your y-axis variable, and use `kind=\"scatter\"`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 14\n", "\n", "Based on this simple comparison of 2009 violent arrest rates and drug arrest rates, what might you conclude about the relationship between the illegal drug trade and violent crime?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing with 2018 Arrests\n", "\n", "The preceding analysis can tell us about whether violent crime and the drug trade are correlated, but it doesn't tell us much about whether they are *causally* related. It *could* be the case that people dealing drugs *cause* more violent crime, but it could also be that certain communities, for some other reason, tend to have *both* more drug sales *and* more violent crime. \n", "\n", "To help answer this question, let's examine whether violent crime arrest rates changed in response to drug legalization. 
In particular, let's do this by comparing violent crime arrest rates in 2009 (before drug legalization) to violent crime arrest rates in 2018 (after drug legalization). If the illegal drug trade causes violent crime, then we would expect the violent crime rate to fall in response to drug legalization.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 15\n", "\n", "Just as we created violent arrest rates and drug arrest rates for 2009, now we want to do it for 2018. Using the data on 2018 arrests (also in the [same repository](https://github.com/nickeubank/practicaldatascience/tree/master/Example_Data/ca) we used before) and the same dataset of population data (you'll have to use population from 2013-2017, as 2018 population data has yet to be released), create a dataset of arrest rates. \n", "\n", "As before, *be careful with your merges!!!*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 16\n", "\n", "Now merge our two county-level datasets so you have one row for every county, and variables for violent arrest rates in 2018, violent arrest rates in 2009, felony drug arrest rates in 2018, and felony drug arrest rates in 2009. Store the number of observations in your final data set in your `results` dictionary under the key `\"ex16_num_obs\"`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 17\n", "\n", "Did drug arrests go down from 2009 to 2018 in response to drug legalization? (They'd better have! This is what's called a \"sanity check\" of your data and analysis. If you find drug arrests went *up*, you know something went wrong with your code or your understanding of the situation.) \n", "\n", "Store the average county-level change in drug arrests per capita in `results` under the key `\"ex17_drug_change\"`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 18\n", "\n", "Now we want to look at whether violent crime decreased following drug legalization. 
Did the average violent arrest rate decrease? By how much? (Note: We're assuming that arrest rates are proportionate to crime rates. If policing increased so that there were more arrests per crime committed, that would impact our interpretation of these results. But this is just an exercise, so...)\n", "\n", "Store the average county-level change in violent arrests per capita in `results` under the key `\"ex18_violent_change\"`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 19\n", "\n", "Based on your answers to exercises 17 and 18, what might you conclude about the relationship between the illegal drug trade and violent crime? Did legalizing drugs increase violent crime (assuming arrest rates are a good proxy for crime rates)? Decrease violent crime? Have no effect? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Difference in Difference Analysis\n", "\n", "The preceding analysis is something we sometimes call a \"pre-post\" analysis, in that it is a comparison of how an outcome we care about (violent arrest rates) changes from before a treatment is introduced (\"pre\") to after (\"post\"). BUT: pre-post comparisons are imperfect. If we knew that violent crime was not going to change at all in a world without drug legalization, then this comparison is perfectly valid. But what if, absent drug legalization, violent crime would have fallen on its own (maybe because of advances in policing practices or a better economy)? Or maybe it would have increased?\n", "\n", "This is actually a very common problem. For example, imagine you're trying to figure out whether taking tylenol helps with headaches. You have a patient with a headache, you give them tylenol, and then the next day you ask them if they still have a headache, and find out that they don't — does that mean that tylenol cured the headache? Maybe... 
but most headaches eventually resolve on their own, so maybe the headache would have passed with or without the patient taking tylenol! In fact, there's a term for this phenomenon in medicine — the \"natural history\" of a disease, which is the trajectory that we think a disease might follow absent treatment. And the natural history of the disease is almost never for it to stay exactly the same indefinitely.\n", "\n", "(All of this is closely related to the discipline of causal inference, and if it makes your head hurt, don't worry — that means you're doing it right! We will talk lots and lots more about it in the weeks and months to come.)\n", "\n", "One way to try to overcome this problem is with something called a difference-in-difference analysis. Rather than just looking at whether violent arrest rates increase or decrease between 2009 and 2018, we can split our sample of counties into those that were *more* impacted by drug legalization and those that were *less* impacted by drug legalization and evaluate whether we see a greater change in the violent arrest rate in the counties that were more impacted. \n", "\n", "What does it mean to have been \"more impacted\" by drug legalization? In this case, we can treat the counties that had higher drug arrest rates in 2009 as counties that were more impacted by drug legalization than those that had low drug arrest rates in 2009. After all, in a county that had no drug arrests, legalization wouldn't do anything, would it? \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise 20\n", "\n", "First, split our sample into two groups: counties with high drug arrest rates in 2009, and counties with low drug arrest rates in 2009 (cut the sample at the average drug arrest rate in 2009). 
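
If you're not sure how to make that split, here is a minimal sketch of the pattern on toy data (`drug_arrest_rate_2009` is the variable you created in Exercise 12; the county names, numbers, and the `high_drug_2009` column name are all invented for illustration):

```python
import pandas as pd

pd.set_option("mode.copy_on_write", True)

# Toy stand-in for your merged county-level dataset.
toy = pd.DataFrame(
    {
        "county": ["A", "B", "C", "D"],
        "drug_arrest_rate_2009": [0.001, 0.004, 0.002, 0.009],
    }
)

# Flag counties whose 2009 drug arrest rate is above the sample average.
toy["high_drug_2009"] = (
    toy["drug_arrest_rate_2009"] > toy["drug_arrest_rate_2009"].mean()
)
print(toy)
```

You can then use the boolean column to compare the two groups (e.g., `toy[toy["high_drug_2009"]]` and `toy[~toy["high_drug_2009"]]`).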
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise 21\n", "\n", "Now, determine whether violent crime changed *more* from 2009 to 2018 in the counties that had lots of drug arrests in 2009 (where legalization likely had more of an effect) than in counties with fewer drug arrests in 2009 (where legalization likely mattered less). \n", "\n", "Calculate this difference-in-difference:\n", "\n", "```\n", "(the change in violent crime rate per capita for counties with lots of drug arrests in 2009) \n", "- (the change in violent crime rate per capita for counties with few drug arrests in 2009)\n", "\n", "```\n", "\n", "Store your \"difference-in-difference\" estimate in your `results` dictionary under the key `\"ex21_diffindiff\"`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 22\n", "\n", "Interpret your difference in difference result." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 23\n", "\n", "The quantity we estimated above is a little difficult to interpret. Rather than calculating the *absolute* change in violent arrest rates per capita, let's calculate the *proportionate* change.\n", "\n", "Calculate:\n", "\n", "```\n", "(the county-level percentage change in violent crime rate for counties with lots of drug arrests in 2009) \n", "- (the county-level percentage change in violent crime rate for counties with few drug arrests in 2009)\n", "```\n", "Store your \"difference-in-difference\" estimate in your `results` dictionary under the key `\"ex23_diffindiff_proportionate\"`. Report your result in percentages, such that a value of `-100` would imply that the average county experienced a 100% decrease in the violent arrest rate." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3.10.6 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "vscode": { "interpreter": { "hash": "718fed28bf9f8c7851519acf2fb923cd655120b36de3b67253eeb0428bd33d2d" } } }, "nbformat": 4, "nbformat_minor": 4 }