{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Maternal Smoking and Birth Weight\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "\n",
                "## Gradescope Autograding\n",
                "\n",
                "Please follow [all standard guidance](https://www.practicaldatascience.org/html/autograder_guidelines.html) for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called `results` and ensuring your notebook runs from the start to completion without any errors.\n",
                "\n",
                "For this assignment, please name your file `exercise_statsmodels.ipynb` before uploading.\n",
                "\n",
                "You can check that you have answers for all questions in your `results` dictionary with this code:\n",
                "\n",
                "```python\n",
                "assert set(results.keys()) == {\n",
                "    \"ex2_birth_weight_lower\",\n",
                "    \"ex3_smoking_coefficient\",\n",
                "    \"ex4_smoking_coefficient\",\n",
                "    \"ex6_compare_white_black\",\n",
                "    \"ex6_compare_white_black_tstat\",\n",
                "    \"ex7_predictions\",\n",
                "    \"ex8_smoking_coefficient\",\n",
                "    \"ex10_num_high_drug_arrest\",\n",
                "    \"ex11_diff_in_diff\",\n",
                "    \"ex12_diff_in_diff_weighted\",\n",
                "}\n",
                "```\n",
                "\n",
                "\n",
                "### Submission Limits\n",
                "\n",
                "Please remember that you are **only allowed three submissions to the autograder.** Your last submission (if you submit 3 or fewer times), or your third submission (if you submit more than 3 times) will determine your grade Submissions that error out will **not** count against this total."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "\n",
                "## Exercises\n",
                "\n",
                "For these exercises, we will borrow some data and exercises from another MIDS course on Statistical Modeling:\n",
                "\n",
                "These days, it is widely understood that mothers who smoke during pregnancy risk exposing their babies to many health problems. This was not common knowledge fifty years ago. One of the first studies that addressed the issue of pregnancy and smoking was the Child Health and Development Studies, a comprehensive study of all babies born between 1960 and 1967 at the Kaiser Foundation Hospital in Oakland, CA. The original reference for the study is Yerushalmy (1964, American Journal of Obstetrics and Gynecology, pp. 505-518). The data and a summary of the study are in Nolan and Speed (2000, Stat Labs, Chapter 10) and can be found at [the book’s website.](https://www.stat.berkeley.edu/users/statlabs/)\n",
                "\n",
                "There were about 15,000 families in the study. We will only analyze a subset of the data, in particular 1236 male single births where the baby lived at least 28 days. The researchers interviewed mothers early in their pregnancy to collect information on socioeconomic and demographic characteristics, including an indicator of whether the mother smoked during pregnancy. The variables in the dataset are described in the [code book here](https://ids-702-f19.github.io/Course-Website/hw/hw-02.html). In this exercise, we will attempt to use this data to answer the following questions:\n",
                "\n",
                "1. Do mothers who smoke tend to give birth to babies with lower weights than mothers who do not smoke?\n",
                "2. What is a likely range for the difference in birth weights for smokers and non-smokers?\n",
                "3. Is there any evidence that the association between smoking and birth weight differs by mother’s race? If so, characterize those differences.\n",
                "4. Are there other interesting associations with birth weight that are worth mentioning?\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Exercise 1\n",
                "Load the data \"smoking.csv\", which includes information on both biometrics of infants at birth, and information on mothers (variables prefixed with the letter \"m\"), from [this MIDS repo](https://github.com/nickeubank/MIDS_Data). (Yup, I'm giving you CLEAN DATA! I think this is the only time I've done this in this course! Enjoy it. :))."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Exercise 2\n",
                "\n",
                "Start by plotting the relationship between infant weight at birth and gestation (length of pregnancy (in days) at time of birth) for both mothers who smoke and those who do not in a single plot. Include regression lines for both populations (again, all on one plot).\n",
                "\n",
                "Limit attention in the plot (and in subsequent questions) to children born after *more than* 225 days of gestation (there aren't really any observations for parents who smoke for less than that, so we don't get common support).\n",
                "\n",
                "Does it seem like birthweights tend to be lower for the children of parents who smoke at a given gestational period? Store your answer as either `\"yes\"` or `\"no\"` in `ex2_birth_weight_lower`.\n",
                "\n",
                "**NOTE:** This dataset has at least one column name that cause issues with Python (and seaborn. and statsmodels)."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Linear Regression"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "**PLEASE CONTINUE TO SUBSET FOR GESTATION GREATER THAN 225 DAYS for this section**"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Exercise 3\n",
                "Now check this relationship using `statsmodels`. Regress birthweight on gestational period and whether the infant's mother smoked. Is your conclusion from the previous exercise confirmed? Store the coefficient for smoking in results under the key `\"ex3_smoking_coefficient\"`. \n",
                "\n",
                "*Note:* coefficients are stored in the `.params` attribute of your fit model.\n",
                "\n",
                "*Note 2:* you may hit a problem because of the name of one of your columns. You should probably be able to guess the problem given your experience with Python. Also, if you have trouble figuring out how to access the parameters from a fitted model, check out the docs [here](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html).\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Exercise 4\n",
                "Now let's expand our model to also take into account mothers' pregnancy weight and race (make sure to treat race as a categorical variable!). Don't try to do any recoding of `mrace`, just turn it into a categorical. Does your interpretation of the `smoke` coefficient change? Store the new coefficient under the key `\"ex4_smoking_coefficient\"`."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Exercise 5\n",
                "Now let's test for whether there is an interaction between the mother's race and the effect of smoking. \n",
                "\n",
                "Note that race is coded as follows:\n",
                "\n",
                "```\n",
                "mrace    mother’s race or ethnicity\n",
                "         0-5= white\n",
                "         6  = mexican\n",
                "         7 = black\n",
                "         8 = asian\n",
                "         9 = mix\n",
                "         99 = unknown\n",
                "```\n",
                "\n",
                "As most variation in this data is between \"white\" and other categories, first recode race to be an indicator for white and non-white for easier interpretation. "
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Exercise 6\n",
                "Now, using [post-regression test syntax](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLSResults.t_test.html) (not by running a new regression on a subpopulation), recover the coefficient and t-statistic for *whether smoking reduces birth weight for white mothers.*  Store the t-statistic of your test in `ex6_compare_white_black_tstat`.\n",
                "\n",
                "**Hint:** Read to the BOTTOM of the post-regression test syntax documentation for how to specify hypotheses with strings.\n",
                "\n",
                "How does this coefficient compare to that for non-white mothers? Is the reduction in birth weight associated with smoking for white mothers larger or smaller than for non-white mothers? Store your answer as either `\"larger\"` or `\"smaller\"` under the key `\"ex6_compare_white_black\"`.\n",
                "\n",
                "**Note:** Please ensure your t-stat is stored as a single number, not an array or an array of arrays."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Exercise 7\n",
                "Now let's use this model to predict some values. Let's generate some hypothetical newborns:\n",
                "\n",
                "```python\n",
                "newborns = pd.DataFrame(\n",
                "    {\n",
                "        \"smoke\": [1, 1, 0, 0],\n",
                "        \"white\": [True, False, True, False],\n",
                "        \"gestation\": [253, 300, 248, 287],\n",
                "        \"mpregwt\": [132, 129, 140, 139],\n",
                "    }\n",
                ")\n",
                "```\n",
                "\n",
                "Using the model you ran above with gestation, smoke, mpregwt, white, and the interaction of white and smoke, predict birth weights for these newborns. Store your predictions **in a list** under the key `\"ex7_predictions\"`.\n",
                "\n",
                "Note that if you have different data types from those in this pretend dataset or different column names from those in the dataset you used to fit your model, you’ll get an error—just adjust the column names / data types to match the data you used to fit your model."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## statsmodels versus R\n",
                "\n",
                "A quick but important note: the tools that are made available in different packages is often a function of who uses those packages, and how they use them. By and large, nearly all statisticians use R, and so many stats tools (like automatic forward-model-selection or backwards-model-selection) have \"convenience implementations\" (single functions that do all the things you'd want to accomplish) in R, but aren't available as convenience packages in `statsmodels`. That's because `statsmodels` was mostly written by economists and social scientists who tend to feel model selection should be a function of theory not statistical performance (not taking sides: just reporting a difference that exists). \n",
                "\n",
                "To be clear, you can still implement things like forward model selection yourself in Python -- just write a loop that tries different regressors and plots the resulting AICs! -- but it will often take a little more work. (Indeed, you can find examples of people [writing these loops on the web](https://planspace.org/20150423-forward_selection_with_statsmodels/)). And that's not to say that `statsmodels` doesn't have a TON of stuff (take a minute to just look over all the [examples of cool things you can do in statsmodels here](http://www.statsmodels.org/stable/examples/index.html)). Just don't be surprised if you ommissions that seem odd given what you see emphasized in stats courses. \n",
                "\n",
                "This is one of the reasons that languages are sticky: once a group of people have invested in adding all the bells and whistles they like to a language, there are good reasons to not move to another language, even if the other language has some advantages. A statistician who likes the basic language organization of Python more than R, for example, may still stay with R because the packages already do everything a statistician wants to do, and so it's not worth having to re-implement common tasks in a new language. "
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Logistic Regression"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Exercise 8\n",
                "Now, using statsmodels, evaluate the impact of smoking on the likelihood a child is born prematurely (where \"premature\" is defined as gestation of *less than* 252 days).\n",
                "\n",
                "For obvious reasons, **DO NOT USE OUR SUBSET FOR GESTATION GREATER THAN 225 DAYS** from above for this section.\n",
                "\n",
                "For this model, please include mother’s pre-pregnancy weight, smoking status, and whether the mother is White (if you use all the racial categories, the model won’t converge). Please don’t include the interaction we used in the last question. Store the coefficient for smoking in `ex8_smoking_coefficient`."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Weighting\n",
                "\n",
                "Let's now try using statsmodels on another dataset.\n",
                "\n",
                "In our merging exercises, we ran a difference-in-difference analysis on crime in California counties to see if, following drug legalization, there was a larger decline in violent crime in counties that had previously also had high drug arrest rates (to test the idea that violence was being generated by the drug trade, and that legalization would decrease this violence).\n",
                "\n",
                "In those analyses, we treated counties as equally-weighted units of analysis. If we think that each county is a single \"community\", and we think crime and arrests is shaped at the level of communities (particularly when our community division is related to administrative boundaries that impact policing and government services, and is the case with counties), then this is reasonable.\n",
                "\n",
                "But one might think that crime is determined at the individual level, or maybe neighborhood level, and so big counties should \"count more\" in our analysis.\n",
                "\n",
                "(More formally, when we are fitting a regression model that picks parameter values to minimize the overall sum of squared errors, maybe we should pick values that to a better job of minimizing the errors for populous counties at the expense of not doing quite as good a job of minimizing the errors for smaller counties)\n",
                "\n",
                "Here, let's use weighted least squares to weight observations based on population. This will allow bigger counties to influence our estimates more.\n",
                "\n",
                "### Exercise 9\n",
                "To begin, read in our data on California arrest rates and population from [github](https://github.com/nickeubank/MIDS_Data/blob/master/arrest_rates_and_population.parquet). You will see I've reshaped the data to long format for convenience. This data is stored as a parquet file, so you'll need to install an engine (I usually use `fastparquet` to load the data).\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Exercise 10\n",
                "To run our difference-in-difference in a regression framework, we need both an indicator variable for observations that occur after legalization (i.e. year == 2018), and an indicator for the population we consider treated (those who had drug arrest rates above the average 2009 drug arrest rate). Re-create those here. \n",
                "\n",
                "As a check, store the number of counties that have a high drug arrest rate in 2009 under the key `\"ex10_num_high_drug_arrest\"`.\n",
                "\n",
                "*Note:* if you want, you can also use the 2009 drug arrest rates as a continuous variable. This is a kind of \"generalized difference-in-difference\", in which we're just doing the continuous analogue of the thing we do when we split the sample into \"high\" and \"low\" 2009 arrest rates. "
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Exercise 11\n",
                "\n",
                "Now regression the violent arrest rate on your two indicators and their interaction. The coefficient on the interaction is your \"difference-in-difference\" estimate! If you go back to your old homeworks, you should find the coefficient is exactly what you calculated. Store this coefficient under the key  `\"ex11_diff_in_diff\"`."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Exercise 12\n",
                "Now let's do the same analysis, but this time using county population as a weight in a Weighted Least Squares regression (`smf.wls`). Just pass the keyword argument `weights` as a vector to weights (in this case, `arrests['total_population']`). Note that the `weights` argument is *outside* the formula, so you have to pass an actual vector of values, not just the name of a column. \n",
                "\n",
                "Does your interpretation of the effects of drug legalization change? Store the new difference-in-difference coefficient under the key `\"ex12_diff_in_diff_weighted\"`."
            ]
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3.10.6 ('base')",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.11.4"
        },
        "vscode": {
            "interpreter": {
                "hash": "718fed28bf9f8c7851519acf2fb923cd655120b36de3b67253eeb0428bd33d2d"
            }
        }
    },
    "nbformat": 4,
    "nbformat_minor": 4
}