Maternal Smoking and Birth Weight

For these exercises, we will borrow some data and exercises from another MIDS course on Statistical Modeling:

These days, it is widely understood that mothers who smoke during pregnancy risk exposing their babies to many health problems. This was not common knowledge fifty years ago. One of the first studies that addressed the issue of pregnancy and smoking was the Child Health and Development Studies, a comprehensive study of all babies born between 1960 and 1967 at the Kaiser Foundation Hospital in Oakland, CA. The original reference for the study is Yerushalmy (1964, American Journal of Obstetrics and Gynecology, pp. 505-518). The data and a summary of the study are in Nolan and Speed (2000, Stat Labs, Chapter 10) and can be found at the book’s website.

There were about 15,000 families in the study. We will only analyze a subset of the data, in particular 1236 male single births where the baby lived at least 28 days. The researchers interviewed mothers early in their pregnancy to collect information on socioeconomic and demographic characteristics, including an indicator of whether the mother smoked during pregnancy. The variables in the dataset are described in the code book here. In this exercise, we will attempt to use this data to answer the following questions:

  1. Do mothers who smoke tend to give birth to babies with lower weights than mothers who do not smoke?

  2. What is a likely range for the difference in birth weights for smokers and non-smokers?

  3. Is there any evidence that the association between smoking and birth weight differs by mother’s race? If so, characterize those differences.

  4. Are there other interesting associations with birth weight that are worth mentioning?

(1) Load the data “smoking.csv”, which includes information on both biometrics of infants at birth, and information on mothers (variables prefixed with the letter “m”), from this MIDS repo. (Yup, I’m giving you CLEAN DATA! I think this is the only time I’ve done this in this course! Enjoy it. :)).

(2) Start by plotting the relationship between infant weight at birth and gestation (length of pregnancy (in days) at time of birth) for both mothers who smoke and those who do not. Limit attention to children who reach at least 225 days of gestation (there aren’t really any observations for parents who smoke for less than that, so we don’t get common support). Does it seem like birthweights tend to be lower for the children of parents who smoke at a given gestational period?

NOTE: You will likely have a problem because of the name of one of your columns. You should probably be able to guess the problem given your experience with Python.

Linear Regression


(3) Now check this relationship using statsmodels. Regress birthweight on gestational period and whether the infant’s mother smoked.

NOTE: if you didn’t hit a problem because of the name of one of your columns before, you likely will now. You should probably be able to guess the problem given your experience with Python.

(4) Now let’s expand our model to also take into account mothers’ pregnancy weight and race (make sure to treat race as a categorical variable! If you’re rusty on categoricals and indicator variables, here’s a little refresher.).

(5) Now let’s test for whether there is an interaction between the mother’s race and the effect of smoking.

Note that race is coded as follows:

mrace    mother’s race or ethnicity
         0-5= white
         6  = mexican
         7 = black
         8 = asian
         9 = mix
         99 = unknown

As most variation in this data is between “white” and other categories, first recode race to be an indicator for white and non-white for easier interpretation.

(6) Using post-regression test syntax (not by running a new regression on a subpopulation), recover the coefficient and t-statistic for whether smoking reduces birth weight for white mothers. How does this coefficient compare to that for non-white mothers?

(7) Now let’s use this model to predict some values. Let’s generate some hypothetical newborns:

newborns = pd.DataFrame({'smoke': [1, 1, 0, 0],
                         'white': [True, False, True, False],
                         'gestation': [253, 300, 248, 287],
                         'mpregwt': [132, 129, 140, 139]})

Using the model you ran above with gestation, smoke, mpregwt, white, and the interaction of white and smoke, predict birth weights for these newborns.

Note that if you have different data types from those in this hypothetical dataset, or different column names, you’ll get an error – just adjust the column names / data types to match the data you used to fit your model.

statsmodels versus R

A quick but important note: the tools that are made available in different packages is often a function of who uses those packages, and how they use them. By and large, nearly all statisticians use R, and so many stats tools (like automatic forward-model-selection or backwards-model-selection) have “convenience implementations” (single functions that do all the things you’d want to accomplish) in R, but aren’t available as convenience packages in statsmodels. That’s because statsmodels was mostly written by economists and social scientists who tend to feel model selection should be a function of theory not statistical performance (not taking sides: just reporting a difference that exists).

To be clear, you can still implement things like forward model selection yourself in Python – just write a loop that tries different regressors and plots the resulting AICs! – but it will often take a little more work. (Indeed, you can find examples of people writing these loops on the web).

This is one of the reasons that languages are sticky: once a group of people have invested in adding all the bells and whistles they like to a language, there are good reasons to not move to another language, even if the other language has some advantages. A statistician who likes the basic language organization of Python more than R, for example, may still stay with R because the packages already do everything a statistician wants to do, and so it’s not worth having to re-implement common tasks in a new language.

Logistic Regression

(8) Now, using statsmodels, evaluate the impact of smoking on the likelihood a child is born prematurely (where “premature” is defined as gestation of less than 252 days).

For obvious reasons, DO NOT USE OUR SUBSET FOR GESTATION GREATER THAN 225 DAYS from above for this section.

For this model, please include mother’s pre-pregnancy weight, smoking status, and whether the mother is White (if you use all the racial categories, the model won’t converge). Please don’t include the interaction we used in the last question.


In our merging exercises, we ran a difference-in-difference analysis on crime in California counties to see if, following drug legalization, there was a larger decline in violent crime in counties that had previously also had high drug arrest rates (to test the idea that violence was being generated by the drug trade, and that legalization would decrease this violence).

In those analyses, we treated counties as equally-weighted units of analysis. If we think that each county is a single “community”, and we think community is shaped at the level of communities (particularly when our community division is related to administrative boundaries that impact policing and government services, and is the case with counties), then this is reasonable. But one might think that crime is determined at the individual level, or maybe neighborhood level, and so big counties should “count more” in our analysis.

Here, let’s use weighted least squares to weight observations based on population. This will allow bigger counties to influence our estimates more.

(1) To begin, download our data on California arrest rates and population from github. You will see I’ve reshaped the data to long format for convenience.

(2) To run our difference-in-difference in a regression framework, we need both an indicator variable for observations that occur after legalization (i.e. year == 2018), and an indicator for the population we consider treated (those who had high drug arrest rates in 2009). Re-create those here.

Note: if you want, you can also use the 2009 drug arrest rates as a continuous variable. This is a kind of “generalized difference-in-difference”, in which we’re just doing the continuous analogue of the thing we do when we split the sample into “high” and “low” 2009 arrest rates.

(3) Now regression the violent arrest rate on your two indicators and their interaction. The coefficient on the interaction is your “difference-in-difference” estimate! If you go back to your old homeworks, you should find the coefficient is exactly what you calculated.

(4) Now let’s do the same analysis, but this time using county population as a weight in a Weighted Least Squares regression (smf.wls). Just pass the keyword argument weights a vector fo weights (in this case, arrests['total_population']). Note that the weights argument is outside of the formula, so you have to pass an actual vector of values, not just the name of a column.

Absolutely positively need the solutions?

Don’t use this link until you’ve really, really spent time struggling with your code! Doing so only results in you cheating yourself.