Machine Learning with Scikit-Learn

In these exercises, we’ll learn to fit and evaluate (in a basic way) machine learning models using the package scikit-learn.

The emphasis of these exercises is to help you get comfortable with the data wrangling component of machine learning so that in future courses you can focus on the theory underlying machine learning. With that in mind, we will be quite cavalier with model fitting and evaluation. As with our statsmodels exercises, in which we quickly ran through a few models to practice model implementation without thinking too much about model selection and specification, this is not how you should do your actual data science analyses!

Though this is true generally, it is doubly true in the context of these exercises: the application of machine learning to medicine.

In these exercises, we will use the birth-weight data we used for our statsmodels exercises to predict birth weights. As you will see, implementing a machine learning model to predict birth weights is surprisingly straightforward. But that ease is deceptive: while machine learning algorithms are easy to use, they’re hard to use well, and in contexts where they affect real people, poorly implemented machine learning models can literally kill people.

Lest you think I’m being hyperbolic, consider the case of a machine learning model used by medical providers across the US to make treatment decisions for millions of people. The goal of the model (distributed by a company called Optum) was to help providers figure out what patients were especially likely to face health problems down the road so they could provide these patients extra preventative care.

The problem, though, is that the way Optum did this was by training a supervised machine learning model to predict future health care use by patients. Patients the model predicted would consume more healthcare in the future, the model implicitly assumed, were those who should get extra care today.

But as was recently described in a paper in the journal Science, the problem is that the model had a large racial bias, and was less likely to recommend extra preventative care for Black patients.

The reason was that Black patients in the United States tend to use the medical system less than White patients for non-medical reasons (likely including skepticism towards the medical establishment due to the history of Black Americans being used as unknowing test subjects for medical studies, or the fact that Black Americans tend to have lower incomes and are less likely to be insured than White Americans).

But when this model saw that Black patients didn’t consume as much healthcare in the future, the algorithm interpreted that as evidence that Black patients were healthier (not poorer, or skeptical of the medical system). As a result, it became less likely to recommend future care for these patients.

Crucially, this occurred even though race wasn’t even a variable in the model. Machine learning models are very good at picking up subtle signals, and so even though patient race wasn’t an explicit factor in the model, the model was nevertheless able to differentiate Black and White patients. Though it’s not clear exactly how it did so, this can happen whenever variables are included in models that are correlated with race. For example, people’s zip codes (which identify where people live) are notorious for proxying for race in machine learning algorithms, since residential segregation in the US means that most people in a given zip code are of the same race.

As a result, this well-meaning machine learning algorithm led to Black patients receiving fewer preventative medical interventions than White patients, even after taking into account other (medically relevant) factors.

So: in this exercise we’ll play with predicting birth weights in infants. But do NOT think that just because it’s this easy to fit a model, it is appropriate to then go use these models in the real world in contexts where people’s lives are affected!

(1) Load the data “smoking.csv”, which includes information on both biometrics of infants at birth, and information on mothers (variables prefixed with the letter “m”), from this MIDS repo. We’ll be working with this data in this exercise.

Formatting Your Data

Unlike statsmodels, scikit-learn can’t work directly with a heterogeneous pandas DataFrame like ours, so the first step in nearly any machine learning workflow (if you haven’t already been given a nice giant numpy array) is to convert our DataFrame (which includes strings, categorical variables, integers, and floating point numbers) into a single large matrix that consists only of floating point numbers.

While you can do this by hand, this is most easily accomplished using the patsy library, which will take a pandas DataFrame and a special formula string and return numpy arrays for use in libraries like scikit-learn. (patsy is actually the library that implemented the formulas we used in statsmodels to specify our regression models – here we’re just using it on its own.)

Let’s assume that for most of these exercises, we want to predict birth weight (bwt.oz) using:

  • whether the mother is white, black, hispanic, or of another ethnicity (you have to code this from mrace – make sure you treat it as categorical!)

  • whether the mother smokes (smoke)

  • Mother’s age (mage)

  • Mother’s weight (mpregwt)

  • Mother’s height (mht)

For race, recall that in the raw data, mrace is coded as:

mrace    mother’s race or ethnicity
         0-5 = white
         6   = mexican
         7   = black
         8   = asian
         9   = mix
         99  = unknown

(We’re ignoring gestation because we don’t really know the value of gestation before the child is born, so we can’t use it to predict the birthweight of not-yet-born children!)
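The recoding of mrace described above could be sketched roughly as follows. This is only an illustration on a toy DataFrame: the column name race_recoded matches the formula used later in these exercises, but the helper recode_mrace and the choice to lump codes 8, 9, and 99 into "other" are assumptions, not the official solution.

```python
import pandas as pd

# Toy stand-in for the real data: only the raw mrace codes matter here.
toy = pd.DataFrame({"mrace": [0, 3, 5, 6, 7, 8, 9, 99]})


def recode_mrace(code):
    """Collapse raw mrace codes into four groups (this grouping is an assumption)."""
    if 0 <= code <= 5:
        return "white"
    if code == 6:
        return "hispanic"
    if code == 7:
        return "black"
    # 8 (asian), 9 (mix), and 99 (unknown) are lumped together in this sketch.
    return "other"


# Store the result as a pandas categorical so it is not mistaken for a number.
toy["race_recoded"] = toy["mrace"].apply(recode_mrace).astype("category")
print(toy)
```

You may prefer to treat 99 (unknown) as missing rather than as "other"; that is a modeling choice this sketch does not settle.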

(2) Begin by using patsy.dmatrices() to create two datasets (y, which is the birth weights, and X, which is a matrix with all your “features” in a nice numpy array).

(3) Look at your features matrix X. How many columns does it have? How does that compare to the number of variables you used as inputs? (if they’re the same, you probably did something wrong…). Can you explain the difference?

If not, read this (mostly on how this works in data) and/or this (on interpretation of indicator variables). This is one of the very nice things that patsy does for us!
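To see why the column count changes, here is a small sketch on made-up data (the column names weight, group, and age are hypothetical and unrelated to smoking.csv). It shows patsy.dmatrices() expanding a single categorical variable into several indicator columns and adding an intercept column.

```python
import numpy as np
import pandas as pd
import patsy

# Made-up data: one outcome, one three-level categorical, one numeric input.
toy = pd.DataFrame({
    "weight": [120.0, 115.0, 130.0, 108.0, 125.0, 118.0],
    "group": ["a", "b", "c", "a", "b", "c"],
    "age": [25, 31, 28, 22, 35, 30],
})

# C(...) tells patsy to treat the variable as categorical.
y, X = patsy.dmatrices("weight ~ C(group) + age", toy)

# Two inputs on the right-hand side became four columns:
# an intercept, two indicators for group (one level is the baseline), and age.
print(X.design_info.column_names)
print(np.asarray(X).shape)
print(np.asarray(y).shape)
```

Note that y comes back as a column (shape (6, 1) here) rather than a flat vector, which matters for a few scikit-learn models.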

Splitting Your Data

In machine learning, model selection is often accomplished by:

  1. Splitting your data into two parts (a training set and a test set),

  2. Training your model on the training set (i.e. setting the parameters of your model to best explain the training data),

  3. Testing the model by using the parameters generated during that training to predict values for the test data, then comparing those predicted values to the actual values in the test data.

So suppose we just wanted to use linear regression as our model. We’d randomly pick half the rows of our data, then regress birth weight on the various variables (“features” in machine learning terminology) we specified above. Then we’d use the coefficients from that regression to predict birth weights for the half of children we didn’t use in our estimation, and see how different those predictions are from actual birth weights. If we find a model that performs well on our testing data, then we assume / hope that that model will also work well on new data (i.e. on children who haven’t been born yet whose weight we want to predict).

(Readers from a statistics background will recognize this is a kind of “cross-validation”, though a very simple version.)

So the first step in machine learning is to split our sample! Thankfully this is easy to do with the train_test_split function. So import it with from sklearn.model_selection import train_test_split, and split your data. To give you a sense of how it works, this is a common syntax:

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, y,

Here X_train holds your training features, Y_train your training birth weights, X_test your test features, and Y_test your test birth weights. The random_state argument just ensures that you can re-create this split if you have to re-run your code (helpful for debugging).
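As a hedged illustration of the call described above, here is a complete version on synthetic arrays. The values test_size=0.5 (to match the "half the rows" description earlier) and random_state=42 are assumptions chosen for this sketch, not necessarily the settings intended for the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and outcome column.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=(100, 1))

# Assumed settings: a 50/50 split and an arbitrary fixed seed.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

# Re-running with the same random_state reproduces the same split.
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.5, random_state=42)
print(np.array_equal(X_train, X_train_again))
```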

(4) So start by splitting YOUR data.

Training your Model

And now it’s time to train our model!

scikit-learn is much loved because it has, like… every model ever already built in, and it provides a common interface (API) for all of them. Seriously – go check out all the supervised machine learning models that come with scikit-learn here.

Moreover, unlike many open-source projects, all of its models are really well documented, so you can read all about them! And check out this nifty guide to choosing an appropriate model.

For this exercise, let’s start by fitting a LinearRegression model.

Wait, you say: isn’t that what we did in statsmodels? Yes!

Data Science is a very fragmented little world, and some stuff gets recapitulated in slightly different ways in many different places, so it’s common to see different presentations of the same thing as you move from the world of statisticians to the world of computer scientists (i.e. machine learning).

(5) Import the Linear Regression model and instantiate it with code like:

from sklearn.linear_model import LinearRegression
my_model = LinearRegression()

(6) Now fit your model against your training data (X_train and Y_train). (If you’re unsure how to do this, read the docs for the model and look at the examples at the bottom!)

Note: In statsmodels, the .fit() method returned a new fitted model. In sklearn, by contrast, .fit() modifies (mutates) the model in place.
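The in-place behavior mentioned in the note can be seen in a tiny sketch with made-up numbers (the data here follow y = 2x + 1 exactly, purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follow y = 2x + 1 exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X[:, 0] + 1.0

my_model = LinearRegression()
print(hasattr(my_model, "coef_"))  # fitted attributes do not exist before .fit()

# .fit() stores the estimates on my_model itself and hands back that same object.
returned = my_model.fit(X, y)
print(returned is my_model)
print(my_model.coef_, my_model.intercept_)
```

So there is no need to capture the return value of .fit(): after the call, my_model itself carries the fitted parameters.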

Machine learning, more than absolutely anything else, is concerned with predicting values, and that’s evident in what functionality is exposed by this linear model. As you may recall, in statsmodels, you could type .summary() and get a full printout of various diagnostics, all your coefficients, estimates of confidence intervals for each coefficient, etc. By contrast, LinearRegression from sklearn has no summary method. Indeed, about the only output you get for what the model has actually fit is my_model.coef_ (along with my_model.intercept_), which looks like:

> my_model.fit(X_train, Y_train)
> my_model.coef_

array([[ 0.00000000e+00,  4.88967789e+00, -9.57549359e+00,
        -7.78152768e+00, -8.15196740e+00,  8.70134871e-03,
         1.31058392e-01,  1.01948361e+00]])

Which I think we can all agree is not nearly as informative a print-out!
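One detail in the printout above is the leading zero. A likely explanation (an assumption about that particular output, not something the printout itself confirms) is that patsy adds an Intercept column of ones by default, while LinearRegression also fits its own intercept, so the constant column ends up with a coefficient of zero. A toy sketch with made-up numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data following y = 3 + 2x exactly.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 + 2.0 * x

# Column 0 is a constant column of ones, mimicking patsy's "Intercept" column.
X = np.column_stack([np.ones_like(x), x])

model = LinearRegression().fit(X, y)

# The constant is absorbed by intercept_, leaving ~0 on the ones column.
print(model.coef_)
print(model.intercept_)
```

If that is what happened, the model's actual constant term lives in my_model.intercept_ rather than in the first entry of my_model.coef_.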

To be clear, you can recover many of the diagnostics for LinearRegression by digging around in other corners of sklearn, but what is made available speaks to the priorities of different users: sklearn is for making predictions; statsmodels is for statistics and understanding mechanisms (i.e. seeing if the coefficient on smoking is significant).

(7) OK, but we’re in the world of sklearn, so let’s do some prediction! Now that you’ve fit your model, use the predict method on your test data to create a set of predictions.

Evaluating your Model

So we now have a trained model that we can use to predict birthweights. Yay! But is it any good?

All sklearn models have a method called score you can use to get the most basic evaluation of your model. The syntax is just:

my_model.score(X_test, Y_test)

If you’re doing a classification model (something that tries to guess the category for each observation, like a model that evaluates a set of pictures and tries to figure out if the pictures are of cats, dogs, or humans), score will return an “accuracy” score (the fraction of observations you properly classified). For a regression model (trying to guess a continuous variable) it will give an R-squared score.

As you get more sophisticated, you will discover these basic scores are often inadequate for evaluating models, and you can turn to other evaluation functions found in sklearn.metrics. But for now we’ll just use the default score output of R-squared.
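As a rough sketch of how score relates to sklearn.metrics, the example below fits a regression on synthetic data and checks that .score() matches r2_score computed from the predictions. The data and the choice of mean_absolute_error as a second metric are illustrative assumptions only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: a noisy linear relationship.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

my_model = LinearRegression().fit(X_train, y_train)
my_predictions = my_model.predict(X_test)

# For regressors, .score() is R-squared on the data you pass in.
score = my_model.score(X_test, y_test)
r2 = r2_score(y_test, my_predictions)
print(score, r2)

# An alternative metric, in the same units as the outcome.
mae = mean_absolute_error(y_test, my_predictions)
print(mae)
```

Note the argument order in the metric functions: true values first, predictions second.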

(8) What is the score of your model?

Machine Learning Workflow Summary

Congratulations! You just fit your machine learning algorithm! And you also learned that sometimes what constitutes “machine learning” is in the eye of the beholder, given that what you did today is the same thing you did in our last class without calling it machine learning. :)

But hopefully that’s given you a general sense for the work-flow of scikit-learn:

  1. Prep your data:

import patsy
y, X = patsy.dmatrices('bwt_oz ~ C(race_recoded) + smoke + mage + mpregwt + mht', smoking_and_bw)

  2. Split your data:

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, y,

  3. Import and fit a model:

from sklearn.linear_model import LinearRegression
my_model = LinearRegression()
my_model.fit(X_train, Y_train)

  4. Evaluate your model:

my_model.score(X_test, Y_test)

  5. Use your model to make predictions:

my_predictions = my_model.predict(X_test)

Comparing Models

Now that we have a baseline estimate for the performance of LinearRegression for this set of features and outputs, let’s try a different model and see how it compares!

(9) Now repeat your analysis using a Support Vector Regression (from sklearn.svm import SVR). How does the model perform? Is it better or worse than LinearRegression?

(10) One choice parameter for SVRs is the kernel it uses for weighting (again, this isn’t a class on machine learning, so don’t worry too much about what this means – just know that it’s a parameter of the model). Check the SVR documentation to figure out how to set the kernel to linear and see how it performs now.
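A minimal sketch of swapping models through the common interface, on synthetic data rather than the birth-weight data. The data-generating numbers are made up; the one real-world detail worth flagging is that SVR expects a flat (1-D) outcome, so a column-shaped y like the one patsy returns is flattened here with np.ravel.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic data with a column-shaped outcome, like patsy's y.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1
)

# Same fit/score calls for every model; only the constructor changes.
candidates = {
    "linear regression": LinearRegression(),
    "SVR (default rbf kernel)": SVR(),
    "SVR (linear kernel)": SVR(kernel="linear"),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, np.ravel(y_train))  # flatten (n, 1) to (n,)
    scores[name] = model.score(X_test, np.ravel(y_test))
    print(f"{name}: {scores[name]:.3f}")
```

How the three compare on the real data is for you to find out; nothing about this toy example says which will win there.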

(11) Now pick whatever regression model you’d like and see how it performs (some suggestions). Play with your model specifications and see how well you can do with your new model or one of the ones we used above.

Want More Practice?

Try replicating our attempts to predict whether infants would be born premature from the statsmodels exercises in scikit-learn. Start with a LogisticRegression, then try some different “classification models” for comparison!
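To give a sense of what the classification version looks like, here is a hedged sketch on a synthetic binary outcome (not the prematurity variable from the real data). It also confirms that, for a classifier, score is simply the share of correct predictions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary outcome driven by two made-up features plus noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=2
)

clf = LogisticRegression()
clf.fit(X_train, y_train)

# For classifiers, .score() is accuracy: the fraction classified correctly.
accuracy = clf.score(X_test, y_test)
manual = np.mean(clf.predict(X_test) == y_test)
print(accuracy, manual)
```

Swapping in another classifier follows the same pattern: replace the constructor and keep the fit, predict, and score calls.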

Absolutely positively need the solutions?

Don’t use this link until you’ve really, really spent time struggling with your code! Doing so only results in you cheating yourself.