Machine Learning with Scikit-Learn

Gradescope Autograding

Please follow all standard guidance for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called results and ensuring your notebook runs from start to finish without any errors.

For this assignment, please name your file exercise_sklearn.ipynb before uploading.

You can check that you have answers for all questions in your results dictionary with this code:

assert set(results.keys()) == {
    "ex3_num_columns",
    "ex6_model_smoking",
    "ex7_first_prediction",
    "ex8_score",
    "ex9_SVR_score",
    "ex10_SVR_linear_score",
}

Submission Limits

Please remember that you are only allowed three submissions to the autograder. Your last submission (if you submit 3 or fewer times), or your third submission (if you submit more than 3 times), will determine your grade. Submissions that error out will not count against this total.

Machine Learning in Medicine

In these exercises, we’ll learn to fit and evaluate (in a basic way) machine learning models using the package scikit-learn.

The emphasis of these exercises is to help you get comfortable with the data-wrangling component of machine learning so that in future courses you can focus on the theory underlying machine learning. With that in mind, we will be quite cavalier with model fitting and evaluation. As with our statsmodels exercises—in which we quickly ran through a few models to practice model implementation without thinking too much about model selection and specification—this is not how you should do your actual data science analyses!

Though this is true generally, it is doubly true in the context of these exercises: the application of machine learning to medicine.

In these exercises, we will use the birth-weight data we used for our statsmodels exercises to predict birth weights. As you will see, implementing a machine learning model to predict birth weights is actually surprisingly straightforward. But that ease is deceptive; while machine learning algorithms are easy to use, they’re hard to use well, and if you get them wrong in contexts where they impact real people, poorly implemented machine learning models can literally kill people.

Lest you think I’m being hyperbolic, consider the case of a machine learning model used by medical providers across the US to make treatment decisions for millions of people. The goal of the model (distributed by a company called Optum) was to help providers figure out what patients were especially likely to face health problems down the road, so they could provide these patients extra preventative care.

The problem, though, is that the way Optum did this was by training a supervised machine learning model to predict future healthcare use by patients. Patients the model predicted would consume more healthcare in the future, the model implicitly assumed, were those who should get extra care today.

But as was recently described in a paper in the journal Science, the problem is that the model had a large racial bias, and was less likely to recommend extra preventative care for Black patients.

The reason was that Black patients in the United States tend to use the medical system less than White patients even when the Black patients have the same level of medical need as White patients. Why? The paper doesn’t explore that fully, but the causes likely include skepticism towards the medical establishment due to the history of Black Americans being used as unknowing test subjects for medical studies, or the fact that Black Americans tend to have lower incomes and are less likely to be insured than White Americans.

So when this model saw that Black patients didn’t consume as much healthcare in the future, the algorithm interpreted that as evidence that Black patients didn’t need healthcare in the future. This resulted in the model falsely telling doctors that Black patients were systematically less in need of preventative care.

Crucially, this occurred even though race wasn’t even a variable in the model. Machine learning models are very good at picking up subtle signals, and so even though patient race wasn’t an explicit factor in the model, the model was nevertheless able to differentiate Black and White patients. Though it’s not clear exactly how it did so, this can happen whenever variables are included in models that are correlated with race. For example, people’s zip codes (which identify where people live) are notorious for proxying for race in machine learning algorithms since residential segregation in the US means that most people in a given zip code are of the same race.

As a result, this well-meaning machine learning algorithm caused Black patients to receive fewer preventative medical interventions than White patients, even after taking into account other (medically relevant) factors.

So: in this exercise, we’ll play with predicting birth weights in infants. But do NOT think that just because it’s this easy to fit a model, it is appropriate to then go use these in the real world in contexts where people’s lives are affected!

Exercises

(1) Load the data “smoking.csv”, which includes information on both biometrics of infants at birth, and information on mothers (variables prefixed with the letter “m”), from this MIDS repo. We’ll be working with this data in this exercise.
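For example, a minimal sketch of loading the data, assuming you substitute the real path or raw-file URL of smoking.csv (the path below is a placeholder):

import pandas as pd

# Placeholder path: point this at your local copy of smoking.csv,
# or at the raw file URL from the MIDS repo.
smoking_and_bw = pd.read_csv("smoking.csv")
smoking_and_bw.head()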

Formatting Your Data

Unlike statsmodels, scikit-learn can’t work directly with heterogeneous pandas DataFrames, so the first step in nearly any machine learning workflow (if you haven’t already been given a nice giant numpy array) is to convert our heterogeneous pandas DataFrame (which includes strings, categorical variables, integers, and floating point numbers) into a single large matrix that consists only of floating point numbers.

While you can do this by hand, it is most easily accomplished using the patsy library, which will take a pandas DataFrame and a special formula string and return numpy arrays for use in libraries like scikit-learn. (patsy is actually the library that implements the formulas we used in statsmodels to specify our regression models—here we’re just using it on its own.)

Let’s assume that for most of these exercises, we want to predict birth weight (bwt.oz) using:

  • whether the mother is (a) White, (b) Black, (c) Hispanic or (d) of another ethnicity (in other words, please recode mrace into four categories)

  • whether the mother smokes (smoke)

  • Mother’s age (mage)

  • Mother’s weight (mpregwt)

  • Mother’s height (mht)

We won’t use any interaction terms in this exercise, just these entered on their own.

For race, recall that in the raw data, mrace is coded as:

mrace    mother’s race or ethnicity
         0-5= white
         6  = mexican
         7 = black
         8 = asian
         9 = mix
         99 = unknown

Re-group mrace to be White(0-5), Hispanic (6), Black(7), other (8, 9, 99)
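For example, here is a minimal sketch of one way to do that recoding, assuming your DataFrame is named smoking_and_bw (the name used in the workflow summary below):

# Collapse the raw mrace codes into the four categories above.
def recode_race(mrace):
    if mrace <= 5:
        return "white"
    elif mrace == 6:
        return "hispanic"
    elif mrace == 7:
        return "black"
    else:
        return "other"

smoking_and_bw["race_recoded"] = smoking_and_bw["mrace"].apply(recode_race)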

(We’re ignoring gestation because we don’t really know the value of gestation before the child is born, so we can’t use it to predict the birthweight of not-yet-born children!)

(2) Begin by using patsy.dmatrices() to create two datasets (y, which is the birth weights, and X, which is a matrix with all your “features” in a nice numpy array).

Note: You may hit the same issue using patsy here that you encountered when writing formulas in the statsmodels exercise.
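One likely culprit: column names like bwt.oz contain periods, which patsy’s formula parser can’t handle as plain variable names. A minimal sketch of one workaround (renaming the columns before building the matrices):

import patsy

# Replace periods in column names (e.g., bwt.oz becomes bwt_oz) so
# they can be used directly in a patsy formula.
smoking_and_bw.columns = [c.replace(".", "_") for c in smoking_and_bw.columns]

y, X = patsy.dmatrices(
    "bwt_oz ~ C(race_recoded) + smoke + mage + mpregwt + mht",
    smoking_and_bw,
)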

(3) Look at your features matrix X. How many columns does it have? Store your answer as ex3_num_columns. How does that compare to the number of variables you used as inputs? (If they’re the same, you probably did something wrong…)

Explain all the things patsy has done with your data (hint: it’s made TWO changes affecting the number of columns in your data).
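If you want to count the columns programmatically, a one-line sketch:

# The design matrix's shape is (number of rows, number of columns).
ex3_num_columns = X.shape[1]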

Splitting Your Data

In machine learning, model selection is often accomplished by:

  1. Splitting your data into two parts (a training set and a test set),

  2. Training your model on the training set (i.e. setting the parameters of your model to best explain the training data).

  3. Testing the model by using the parameters generated during that training to predict values for the testing data, then comparing the predicted values for the testing data to the actual values in the test data. The smaller the difference between the predicted values and the true values in the testing data, the better the model is said to perform.

Suppose we just wanted to use linear regression as our model. First, we’d randomly pick 80% of our data, then regress birth weight on the various variables (“features” in machine learning terminology) we specified above. Then we’d use the coefficients from that regression to predict birth weights for the 20% of children we didn’t use in our estimation, and see how different those predictions are from the actual birth weights. If we find a model that performs well on our testing data, then we assume (hope) that that model will also work well on new data (i.e. on children who haven’t been born yet whose weight we want to predict).

(Readers from a statistics background will recognize this as being a little analogous to “cross-validation.”)

So the first step in machine learning is to split our sample! Thankfully this is easy to do with the sklearn train_test_split function. Import the function with from sklearn.model_selection import train_test_split, and split your data. To give you a sense of how it works, this is a common syntax:

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    train_size=0.8,
                                                    random_state=42)

Here, X_train is your training features, Y_train is your training birth weights, X_test is your test features, and Y_test is your test birth weights. The random_state argument just ensures that you can re-create this split if you have to re-run your code (helpful for debugging and for our autograder).

(4) Start by splitting YOUR data, putting 80% into the training data. Set your random state with the seed 42 so your random split is the same as will be used by the autograder.

Training your Model

Now it’s time to train our model!

scikit-learn is much loved because it has, like… every model ever already built in, and it provides a common interface (API) for all of them. Seriously—go check out all the supervised machine learning models that come with scikit-learn here.

Moreover, unlike many open-source projects, all of its models are really well documented, so you can read all about them. And check out this nifty guide to choosing an appropriate model.

For this exercise, let’s start by fitting a LinearRegression model.

Wait, you say: isn’t that what we did in statsmodels? Yes!

Data Science is a very fragmented discipline, and some stuff gets recapitulated in slightly different ways in different places, so it’s common to see different implementations of the same thing as you move from the world of statisticians (statsmodels) to the world of computer scientists (i.e. machine learning and scikit-learn).

(5) Import the Linear Regression model and instantiate it with code like:

from sklearn.linear_model import LinearRegression
my_model = LinearRegression(fit_intercept=False)

The fit_intercept=False tells scikit-learn you don’t want it to fit an intercept for your model behind the scenes. We use False because patsy included an intercept in our design matrix (a column of 1s), and we want to see the coefficient on that column in our coefficients.

(6) Now fit your model on your training data (X_train and Y_train) and take a look at the coefficients.

Machine learning, more than absolutely anything else, is concerned with predicting values, and that’s evident in what functionality is exposed by this implementation of linear regressions. As you may recall, in statsmodels, you could type .summary() and get something that looked like this:

[Figure: statsmodels regression output from .summary()]

A full printout of various diagnostics, all your coefficients, estimates of confidence intervals for each coefficient, etc.

By contrast, LinearRegression from sklearn has no summary method. Indeed, the only output you really get for what the model has actually fit is my_model.coef_, which will look something like this (these are not the real values or the number of coefficients you should get):

> my_model.fit(X_train, Y_train)
> my_model.coef_

array([[ 7.89347789e+00, -12.3442359e+00,
        -8.78152768e+00, 3.215196740e+00]])

Which I think we can all agree is not nearly as informative a print-out!

To be clear, you can recover many of the diagnostics for LinearRegression by digging around in other corners of sklearn, but what is made available speaks to the priorities of different users: sklearn is for making predictions; statsmodels is for statistics and understanding mechanisms (i.e. seeing if the coefficient on smoking is significant).

Store your model’s coefficient for smoking in ex6_model_smoking. (Note how much trickier this is here than with statsmodels! sklearn doesn’t think anyone would ever care about individual coefficients.)
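A sketch of one way to dig that coefficient out, assuming X is the design matrix from patsy (which records column names in its design_info attribute):

# Find which column of X corresponds to smoke, then pull the
# matching entry out of the fitted coefficient array.
smoke_index = X.design_info.column_names.index("smoke")
ex6_model_smoking = my_model.coef_[0, smoke_index]

(Note that coef_ is two-dimensional here because patsy returned y as a column vector; if your y is one-dimensional, you’d index my_model.coef_[smoke_index] instead.)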

Note: In statsmodels, the .fit() method returned a new fitted model. In sklearn, by contrast, .fit modifies (mutates) the model in place.

(7) OK, but we’re in the world of sklearn, so let’s do some prediction! Now that you’ve fit your model, use the predict method on your data to create a set of predictions. Store the first (index 0) predicted birthweight in ex7_first_prediction.
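A sketch, assuming you’re predicting for your test set:

# predict returns one predicted birth weight per row of X_test.
predictions = my_model.predict(X_test)
ex7_first_prediction = predictions[0]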

Evaluating your Model

So we now have a trained model that we can use to predict birthweights. Yay! But is it any good?

All sklearn models have a method called score you can use to get the most basic evaluation of your model. The syntax is just:

my_model.score(X_test, Y_test)

If you’re doing a classification model (something that tries to guess the category of each observation, like a model that evaluates a set of pictures and tries to figure out whether the pictures are of cats, dogs, or humans), score will return an “accuracy” score (the percentage of observations you properly classified). For a regression model (trying to guess a continuous variable), it will give an R-squared score.

As you get more sophisticated, you will discover these basic scores are often inadequate for evaluating models, and you can turn to other evaluation functions found in sklearn.metrics. But for now we’ll just use the default score output of R-squared.
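To give you a taste, here is a sketch of one such alternative metric (mean absolute error), assuming the model and split from above:

from sklearn.metrics import mean_absolute_error

# The average absolute difference (in ounces) between predicted
# and actual birth weights in the test set.
mean_absolute_error(Y_test, my_model.predict(X_test))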

(8) What is the score of your model? Store the result as ex8_score.

Machine Learning Workflow Summary

Congratulations! You just fit your first machine learning algorithm! And you also learned that sometimes what constitutes “machine learning” is in the eye of the beholder, given that what you did today is the same thing you did in our last class without calling it machine learning. :)

But hopefully that’s given you a general sense of the workflow of scikit-learn:

  1. Prep your data:

import patsy
y, X = patsy.dmatrices('bwt_oz ~ C(race_recoded) + smoke + mage + mpregwt + mht', smoking_and_bw)

  2. Split your data:

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    train_size=0.8,
                                                    random_state=42)

  3. Import and fit a model:

from sklearn.linear_model import LinearRegression
my_model = LinearRegression(fit_intercept=False)
my_model.fit(X_train, Y_train)

  4. Evaluate your model:

my_model.score(X_test, Y_test)

  5. Use your model to make predictions:

my_predictions = my_model.predict(X_test)

Comparing Models

Now that we have a baseline estimate for the performance of LinearRegression for this set of features and outputs, let’s try a different model and see how it compares!

(9) Now repeat your analysis using a Support Vector Regression (from sklearn.svm import SVR).

How does the model perform? Is it better or worse than LinearRegression?

Note: You’ll probably get a DataConversionWarning: A column-vector y was passed when a 1d array was expected. This is related to this common gotcha. Make sure you understand the problem / are able to resolve the warning.
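A minimal sketch of one way to resolve the warning, assuming the variable names from the split above:

import numpy as np
from sklearn.svm import SVR

svr_model = SVR()
# patsy returns y as a column vector with shape (n, 1); SVR wants a
# flat array with shape (n,), so flatten it before fitting.
svr_model.fit(X_train, np.ravel(Y_train))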

Why didn’t you get this with LinearRegression? That I can’t say! They must each be implemented slightly differently under the hood.

ROUND the score of this fitted model to 2 decimal places and store it as "ex9_SVR_score". (SVR is not deterministic, and it doesn’t seem to respect our random seeds, so values beyond this may differ a little across computers.)

(10) One parameter you can choose for SVRs is the kernel the model uses for weighting (again, this isn’t a class on machine learning, so don’t worry too much about what this means—just know that it’s a parameter of the model). Check the SVR documentation to figure out how to set the kernel to linear and see how it performs now.

ROUND the score of the SVR with linear kernel to 2 decimal places and store it as "ex10_SVR_linear_score".
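A sketch, again assuming the names from above:

# Same workflow as before, but with the kernel parameter set to "linear".
svr_linear = SVR(kernel="linear")
svr_linear.fit(X_train, np.ravel(Y_train))
ex10_SVR_linear_score = round(svr_linear.score(X_test, np.ravel(Y_test)), 2)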

(11) Now pick whatever regression model you’d like and see how it performs (some suggestions). Play with your model specifications and see how well your new model can do compared to the ones we used above.

Want More Practice?

Try replicating in scikit-learn our attempts from the statsmodels exercises to predict whether infants would be born premature. Start with a LogisticRegression, then try some different “classification models” for comparison!
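As a starting point, here is a sketch assuming you’ve already built a design matrix with a binary premature / not-premature outcome (the variable names below are placeholders for whatever you call your new split):

from sklearn.linear_model import LogisticRegression
import numpy as np

# For a classifier, score returns accuracy rather than R-squared.
clf = LogisticRegression()
clf.fit(X_train, np.ravel(Y_train))
clf.score(X_test, np.ravel(Y_test))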