Coding Your Own Linear Regression Model
One task that you will almost certainly be required to do in other data science courses (especially if you are a MIDS student) is to write up some of your statistical / machine learning models from scratch. This can be a very valuable exercise, as it ensures that you understand what is actually going on behind the scenes of the models you use every day, and that you don’t just think of them as “black boxes”.
To get a little practice doing this, today you will be coding up your own linear regression model!
(If you are using this site but aren’t actually in this class, you are welcome to skip this exercise if you’d like – this is more about practicing Python in anticipation of the requirements of other courses than developing your applied data science skills.)
There are, broadly speaking, two approaches you can take to coding up your own model:
1. you can write the model by defining a new function, or
2. you can write the model by defining a new class with associated methods (making a model that works the way a model works in scikit-learn).
Whether you do 1 or 2 is very much a matter of choice and style. Approach one, for example, is more consistent with what is called a functional style of programming, while approach two is more consistent with an object-oriented style of programming. Python can readily support both approaches, so either would work fine.
In these exercises, however, I will ask you to use approach number 2 as this tends to be the more difficult approach, and so practicing approach 2 will be extra useful in preparing you for other classes (HA! Pun…). In particular, our goal is to implement a linear regression model that has the same “initialize-fit-predict-score” API (application programming interface – a fancy name for the methods a class supports) as scikit-learn models. That means your model should be able to do all of the following:
Initialize a new model.
Fit a linear model when given a numpy vector (y) and a numpy matrix (X), using the same calling convention as scikit-learn's fit method.
Predict values when given a new numpy matrix (X_test), using the same calling convention as scikit-learn's predict method.
Return the model coefficients through the property my_model.coefficients (not quite what is used in sklearn, but let’s use that interface).
Also, bear in mind that throughout these exercises, we’ll be working in numpy instead of pandas, just as we do in scikit-learn. So assume that before using your model, your user has already converted their data from pandas to numpy.
(1) Define a new class called MyLinearModel with methods for fit and predict, and an attribute for coefficients. For now, we don’t need any initialization arguments, just an empty initializer.
As you get your code outline going, start by just having each method pass:

def my_method(self):
    pass
This will allow your methods to run without errors (they just don’t do anything). Then we can double back to each method to get them working one by one.
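As one possible starting point, the outline might look like the sketch below. The `fit(X, y)` and `predict(X_test)` signatures are assumptions borrowed from the scikit-learn convention, and storing `None` in `coefficients` before fitting is just one reasonable choice.

```python
class MyLinearModel:
    """Skeleton only: every method is a placeholder for now."""

    def __init__(self):
        # Filled in later by fit(); None signals "not fitted yet".
        self.coefficients = None

    def fit(self, X, y):
        pass

    def predict(self, X_test):
        pass


my_model = MyLinearModel()
my_model.fit(None, None)  # runs without error, but does nothing yet
```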
(2) Now define your
fit method. This is the method that should actually run your linear regression.
Note that once you have written the code to do a linear regression, you’ll need to put your outputs (your coefficients) somewhere. I recommend making an attribute for your class where you can store your coefficients.
(As a reminder: the normal multiply operator (*) in numpy performs element-wise multiplication. Use @ for matrix multiplication.)
HINT: Remember that linear regressions usually include a constant term. As in most packages, you should assume that users want this included, even if there isn’t a vector of 1s in their X matrix.
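If you get stuck, here is a minimal sketch of one way `fit` could work. It assumes the constant goes first in the coefficient vector, and it uses `np.linalg.lstsq` rather than explicitly inverting X'X. Both are choices made for illustration, not requirements of the exercise.

```python
import numpy as np


class MyLinearModel:
    def __init__(self):
        self.coefficients = None

    def fit(self, X, y):
        # Prepend a column of 1s so the first coefficient is the constant term.
        X_design = np.column_stack([np.ones(X.shape[0]), X])
        # Least-squares solution of X_design @ beta = y.
        beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
        self.coefficients = beta
        return self


# Tiny noiseless example: y = 1 + 2 * x
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])
my_model = MyLinearModel().fit(X, y)
print(my_model.coefficients)  # approximately [1, 2]
```

Returning `self` from `fit` mirrors scikit-learn and allows the chained call shown above.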
(3) As you write code, it is good to test your code as you work. With that in mind, let’s create some toy data. First, create a 100 x 2 matrix where each column is normally distributed. Then create a vector y that is a linear combination of those two columns plus a vector of normally distributed noise.
In other words, we want to create data where we know exactly what coefficients we should see so when we test our regression, we know if the results are accurate!
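A sketch of how such data could be generated is below. The specific coefficients, the noise scale, the seed, and the added intercept are all arbitrary values picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seeded so results are reproducible

n = 100
X = rng.normal(size=(n, 2))            # 100 x 2, each column standard normal

true_intercept = 1.0                   # hypothetical "true" values
true_betas = np.array([2.0, -3.0])
noise = rng.normal(scale=0.5, size=n)  # normally distributed noise

y = true_intercept + X @ true_betas + noise
```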
(4) Now test whether your fit method generates the correct coefficients.
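Because the toy data includes noise, the estimates will not match the true coefficients exactly, so a tolerance-based comparison is more useful than `==`. The helper below is a hypothetical convenience function, and the `estimated` array is a hand-typed stand-in for what `my_model.coefficients` might contain.

```python
import numpy as np


def coefficients_match(estimated, expected, atol=0.2):
    """True if every estimated coefficient is within atol of its expected value."""
    return bool(np.allclose(estimated, expected, rtol=0, atol=atol))


# Stand-in for my_model.coefficients from a fitted model (made-up numbers).
estimated = np.array([1.03, 1.96, -3.05])
expected = np.array([1.0, 2.0, -3.0])

print(coefficients_match(estimated, expected))  # True
```

A sensible tolerance depends on the noise level and sample size; a few standard errors is a common rule of thumb.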
(5) Now let’s make the statisticians proud, and in addition to storing the coefficients, let’s store the standard errors for our estimated coefficients as another attribute.
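For reference, here is a sketch of the classical (homoskedastic) standard error calculation, written as a standalone function so it can be adapted into a class. The function name and arguments are hypothetical. It assumes the design matrix already contains the column of 1s.

```python
import numpy as np


def ols_standard_errors(X_design, y, beta):
    """Classical OLS standard errors for coefficients beta."""
    n, k = X_design.shape
    residuals = y - X_design @ beta
    # Unbiased estimate of the error variance: RSS / (n - k).
    sigma2 = residuals @ residuals / (n - k)
    # Var(beta_hat) = sigma^2 * (X'X)^{-1}; take sqrt of the diagonal.
    cov = sigma2 * np.linalg.inv(X_design.T @ X_design)
    return np.sqrt(np.diag(cov))


# Small worked example: constant plus one regressor.
X_design = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([0.0, 1.0, 2.0, 4.0])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
se = ols_standard_errors(X_design, y, beta)
print(se)
```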
(6) Now let’s also add an R-squared attribute to the model.
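One common definition is R-squared = 1 - SS_res / SS_tot, sketched below as a hypothetical helper. This is the centered version, which is the usual choice when the model includes a constant term.

```python
import numpy as np


def r_squared(y, y_hat):
    """Centered R-squared: share of variance in y explained by y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
    return 1.0 - ss_res / ss_tot


y = np.array([0.0, 1.0, 2.0, 4.0])
y_hat = np.array([-0.2, 1.1, 2.4, 3.7])  # fitted values from some model
r2 = r_squared(y, y_hat)
print(r2)
```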
(7) Now we’ll go ahead and cheat a little. Use statsmodels to see if your standard errors and R-squared are correct!
(8) Now implement predict! Then test it against your original X data – do you get back something very close to your true y?
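The core of prediction is a single matrix product, sketched here as a standalone function rather than a method. It assumes the coefficient vector is ordered with the constant first, which is a convention chosen for this illustration.

```python
import numpy as np


def predict(X_test, coefficients):
    """Predicted values, assuming coefficients = [constant, slope_1, ..., slope_k]."""
    # Add the same column of 1s that was used when fitting.
    X_design = np.column_stack([np.ones(X_test.shape[0]), X_test])
    return X_design @ coefficients


coefficients = np.array([1.0, 2.0, -3.0])
X_test = np.array([[0, 0], [1, 1], [2, 0]])
y_pred = predict(X_test, coefficients)
print(y_pred)  # [1. 0. 5.]
```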
(9) Finally, create the option of fitting the model with or without a constant term. As in scikit-learn, make this an option you set during initialization.
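One way this might look is sketched below. The argument name `fit_intercept` is borrowed from scikit-learn's `LinearRegression`; the private helper `_design_matrix` is a hypothetical name introduced here so that `fit` and `predict` treat the constant consistently.

```python
import numpy as np


class MyLinearModel:
    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept
        self.coefficients = None

    def _design_matrix(self, X):
        # Prepend a column of 1s only when a constant term is requested.
        if self.fit_intercept:
            return np.column_stack([np.ones(X.shape[0]), X])
        return X

    def fit(self, X, y):
        beta, *_ = np.linalg.lstsq(self._design_matrix(X), y, rcond=None)
        self.coefficients = beta
        return self

    def predict(self, X_test):
        return self._design_matrix(X_test) @ self.coefficients


X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])  # y = 2 * x, no constant

with_const = MyLinearModel().fit(X, y)
no_const = MyLinearModel(fit_intercept=False).fit(X, y)
print(with_const.coefficients)  # approximately [0, 2]
print(no_const.coefficients)    # approximately [2]
```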