Robust regression (python)

Aug 2021 by Francisco Juretig

The objective of linear regression is to find the coefficients B1, B2, ..., Bk of the model Y = B1 * x1 + B2 * x2 + ... + Bk * xk + error. It is in essence an optimization problem: we look for the coefficients that minimise the sum of the squared residuals (the differences between the observed values and the predictions). This is called OLS (ordinary least squares). Under quite general conditions these coefficients exist, are unique, and we can obtain confidence intervals for them. However, there is a well-known problem that can affect the quality of the fit: the presence of outliers. These abnormal values can change our coefficients by a lot. This is not good, as it implies that our model really depends on a few cases that might very well be just exceptions. For this exercise we will use the statsmodels library in Python. By the way, you can find a nice intro to robust regression here (in R).
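To make the "minimise the sum of squared residuals" idea concrete, here is a minimal sketch using only NumPy. The data is synthetic and the true coefficients (1.5 and -2.0) are made up for illustration; none of this comes from the original project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration, not the post's dataset):
# Y = 1.5 * x1 - 2.0 * x2 + error
n = 200
X = rng.normal(size=(n, 2))
beta_true = np.array([1.5, -2.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# OLS picks the coefficients that minimise the sum of squared residuals.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals are observed values minus predictions.
residuals = y - X @ beta_hat
ssr = float(np.sum(residuals ** 2))

print("estimated coefficients:", beta_hat)
print("sum of squared residuals:", ssr)
```

Any other choice of coefficients, including the true ones, gives a sum of squared residuals at least as large as the one printed here. That is exactly why a handful of huge residuals can pull the fit around.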

In this case, we will run an example where we test several OLS models in Python using the OLS() function from statsmodels. We add different levels of contamination to the data (we just grab a few cases and replace them with abnormal values) and we test how well the OLS() and RLM() functions work. As we will see, even a few cases totally corrupt the OLS model, whereas the RLM model yields almost the same results as before. RLM() has essentially one main parameter, M, where we specify the kernel (the robust norm, for example Huber's T) that it uses.

We separated the project into two parts. Here on the right we have two panels. On the top one we define our dataset (the famous diabetes dataset from sklearn). Note that we are keeping just one feature to make things simpler. On the lower panel we fit the OLS model from statsmodels. Pretty simple, right? Pay attention to the x1 coefficient, which is estimated here at 949. Bear in mind there are no outliers here, nor any type of contamination.



On the left side, we have a top panel where we (again) load the diabetes dataset. We also add three outliers by putting some very big values there. Note that two panels are connected to it: on the left we have a standard regression, and on the right we have a robust regression model. Let's see what happens in the next panel.

Let's look at the standard regression. As we can see, the x1 coefficient is now 787, which is radically different from what we had before the contamination. It took just three observations to completely corrupt the model.

Let's look at the robust regression. Here we get an x1 coefficient estimated at 975, which is very close to the original OLS estimate we had (949). The outliers didn't cause the coefficient to move much.

Prefer a video?

Here we use RLM in Python with statsmodels.

You can also download the code, project and files from a GitHub repo.

https://github.com/fjuretig/amazing_data_science_projects.git