Different outputs for linear regression coefficients

Hi,

I get two different outputs for the linear regression coefficients with the same code, depending on whether I run it on this server or on my local server.

From this server…

From my local server…

As a matter of fact, I get the same coefficients as with Ridge regression.

Other things being equal (unless I’m mistaken), do you have any idea why I get these two different coefficient outputs?

As mentioned in the solution (you have to click on “Look at the solution:” below the Jupyter notebook to display it):

It indeed means that we are trying to solve a mathematically ill-posed problem: finding the coefficients of a linear regression involves inverting the matrix np.dot(data.T, data), which is not possible here (or leads to high numerical errors).

The numerical errors may vary according to your processor (see this Wikipedia page for instance).

Thank you @ArturoAmorQ for your feedback. I overlooked that statement in the solution.

So I’ve checked the difference, estimating the coefficients from scratch.

From this server…

Indeed the X.T @ X matrix is singular (determinant is exactly 0), so it can’t be inverted.

From my local server…

The determinant is close to, but not equal to, 0, so I can still invert here. Note that the coefficients still look reasonable.
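
For reference, here is a minimal sketch of the kind of from-scratch check described above (assuming `data` is the design matrix from the notebook and `target` the corresponding target column; adapt the names if yours differ):

import numpy as np

X = np.asarray(data, dtype=float)
y = np.asarray(target, dtype=float)

gram = X.T @ X                          # the matrix that has to be inverted
print("determinant:", np.linalg.det(gram))

# Normal-equation estimate; this fails, or is numerically meaningless,
# when the determinant is exactly 0.
coef = np.linalg.inv(gram) @ X.T @ y
print("coefficients:", coef)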

I’m not sure what happens under the hood of LinearRegression, but it nevertheless gives me the exact same coefficients as with Ridge. Any idea why?

Also, shouldn’t LinearRegression output a warning when the determinant is very close to 0?
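
In the meantime, one can check this manually before fitting, for instance by looking at the condition number or the rank of the Gram matrix (a sketch, reusing `data` from above):

import numpy as np

gram = np.asarray(data.T @ data, dtype=float)
print("condition number:", np.linalg.cond(gram))   # very large => (near-)singular
print("rank:", np.linalg.matrix_rank(gram), "out of", gram.shape[0])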

What value of alpha are you setting? The default value? Adding the penalty should give better results. For Ridge, the matrix to be inverted internally is np.dot(data.T, data) + alpha * I. Adding this penalty alpha allows the inversion without numerical issues, unless you set alpha very close to 0.
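
To illustrate the point with a toy example (a sketch, not scikit-learn’s actual solver): even when np.dot(data.T, data) is singular, adding alpha * I yields a matrix that can be inverted safely.

import numpy as np

# Toy rank-deficient design: the second column duplicates the first one.
X = np.array([[1., 1., 2.],
              [2., 2., 1.],
              [3., 3., 0.]])
gram = X.T @ X
print(np.linalg.det(gram))                       # 0 (up to rounding): singular

alpha = 1.0
print(np.linalg.det(gram + alpha * np.eye(3)))   # strictly positive: invertible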

I guess so!

No specific setting for Ridge, so the default value of alpha=1.0 is used, and we definitely have a regularization term.
The question is: how, in this case, do Ridge (regularized) and LinearRegression (unregularized), with X.T @ X actually not invertible, yield exactly the same coefficients?

Just to be really clear on this: are the results you get with Ridge on data_expanded the same as those you obtained with LinearRegression on the full data, or on data_expanded?

Have you tried reproducing the exact steps of the solution notebook?

I agree this is unexpected. Could you please share a minimal reproduction script on https://gist.github.com along with the outputs from both machines and the version numbers of all libraries?

You can use:

import sklearn
sklearn.show_versions()

to compare the versions of both environments. That will help us try to reproduce the problem and maybe open an issue on scikit-learn or an upstream library if necessary.

As requested, I provided on the gist the two notebooks, one run here on this server and one run on my machine with same code but different outputs regarding LinearRegression vs. Ridge.

Can you access the gist?

Please keep me posted about the possible reason(s) for this difference.

where?

You are using two different OSes. I thought it could be linked to the BLAS implementation, which differs from one OS to another.

For instance, I recall that on Windows, LinearRegression leads to the same results as Ridge.

That should only be the case for Ridge(alpha=0). Here this is with Ridge(alpha=1). It seems like a bug, but I have not yet investigated the details. It would be great to craft a minimal reproducer in just a few lines of Python code with a minimal dataset in order to understand the cause.
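
Something along these lines could serve as a starting point for such a reproducer (a sketch with a hypothetical rank-deficient toy dataset, to run on both machines):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
import sklearn

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
X = np.hstack([X, X[:, :1]])     # duplicated column, so X.T @ X is singular
y = rng.randn(20)

print("LinearRegression:", LinearRegression().fit(X, y).coef_)
print("Ridge(alpha=1.0): ", Ridge(alpha=1.0).fit(X, y).coef_)
sklearn.show_versions()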

You can still get some overfitting with a max_depth=3 or 5 and many trees.

oops, wrong thread? :slightly_smiling_face:

Oops indeed :slight_smile:

I am using Windows and I am getting the same results as the OP.