Different outputs for linear regression coefficients

Hi,

I get two different outputs for the linear regression coefficients with the same code, depending on whether I run it on this server or on my local server.

From this server…

From my local server…

As a matter of fact, I get the same coefficients as with Ridge regression.

Other things being equal (unless I’m mistaken), do you have any idea why I get these two different coefficient outputs?

As mentioned in the solution (you have to click on “Look at the solution:” below the Jupyter notebook to display it):

It indeed means that we are trying to solve a mathematically ill-posed problem: finding the coefficients of a linear regression involves inverting the matrix np.dot(data.T, data), which is not possible here (or leads to high numerical errors).

The numerical errors may vary according to your processor (see this Wikipedia page for instance).

Thank you @ArturoAmorQ for your feedback. I overlooked that statement in the solution.

So I’ve checked the difference, estimating the coefficients from scratch.

From this server…

Indeed the X.T @ X matrix is singular (determinant is exactly 0), so it can’t be inverted.

From my local server…

The determinant is close to, but not equal to, 0, so I can still invert here. Note that the coefficients still look reasonable.
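
For reference, here is a minimal sketch of the kind of from-scratch check described above (assuming `data` is the design matrix from the notebook and `target` the corresponding target column; adapt the names if yours differ):

import numpy as np

X = np.asarray(data, dtype=float)
y = np.asarray(target, dtype=float)

gram = X.T @ X                          # the matrix that has to be inverted
print("determinant:", np.linalg.det(gram))

# Normal-equation estimate; this fails, or is numerically meaningless,
# when the determinant is exactly 0.
coef = np.linalg.inv(gram) @ X.T @ y
print("coefficients:", coef)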

I’m not sure what happens under the hood of LinearRegression, but it nevertheless gives me the exact same coefficients as with Ridge. Any idea why?

Also, shouldn’t LinearRegression output a warning when the determinant is very close to 0?
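
In the meantime, one can check this manually before fitting, for instance by looking at the condition number or the rank of the Gram matrix (a sketch, reusing `data` from above):

import numpy as np

gram = np.asarray(data.T @ data, dtype=float)
print("condition number:", np.linalg.cond(gram))   # very large => (near-)singular
print("rank:", np.linalg.matrix_rank(gram), "out of", gram.shape[0])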

What value of alpha are you setting? The default value? Adding the penalty should give better results. For Ridge, the matrix to be inverted internally is np.dot(data.T, data) + alpha * I. Adding this penalty alpha allows the inversion without numerical issues, unless you set alpha very close to 0.
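
To illustrate the point with a toy example (a sketch, not scikit-learn’s actual solver): even when np.dot(data.T, data) is singular, adding alpha * I yields a matrix that can be inverted safely.

import numpy as np

# Toy rank-deficient design: the second column duplicates the first one.
X = np.array([[1., 1., 2.],
              [2., 2., 1.],
              [3., 3., 0.]])
gram = X.T @ X
print(np.linalg.det(gram))                       # 0 (up to rounding): singular

alpha = 1.0
print(np.linalg.det(gram + alpha * np.eye(3)))   # strictly positive: invertible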

I guess so!

No specific setting for Ridge, so the default value of alpha=1.0 is used, and we definitely have a regularization term.
The question is: how, in this case, do Ridge (regularized) and LinearRegression (unregularized), with X.T @ X actually not invertible, yield exactly the same coefficients?

Just to be really clear on this: are the results you get with Ridge on data_expanded the same as those you obtained with LinearRegression on the full data, or on data_expanded?

Have you tried reproducing the exact steps of the solution notebook?

I agree this is unexpected. Could you please share a minimal reproduction script on https://gist.github.com along with the outputs from both machines and the version numbers of all libraries?

You can use:

import sklearn
sklearn.show_versions()

to compare the versions of both environments. That will help us try to reproduce the problem and maybe open an issue on scikit-learn or an upstream library if necessary.

As requested, I provided on the gist the two notebooks, one run here on this server and one run on my machine with same code but different outputs regarding LinearRegression vs. Ridge.

Can you access the gist?

Please keep me posted about the possible reason(s) for this difference.

where?

You are using two different OSes. I thought it could be linked to the BLAS implementation, which differs from one OS to another.

For instance, I recall that on Windows, LinearRegression leads to the same results as Ridge.

That should only be the case for Ridge(alpha=0). Here this is with Ridge(alpha=1). It seems like a bug, but I have not yet investigated the details. It would be great to craft a minimal reproducer in just a few lines of Python code with a minimal dataset in order to understand the cause.
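
Something along these lines could serve as a starting point for such a reproducer (a sketch with a hypothetical rank-deficient toy dataset, to run on both machines):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
import sklearn

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
X = np.hstack([X, X[:, :1]])     # duplicated column, so X.T @ X is singular
y = rng.randn(20)

print("LinearRegression:", LinearRegression().fit(X, y).coef_)
print("Ridge(alpha=1.0): ", Ridge(alpha=1.0).fit(X, y).coef_)
sklearn.show_versions()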

You can still get some overfitting with a max_depth=3 or 5 and many trees.

oops, wrong thread? :slightly_smiling_face:

Oops indeed :slight_smile:

I am using Windows and I am getting the same results as the OP.