R2 score and a little bug with the font

geogeo14000 · 18 June 2021 07:17

Good morning,

First I have a question about “The 𝑅2 score represents the proportion of variance of the target that is explained by the independent variables in the model. The best score possible is 1 but there is no lower bound. However, a model that predicts the expected value of the target would get a score of 0.”

It’s a bit obscure to me. What’s the link between the proportion of variance of the target and the independent variables in the model? What does it mean, and how is the variance explained by them?

As for the score, it says that a model that predicts the expected value of the target would get a score of 0, so why is the best score 1 while a correct prediction gives a score of 0? Could you give some more details about the scoring, please?

Finally, I just want to report what I think is a small font problem in the cell about the mean absolute error that “still have a limitation”, where it says an error of 50k for a house…etc.

Thank you,

Geoffrey.

glemaitre58 · 19 June 2021 20:21

I think that I like the definition on Wikipedia:

In statistics, explained variation measures the proportion to which a mathematical model accounts for the variation (dispersion) of a given data set.

So, with a predictive model, we would like an accurate model that predicts our target perfectly and thus accounts for all the variation in the original data. However, this is usually not the case. We measure an error: the residuals, which are the part of the variation in the original data that our model cannot explain. Finally, the total variation is the sum of the explained variation and the residual variation.

The R2 score is thus the ratio between the variation explained by the model and the total variation.

You can check the Wikipedia figure:

The R2 score is defined as:

R2 = 1 - ("residual sum of squares (blue squares)" / "total sum of squares (red squares)")

Predicting the mean makes the residual sum of squares equal to the total sum of squares, and thus R2 = 1 - 1 = 0. No error means the residual sum of squares is zero, and thus R2 = 1. You can get a negative score if the residual sum of squares is larger than the total sum of squares, that is, if the model does worse than always predicting the mean.
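
To make the three cases concrete, here is a minimal sketch in plain Python. The helper name `r2_score_manual` and the toy target values are made up for illustration and are not taken from the course notebook; in practice `sklearn.metrics.r2_score` computes the same quantity.

```python
def r2_score_manual(y_true, y_pred):
    """Hypothetical helper: R2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared gaps between the target and the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared gaps between the target and its own mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy target values, chosen only so the arithmetic is easy to follow.
y_true = [1.0, 2.0, 3.0, 4.0]
mean_value = sum(y_true) / len(y_true)

# Perfect predictions: no residual at all.
r2_perfect = r2_score_manual(y_true, y_true)
# Always predicting the mean: residual sum of squares equals total sum of squares.
r2_mean = r2_score_manual(y_true, [mean_value] * len(y_true))
# Predictions worse than the mean (here the targets in reverse order).
r2_worse = r2_score_manual(y_true, [4.0, 3.0, 2.0, 1.0])

print(r2_perfect, r2_mean, r2_worse)
```

The first value is 1, the second is 0, and the third is negative because the residual sum of squares (20) exceeds the total sum of squares (5).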