Boxplot interpretation

LucaPugliese · 25 March 2022 10:39

Is there any conclusion to be drawn by analyzing the box plot graph?
For example, AveOccup, Population and HouseAge have ‘stable’ coefficients across the 10 folds, while AveBedrms shows the largest variability.
How should we interpret these results?

ArturoAmorQ · 25 March 2022 14:03

Sometimes a large variability can mean that there are correlated variables, as you will see later in this module when discussing regularization. Imagine that you have two variables x1 and x2 such that x1 = a * x2 (i.e. perfect correlation). Then a model trained on those two variables as if they were independent could either give weight w to feature x1 and weight 0 to feature x2, or weight a * w to feature x2 and 0 to feature x1, or anything in between, since all of these choices produce the same predictions. When cross-validating on noisy data, the model ends up somewhere in between, and where exactly changes from one fold to the next.
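To make the x1 = a * x2 thought experiment concrete, here is a minimal sketch on synthetic data. It is not the notebook's code: the values of `a`, `w`, the noise levels and the manual 10-fold loop with NumPy least squares are all assumptions chosen for illustration. A tiny jitter is added to x1 so that the fit stays solvable.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, a, w = 500, 2.0, 3.0  # illustrative values, not from the dataset

# x1 is (almost) a * x2; the tiny jitter avoids an exactly singular problem.
x2 = rng.normal(size=n_samples)
x1 = a * x2 + rng.normal(scale=1e-3, size=n_samples)
X = np.column_stack([x1, x2])
y = w * x1 + rng.normal(scale=0.5, size=n_samples)  # noisy target

# Manual 10-fold split: fit least squares on each training part.
folds = np.array_split(rng.permutation(n_samples), 10)
coefs = []
for test_idx in folds:
    train_mask = np.ones(n_samples, dtype=bool)
    train_mask[test_idx] = False
    coef, *_ = np.linalg.lstsq(X[train_mask], y[train_mask], rcond=None)
    coefs.append(coef)
coefs = np.array(coefs)  # shape (10, 2): one row of weights per fold

# Total effect expressed on x2: weight of x1 times a, plus weight of x2.
combined = coefs[:, 0] * a + coefs[:, 1]
print("std of weight on x1:", coefs[:, 0].std())
print("std of weight on x2:", coefs[:, 1].std())
print("std of combined effect:", combined.std())
```

The individual weights should spread widely across folds while the combined effect stays nearly constant, which is the kind of pattern that shows up as wide boxes for correlated features in the box plot.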

LucaPugliese · 25 March 2022 14:57

Following your suggestion for the problem at hand, should I interpret the large variability of AveBedrms and AveRooms as a sign that they are probably strongly correlated, and does the same hold for the Longitude and Latitude pair?

ArturoAmorQ · 25 March 2022 15:20

I would say that AveBedrms and AveRooms might be correlated, yes. But the variability may have other causes: for instance, the less informative a feature is, the more the model weights will depend on the subset of data used for training. That may be the case for Longitude and Latitude, up to a point. Also keep in mind that noise is an intrinsic source of variability.
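As a rough illustration of the point about noise, the sketch below fits the same linear model on two independent synthetic features and compares the spread of the weights across 10 folds for a low and a high noise level. The helper `fold_coef_std`, the true weights and the noise scales are hypothetical choices for this example, not something taken from the course notebook.

```python
import numpy as np

def fold_coef_std(noise_scale, n_samples=500, n_folds=10, seed=0):
    """Std of least-squares weights across folds (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, 2))  # two independent features
    true_weights = np.array([1.0, -2.0])  # assumed for illustration
    y = X @ true_weights + rng.normal(scale=noise_scale, size=n_samples)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    coefs = []
    for test_idx in folds:
        train_mask = np.ones(n_samples, dtype=bool)
        train_mask[test_idx] = False
        coef, *_ = np.linalg.lstsq(X[train_mask], y[train_mask], rcond=None)
        coefs.append(coef)
    return np.std(coefs, axis=0)  # one std per feature

low_noise = fold_coef_std(noise_scale=0.1)
high_noise = fold_coef_std(noise_scale=2.0)
print("std across folds, low noise :", low_noise)
print("std across folds, high noise:", high_noise)
```

Even without any correlation between the features, the weights should vary more from fold to fold when the target is noisier, so a wide box alone does not prove that two features are correlated.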