Is there any conclusion to be drawn from the box plot?
For example, AveOccup, Population and HouseAge have 'stable' coefficients across the 10 folds, while AveBedrms shows the largest variability.
How should we interpret these results?
Sometimes a large variability can mean that there are correlated variables, as you will see later in this module when discussing regularization. Imagine that you have two variables x1 and x2 such that x1 = a * x2 (i.e. perfect correlation). Then a model trained on those two variables as if they were independent could either give weight w to feature x1 and weight 0 to feature x2; or weight a * w to feature x2 and weight 0 to feature x1; or something in between, since all of these produce the same predictions. With noisy data, cross-validation ends up doing the latter, and the way the weight is split between the two features changes from fold to fold.
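To make this concrete, here is a minimal sketch on synthetic data (not the California housing data from the exercise). The names x1, x2, x3, the factor a = 2 and the noise levels are assumptions chosen only for illustration. It fits a plain linear regression over 10 folds and compares how much each coefficient moves across folds.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(0)
n_samples = 500
a = 2.0

x2 = rng.normal(size=n_samples)
# x1 is almost exactly a * x2 (tiny noise so the problem stays solvable).
x1 = a * x2 + rng.normal(scale=0.01, size=n_samples)
# x3 is independent of the other two features.
x3 = rng.normal(size=n_samples)

X = np.column_stack([x1, x2, x3])
y = 3.0 * x1 + 1.0 * x3 + rng.normal(scale=1.0, size=n_samples)

# Keep the fitted model of each fold so we can inspect its coefficients.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
cv_results = cross_validate(
    LinearRegression(), X, y, cv=cv, return_estimator=True
)
coefs = np.array([est.coef_ for est in cv_results["estimator"]])

# Spread of each coefficient across the 10 folds.
coef_std = coefs.std(axis=0)
# Spread of the combined effect of the correlated pair, a * w1 + w2.
combined_std = (a * coefs[:, 0] + coefs[:, 1]).std()

print("std of coefficients across folds (x1, x2, x3):", coef_std)
print("std of a * w1 + w2 across folds:", combined_std)
```

In this setup the individual weights of x1 and x2 should vary much more across folds than the weight of x3, while the combined quantity a * w1 + w2 stays comparatively stable: the data pin down the joint effect of the pair, but not how it is split between the two features.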
Following your suggestion for the problem at hand, should I conclude that AveBedrms and AveRooms vary so much because they are probably strongly correlated, and that the same holds for the pair Longitude and Latitude?
I would say that AveBedrms and AveRooms might be correlated, yes. But the variability may have other causes: for instance, the less informative a feature is, the more the model weights will depend on the subset of data used for training. Maybe that is the case for Longitude and Latitude, up to a point. But take into account that noise is also an intrinsic source of variability.
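One way to check the "might be correlated" part directly is to look at the pairwise correlations between features. The helper below is a hypothetical sketch (most_correlated_pairs is not a scikit-learn or pandas function), shown on made-up columns rather than the real dataset. It only measures linear (Pearson) correlation.

```python
import numpy as np
import pandas as pd


def most_correlated_pairs(features: pd.DataFrame, top: int = 3) -> pd.Series:
    """Return the `top` feature pairs with the largest absolute correlation."""
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair appears once
    # and the diagonal (self-correlation) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(top)


# Made-up data for illustration: "bedrooms" is roughly proportional to "rooms".
rng = np.random.default_rng(0)
rooms = rng.normal(loc=5.0, scale=1.0, size=1000)
demo = pd.DataFrame(
    {
        "rooms": rooms,
        "bedrooms": 0.2 * rooms + rng.normal(scale=0.02, size=1000),
        "age": rng.normal(loc=30.0, scale=10.0, size=1000),
    }
)

top_pairs = most_correlated_pairs(demo)
print(top_pairs)
```

The same function could be applied to the feature DataFrame used in the exercise, for example the one returned by fetch_california_housing(as_frame=True).data. A high value for a pair would support the correlation explanation, while a low value would point towards the other sources of variability mentioned above.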