Boxplot interpretation

Is there any conclusion to be drawn from the box plot?
For example, AveOccup, Population and HouseAge have ‘stable’ coefficients across the 10 folds, while AveBedrms shows the largest variability.
How should we interpret these results?


Sometimes a large variability can mean that some variables are correlated, as you will see later in this module when discussing regularization. Imagine that you have two variables x1 and x2 such that x1 = a * x2 (i.e. perfect correlation). A model trained on those two variables as if they were independent could then give weight w to x1 and weight 0 to x2, or weight a * w to x2 and 0 to x1, or anything in between, since all of these produce the same predictions. On noisy data, each cross-validation fold will typically land on a different point in between, which shows up as a large spread in the coefficients.
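To make the x1 = a * x2 scenario concrete, here is a minimal sketch on synthetic data (not the California housing data). The data-generating process, the value a = 2, the noise levels and the helper name `fold_coefficients` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold


def fold_coefficients(X, y, n_splits=10, seed=0):
    """Fit a linear regression on each training fold and stack the coefficients."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    coefs = [
        LinearRegression().fit(X[train], y[train]).coef_
        for train, _ in cv.split(X)
    ]
    return np.array(coefs)  # shape: (n_splits, n_features)


# Synthetic data (assumption): x1 is almost exactly 2 * x2, x3 is independent.
rng = np.random.default_rng(0)
n = 200
x2 = rng.normal(size=n)
x1 = 2.0 * x2 + rng.normal(scale=0.01, size=n)
x3 = rng.normal(size=n)
y = 3.0 * x1 + 1.0 * x3 + rng.normal(scale=1.0, size=n)

X = np.column_stack([x1, x2, x3])
coefs = fold_coefficients(X, y)

# Spread of each coefficient across the 10 folds.
spread = coefs.std(axis=0)
print(dict(zip(["x1", "x2", "x3"], spread.round(3))))

# The combined effect of the correlated pair, expressed on the x2 scale,
# should vary much less than the individual weights do.
combined = 2.0 * coefs[:, 0] + coefs[:, 1]
print("combined spread:", combined.std().round(3))
```

In a setup like this, the individual weights of x1 and x2 should vary a lot from fold to fold while the combination 2 * w1 + w2 and the weight of x3 stay comparatively stable, which is the pattern a box plot of coefficients would show for a correlated pair.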

Following your suggestion for the problem at hand, should I interpret the large variability of AveBedrms and AveRooms as a sign that they are probably strongly related, and does the same hold for the Longitude and Latitude pair?

I would say that AveBedrms and AveRooms might be correlated, yes. But the variability may have other causes. For instance, the less informative a feature is, the more the model weights will depend on the subset of data used for training. Maybe that is the case for Longitude and Latitude, up to a point. Also keep in mind that noise is an intrinsic source of variability.
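One way to check the correlation hypothesis directly is to look at pairwise correlations between the features. The sketch below is only an illustration: the helper `correlated_pairs`, the 0.8 threshold and the synthetic columns `a`, `b`, `c` are assumptions, and on the real data you would pass the feature matrix and its column names instead.

```python
import numpy as np


def correlated_pairs(X, names, threshold=0.8):
    """Return (name_i, name_j, correlation) for pairs with |correlation| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are treated as variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


# Synthetic stand-in for a feature matrix (assumption): b closely follows a,
# c is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)
c = rng.normal(size=500)
X_demo = np.column_stack([a, b, c])

pairs = correlated_pairs(X_demo, ["a", "b", "c"])
print(pairs)
```

A pair that shows up here is a candidate for the correlated-features explanation. A feature with unstable weights that does not show up is more likely explained by low informativeness or noise.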
