Wrap-up quiz 1, Question 6: choosing numerical data with a selector affects the score

tsvikagreener · 19 November 2022 17:42

Hi,
I have found something interesting in question 6.
I compared the scores of two models that use only numerical data: one uses the numerical columns chosen by a selector (36 columns), and the other uses the numerical columns listed in the quiz instructions (24 columns). With cv=5, the mean scores are 0.92055 and 0.8952 respectively, so the model using the selector's columns scores better than the one using the columns from the quiz instructions.

So I think that, from the model's point of view, the 36 - 24 = 12 columns that the quiz instructions do not treat as numerical probably carry meaningful numerical information rather than categorical information. This is a general impression, because I did not check each of these 12 columns individually.
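For readers who want to reproduce this kind of comparison, here is a minimal sketch. It does not use the Ames data or the quiz's exact pipeline: the small DataFrame is synthetic, the column names are borrowed only for illustration, and the logistic-regression pipeline and the `hand_picked` list are assumptions. It shows that a dtype-based selector also picks up integer-coded categories, and how the two column sets can be scored with cv=5.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (assumption, not the real dataset).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "LotArea": rng.normal(10000, 2000, n),
    "GrLivArea": rng.normal(1500, 300, n),
    "MSSubClass": rng.choice([20, 60, 120], n),  # integer-coded category
    "Street": rng.choice(["Pave", "Grvl"], n),   # string category
})
# Hypothetical binary target driven by one numerical column plus noise.
target = (data["GrLivArea"] + rng.normal(0, 100, n) > 1500).astype(int)

# The selector only looks at dtypes, so it keeps MSSubClass as well.
selected = make_column_selector(dtype_include=np.number)(data)
# Hypothetical hand-picked list, standing in for the quiz's 24 columns.
hand_picked = ["LotArea", "GrLivArea"]

model = make_pipeline(StandardScaler(), LogisticRegression())
results = {}
for name, cols in [("selector", selected), ("hand-picked", hand_picked)]:
    results[name] = cross_val_score(model, data[cols], target, cv=5)
    print(f"{name}: {len(cols)} columns, "
          f"mean={results[name].mean():.4f}, std={results[name].std():.4f}")
```

Printing the standard deviation next to the mean helps judge whether a gap between the two column sets is larger than the fold-to-fold spread.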

I have looked at “The Ames housing dataset” in the course appendix, but I still do not understand why the quiz uses a subset of the numerical features (24) rather than the full set of numerical columns (36, via the selector), especially since the full set gives better accuracy, as its mean score suggests (0.92055 vs 0.8952).

In addition, models that use combined data (categorical columns together with numerical columns, whether chosen by the selector or taken from the quiz instructions) give the same score as the model that uses only the numerical data chosen by the selector (0.92055).

Could you please further explain this?
Thanks, T.G.

ArturoAmorQ · 21 November 2022 10:27

Hi @tsvikagreener,

Maybe you can take a look at the discussion here.

But the short answer is that we selected that subset of features for reproducibility of the results. In more advanced modules you will see that correlated features add variability to the scores, and we don’t want that because we want all students to get similar answers regardless of their setup.
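
As a rough illustration of what "correlated features" means here, the sketch below lists pairs of numerical columns whose absolute Pearson correlation exceeds a threshold. The data is synthetic, the column names are only illustrative, and the 0.9 cutoff is an arbitrary assumption; this is not how the quiz's feature subset was chosen.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): GarageArea is built to track GarageCars closely.
rng = np.random.default_rng(0)
n = 200
garage_cars = rng.integers(1, 4, n)
data = pd.DataFrame({
    "GarageCars": garage_cars,
    "GarageArea": garage_cars * 250 + rng.normal(0, 20, n),
    "LotArea": rng.normal(10000, 2000, n),
})

# Absolute pairwise Pearson correlations between the numerical columns.
corr = data.corr().abs()
cols = list(corr.columns)

# Keep each unordered pair once; 0.9 is an arbitrary illustrative cutoff.
high_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if corr.loc[a, b] > 0.9
]
print(high_pairs)
```

Running the same check on the selector's 36 columns would show which of them carry overlapping information.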