Hi,
I have found something interesting in question 6.
I compared the scores of two models that use only numerical data: one uses the numerical columns chosen by a selector (36 columns), the other uses the numerical columns listed in the quiz instructions (24 columns). With cv=5, the mean scores are 0.92055 and 0.8952 respectively, so the selector-based model scores higher than the one that follows the quiz instructions.
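For reference, the comparison can be sketched roughly as below. This is only an illustration under assumptions: the data is a small synthetic stand-in (the column names `area`, `rooms`, `zone_code` and `street` are made up, not from the Ames dataset), and the model (scaling followed by logistic regression) is assumed rather than taken from the quiz.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (hypothetical column names).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "area": rng.normal(size=n),
    "rooms": rng.integers(1, 8, size=n),
    "zone_code": rng.integers(0, 5, size=n),  # integer-coded category
    "street": rng.choice(["paved", "gravel"], size=n),
})
noise = rng.normal(scale=0.5, size=n)
target = (data["area"] + 0.3 * data["rooms"] + noise > 1.2).astype(int)

# Column set 1: every column with a numeric dtype, picked by the selector.
selected_by_dtype = make_column_selector(dtype_include=np.number)(data)
# Column set 2: a hand-written list, playing the role of the quiz list.
listed_by_hand = ["area", "rooms"]

# Assumed model: standardize the features, then fit a logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression())

scores = {}
for name, columns in [("selector", selected_by_dtype),
                      ("hand list", listed_by_hand)]:
    scores[name] = cross_val_score(model, data[columns], target, cv=5)
    print(f"{name}: {len(columns)} columns, "
          f"mean score {scores[name].mean():.4f}")
```

Note that the selector picks columns by dtype only, so it also returns `zone_code`, which stores category labels as integers.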
So I think that, from the model's point of view, the 36 - 24 = 12 columns that the quiz instructions do not treat as numerical probably carry meaningful numerical information rather than purely categorical information. This is only a general impression, because I did not check each of these 12 columns individually.
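One way to check those extra columns individually is to list the columns that the dtype-based selection returns but the hand-written list omits, and count their distinct values. The sketch below again uses a synthetic frame with made-up column names; `listed_by_hand` is a hypothetical stand-in for the quiz list.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (hypothetical column names).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "area": rng.normal(size=n),
    "rooms": rng.integers(1, 8, size=n),
    "zone_code": rng.integers(0, 5, size=n),  # integer-coded category
    "street": rng.choice(["paved", "gravel"], size=n),
})

listed_by_hand = ["area", "rooms"]  # stand-in for the quiz list
selected_by_dtype = data.select_dtypes(include="number").columns.tolist()

# Columns that are numeric by dtype but absent from the hand-written list.
extra = [col for col in selected_by_dtype if col not in listed_by_hand]

# Per extra column: number of distinct values and storage dtype.
summary = data[extra].nunique().to_frame("n_unique")
summary["dtype"] = data[extra].dtypes.astype(str)
print(summary)
```

A numeric dtype with only a few distinct values often points to an integer-coded category or an ordinal rating, although this is a heuristic and not a proof.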
I have looked at “The Ames housing dataset” in the course appendix, but I still do not understand why the quiz uses a subset of the numerical features (24) rather than the full set of numerical columns (36, via the selector), especially since the full set gives better accuracy, as its mean score suggests (0.92055 vs 0.8952).
In addition, models that use combined data (categorical columns together with numerical columns, chosen either by the selector or by the quiz instructions) give the same score as the model that uses only the numerical data chosen by the selector (0.92055).
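A combined model of this kind can be sketched as below, which also counts how many features actually reach the final estimator. Identical scores may be worth checking against that count, since they could mean the categorical columns never entered the model. As before, the data is synthetic, the column names are made up, and the choice of `OneHotEncoder` with logistic regression is an assumption, not the quiz's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real data (hypothetical column names).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "area": rng.normal(size=n),
    "rooms": rng.integers(1, 8, size=n),
    "zone_code": rng.integers(0, 5, size=n),
    "street": rng.choice(["paved", "gravel"], size=n),
})
noise = rng.normal(scale=0.5, size=n)
target = (data["area"] + 0.3 * data["rooms"] + noise > 1.2).astype(int)

# Scale numeric-dtype columns, one-hot encode everything else.
preprocessor = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include=np.number)),
    (OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_exclude=np.number)),
)
model = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))

scores = cross_val_score(model, data, target, cv=5)
print(f"mean score: {scores.mean():.4f}")

# Fit once on all data to count the features seen by the classifier.
model.fit(data, target)
n_features_out = model[:-1].transform(data).shape[1]
print(f"features reaching the classifier: {n_features_out}")
```

Here the count should be the three scaled numeric columns plus one column per category of `street`; a count equal to the numerical-only model would indicate that the categorical branch is not being applied.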
Could you please further explain this?
Thanks T.G.