Questions 4 and 6

owen77s · 22 December 2022 12:39

Hello,

First question: I don’t understand why YearBuilt is considered a numerical feature, given that there is a finite number of years?

Second question: why does using more features allow our model to be more efficient? Is there a limit on the number of features the model can use for prediction?

Owen

glemaitre · 3 January 2023 16:47

This one is a bit tricky. Whether to treat “years” as categorical or numerical comes down to a modelling choice.

Here, I would say that “year” can be considered numerical in the same way “temperature” can be. Temperature is bounded, at least within what we can measure. What makes it clearly numerical is that we expect it to be a floating-point number, but we could have integer measurements as well and we would still consider it numerical.

For “years”, we can consider that this is a measure of time. What we measure is usually bounded, and we round it. So this is a bit similar to temperature. But as I said, when it comes to using it in a predictive model, we can choose to model it either as a numerical value or as a categorical value. The numerical value keeps the link to a measure of time (the ordering and distance between years), while the categorical approach does not represent such information.
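To make the difference concrete, here is a minimal sketch in plain Python. The list of years is made up for illustration (it is not taken from the dataset), and the helper functions are hypothetical names, not part of any library. It only shows that a numerical encoding keeps the notion of "how far apart" two years are, while a one-hot (categorical) encoding treats any two different years as equally different.

```python
# Hypothetical sample of construction years (not taken from the real dataset).
years = [1950, 1975, 2000, 2005]


def numeric_distance(a, b):
    """Numerical view: the year itself is the feature, so gaps are meaningful."""
    return abs(a - b)


# Categorical view: each distinct year becomes its own one-hot column.
categories = sorted(set(years))


def one_hot(year):
    """Return the one-hot vector of `year` over the known categories."""
    return [1 if year == c else 0 for c in categories]


def one_hot_distance(a, b):
    """Count the positions at which the two one-hot vectors differ."""
    return sum(x != y for x, y in zip(one_hot(a), one_hot(b)))


print(numeric_distance(2000, 2005))  # 5
print(numeric_distance(1950, 2005))  # 55
print(one_hot_distance(2000, 2005))  # 2
print(one_hot_distance(1950, 2005))  # 2
```

With the numerical encoding, 2000 is much closer to 2005 than 1950 is. With the one-hot encoding, both pairs are equally far apart, so the model no longer sees the time ordering.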

Regarding the second question, you should probably define what you mean by “efficient”. I will assume that you mean efficient in terms of generalization score.

Adding new features makes the model more flexible. Indeed, it gets more information to “create new rules”. However, if you start to have too many features, the model will have too much flexibility and will “create rules” for noisy data points. This is what we call overfitting.

Those aspects are discussed in the second chapter.
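As a rough illustration of the overfitting point, here is a small sketch on synthetic data. Everything in it is an assumption made for the example: the data is randomly generated, the model is an ordinary least-squares fit done with NumPy rather than the course's scikit-learn models, and `fit_and_score` is a hypothetical helper. The target depends on a single informative feature, and we progressively add columns of pure noise.

```python
import numpy as np

rng = np.random.default_rng(0)

n_train, n_test = 30, 200

# One informative feature; the target depends only on it, plus some noise.
x_train = rng.normal(size=(n_train, 1))
x_test = rng.normal(size=(n_test, 1))
y_train = 3.0 * x_train[:, 0] + rng.normal(size=n_train)
y_test = 3.0 * x_test[:, 0] + rng.normal(size=n_test)

# Extra columns of pure noise, unrelated to the target.
noise_train = rng.normal(size=(n_train, 28))
noise_test = rng.normal(size=(n_test, 28))


def fit_and_score(n_noise):
    """Fit least squares using `n_noise` noise features; return (train MSE, test MSE)."""
    A_train = np.hstack([np.ones((n_train, 1)), x_train, noise_train[:, :n_noise]])
    A_test = np.hstack([np.ones((n_test, 1)), x_test, noise_test[:, :n_noise]])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    train_mse = np.mean((A_train @ coef - y_train) ** 2)
    test_mse = np.mean((A_test @ coef - y_test) ** 2)
    return train_mse, test_mse


results = {k: fit_and_score(k) for k in (0, 5, 15, 28)}
for k, (train_mse, test_mse) in results.items():
    print(f"{k:2d} noise features: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

The training error can only go down as columns are added, and with 28 noise features (30 columns for 30 training samples) the model fits the training set essentially perfectly. The error on the held-out samples does not follow, which is the “rules for noisy data points” effect described above.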

bext_la · 10 January 2023 14:07

Hi Glemaitre,

Why are “OverallQual” and “OverallCond” not considered numerical as well? What is the main difference?

glemaitre · 10 January 2023 16:03

I don’t have the dataset in front of me, but if I recall correctly, “OverallCond” and “OverallQual” are grades between 0 and 10. So they are categorical variables.