Hello,
In the Encoding of categorical variables notebook, in the Handling categorical data section, the last sentence of the Identify categorical variables paragraph is unclear because "because" appears twice:
Because in this notebook we will use "education" because it represents the original data.
I guess the first one should be dropped.
In the Encoding ordinal categories paragraph, I think there is a missing ‘s’; the sentence should read:
However, be careful when applying this encoding strategy: using this integer representation leads downstream predictive models to assume that the values are ordered (0 < 1 < 2 < 3… for instance).
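To illustrate the point for other readers, here is a minimal sketch on toy data (my own example, not the notebook's) of how the integer codes produced by OrdinalEncoder can be read as an order by a downstream model:
```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column, chosen only for illustration.
education = pd.DataFrame({"education": ["HS-grad", "Bachelors", "Masters", "HS-grad"]})

encoder = OrdinalEncoder()
print(encoder.fit_transform(education))
# [[1.]
#  [0.]
#  [2.]
#  [1.]]
# Categories are sorted alphabetically by default, so the codes tell a
# linear model that Bachelors < HS-grad < Masters.
```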
Same thing in the following Encoding nominal categories paragraph, plus a missing ‘n’ in “downstream”; the sentence should read:
OneHotEncoder is an alternative encoder that prevents the downstream models to make a false assumption about the ordering of categories.
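Again a small sketch on toy data (mine, not the notebook's) showing why one-hot encoding avoids that assumption: each category gets its own binary column, so no order is implied.
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

education = pd.DataFrame({"education": ["HS-grad", "Bachelors", "Masters"]})

# sparse_output=False needs scikit-learn >= 1.2.
encoder = OneHotEncoder(sparse_output=False)
print(encoder.fit_transform(education))
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]]
print(encoder.get_feature_names_out())
# ['education_Bachelors' 'education_HS-grad' 'education_Masters']
```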
In a later code cell,
print(f"The dataset encoded contains {data_encoded.shape[1]} features")
should be
print(f"The encoded dataset contains {data_encoded.shape[1]} features")
In the Evaluate our predictive pipeline paragraph:
Shouldn’t
We see that the Holand-Netherlands category is occuring rarely.
be
We see that the Holand-Netherlands category is rarely occurring.
A little later:
In scikit-learn, there is two solutions to bypass this issue:
should be
In scikit-learn, there are two solutions to bypass this issue:
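If it helps, my understanding of the two solutions, sketched on toy data (the variable names and category lists are mine, not the notebook's):
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"native-country": ["United-States", "France"]})
test = pd.DataFrame({"native-country": ["Holand-Netherlands"]})  # unseen at fit time

# Solution 1: pass the full list of categories known from the whole dataset,
# so rare categories get a column even if absent from the training split.
all_countries = [["United-States", "France", "Holand-Netherlands"]]
enc1 = OneHotEncoder(categories=all_countries, sparse_output=False).fit(train)
print(enc1.transform(test))  # [[0. 0. 1.]]

# Solution 2: ignore categories unseen during fit (encoded as all zeros).
enc2 = OneHotEncoder(handle_unknown="ignore", sparse_output=False).fit(train)
print(enc2.transform(test))  # [[0. 0.]]
```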
In the note following the creation of the pipeline:
Here, we need to increase the number of maximum iterations to obtain a fully converged LogisticRegression
I think "the number of maximum iterations" should be "the maximum number of iterations". And, in the same note,
Contrary to numerical features, the one-hot encoded categorical features do not suffer from large variations and therefore increasing max_iter is the right thing to do.
Maybe "Contrary to" should be "Unlike", and the "therefore" reads oddly to me: I suppose the intended point is that scaling would not help these features, so increasing max_iter is the only remaining way to fix the convergence warning.
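For what it's worth, here is the shape of pipeline I believe the note refers to (the names and the max_iter value are my own illustration, not the notebook's exact code):
```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

model = make_pipeline(
    # Ignore categories unseen during fit, as discussed above.
    OneHotEncoder(handle_unknown="ignore"),
    # Raise the cap on solver iterations so LogisticRegression fully converges.
    LogisticRegression(max_iter=500),
)
```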