Hello,
In the Encoding of categorical variables notebook, in the Handling categorical data section, the last sentence of the Identify categorical variables paragraph is unclear because "because" appears twice:
Because in this notebook we will use "education" because it represents the original data.
I guess the first one should be dropped.
In the Encoding ordinal categories paragraph, I think there is a missing ‘s’; the sentence should read:
However, be careful when applying this encoding strategy: using this integer representation leads downstream predictive models to assume that the values are ordered (0 < 1 < 2 < 3… for instance).
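To illustrate the point for other readers, here is a minimal sketch on toy data (my own example, not the notebook's) of how the integer codes produced by OrdinalEncoder can be read as an order by a downstream model:
```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column, chosen only for illustration.
education = pd.DataFrame({"education": ["HS-grad", "Bachelors", "Masters", "HS-grad"]})

encoder = OrdinalEncoder()
print(encoder.fit_transform(education))
# [[1.]
#  [0.]
#  [2.]
#  [1.]]
# Categories are sorted alphabetically by default, so the codes tell a
# linear model that Bachelors < HS-grad < Masters.
```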
Same thing in the following Encoding nominal categories paragraph, plus a missing ‘n’ in “downstream”; the sentence should read:
OneHotEncoder is an alternative encoder that prevents the downstream models to make a false assumption about the ordering of categories.
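Again a small sketch on toy data (mine, not the notebook's) showing why one-hot encoding avoids that assumption: each category gets its own binary column, so no order is implied.
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

education = pd.DataFrame({"education": ["HS-grad", "Bachelors", "Masters"]})

# sparse_output=False needs scikit-learn >= 1.2.
encoder = OneHotEncoder(sparse_output=False)
print(encoder.fit_transform(education))
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]]
print(encoder.get_feature_names_out())
# ['education_Bachelors' 'education_HS-grad' 'education_Masters']
```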
In a later code cell,
print(f"The dataset encoded contains {data_encoded.shape[1]} features")
should be
print(f"The encoded dataset contains {data_encoded.shape[1]} features")
In the Evaluate our predictive pipeline paragraph:
Shouldn’t
We see that the Holand-Netherlands category is occuring rarely.
be
We see that the Holand-Netherlands category is rarely occurring.
A little later:
In scikit-learn, there is two solutions to bypass this issue:
should be
In scikit-learn, there are two solutions to bypass this issue:
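If it helps, my understanding of the two solutions, sketched on toy data (the variable names and category lists are mine, not the notebook's):
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"native-country": ["United-States", "France"]})
test = pd.DataFrame({"native-country": ["Holand-Netherlands"]})  # unseen at fit time

# Solution 1: pass the full list of categories known from the whole dataset,
# so rare categories get a column even if absent from the training split.
all_countries = [["United-States", "France", "Holand-Netherlands"]]
enc1 = OneHotEncoder(categories=all_countries, sparse_output=False).fit(train)
print(enc1.transform(test))  # [[0. 0. 1.]]

# Solution 2: ignore categories unseen during fit (encoded as all zeros).
enc2 = OneHotEncoder(handle_unknown="ignore", sparse_output=False).fit(train)
print(enc2.transform(test))  # [[0. 0.]]
```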
In the note following the creation of the pipeline:
Here, we need to increase the number of maximum iterations to obtain a fully converged LogisticRegression
I think "the number of maximum iterations" should be "the maximum number of iterations". And, in the same note,
Contrary to numerical features, the one-hot encoded categorical features do not suffer from large variations and therefore increasing max_iter is the right thing to do.
Maybe "Contrary to" should be "Unlike", and the "therefore" reads oddly to me: I suppose the intended point is that scaling would not help these features, so increasing max_iter is the only remaining way to fix the convergence warning.
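For what it's worth, here is the shape of pipeline I believe the note refers to (the names and the max_iter value are my own illustration, not the notebook's exact code):
```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

model = make_pipeline(
    # Ignore categories unseen during fit, as discussed above.
    OneHotEncoder(handle_unknown="ignore"),
    # Raise the cap on solver iterations so LogisticRegression fully converges.
    LogisticRegression(max_iter=500),
)
```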