Typos, Phrasing & Other formal remarks (Module 1)

Hi, I am opening this topic to discuss and report a few things I noticed or found unclear in the course documents for the “Module 1. The Predictive Modeling Pipeline” section.

Feel free to participate or to contradict me!

Module 1.1


file: 01_tabular_data_exploration.ipynb

  • Note: Data is called tabular when it has a named column.

    This definition is not crystal clear to me. “A” named column? Rather “at least one named column”, or simply “named columns” (or “when it has the shape of a table”).

  • if you are young (less than 25 year-old roughly) or old (more than 70 year-old roughly) you tend to work less.

    This is true, but it is not very visible in the plot, since it is hard to judge the density of the point cloud. We could also say that the people with higher values are between 25 and 75 years old (this is easier to see).

  • ../figures/simple_decision_tree_adult_census.png is a little bit cropped on the bottom.

Quiz 02

Question 1. Saying that we can plot with pandas is a bit troublesome, as 1) it is not demonstrated in the notebook (and we cannot deduce it without external knowledge), and 2) if it requires matplotlib (or seaborn, in the same way?), can we really say that pandas can plot? Maybe I am getting something wrong. It is also okay to learn things from quizzes, but this one seems a bit tricky to me.
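
To make point 2) concrete, here is a minimal sketch of what I mean. It is not taken from the course; the toy `ages` series is made up for illustration. As far as I understand, pandas exposes plotting methods but delegates the drawing to a backend, which is matplotlib by default:

```python
import pandas as pd

# pandas delegates plotting to a backend; the default value of this option
# is "matplotlib".
backend = pd.get_option("plotting.backend")
print(backend)

# Hypothetical toy data, only here to have something to plot.
ages = pd.Series([25, 38, 28, 44, 18], name="age")

# The plotting call only works if the backend library is installed;
# otherwise pandas raises an ImportError.
try:
    ages.plot.hist()
    plotted = True
except ImportError:
    plotted = False
print(plotted)
```

So "pandas can plot" is true in the sense that the method lives on the pandas object, but the figure itself comes from matplotlib.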

Module 1.2

Exercise M1.01

file: 02_numerical_pipeline_ex_01.ipynb

So 81% accuracy is significantly better than 76%

“Significantly” made me think of statistical significance, but that is obviously not what is meant here.

“Preprocessing for numerical features”

file: 02_numerical_pipeline_scaling.ipynb

  • let’s charge the full adult census dataset

    Isn’t “charge” a Gallicism? I would have said “load”.

  • the predictive performance (accuracy) slightly improved

    Well, true, but it is not visible with three significant digits (nor with four; we need five to see a difference): 0.807 is printed for both models.
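
A small sketch of the rounding issue, in case it helps. The two scores below are hypothetical values that I picked to reproduce the effect; they are not the actual outputs of the notebook:

```python
# Hypothetical accuracy values, chosen only to illustrate the rounding issue;
# they are NOT the scores produced by the notebook.
score_without_scaling = 0.80702
score_with_scaling = 0.80704

# With three decimals both scores print identically.
three_digits = (f"{score_without_scaling:.3f}", f"{score_with_scaling:.3f}")
# With five decimals the difference becomes visible.
five_digits = (f"{score_without_scaling:.5f}", f"{score_with_scaling:.5f}")

print(three_digits)
print(five_digits)
```

Printing the scores with more decimals in the notebook, or rewording “slightly improved”, would avoid the mismatch between the text and the output.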

Quiz 03

Question 1. There is a “d)” but there is no “c)”.

Question 5. “trained estimators”: I don’t remember the course stating that a fitted model can be called an “estimator”. For people unfamiliar with this notion, it may be unclear. It is called “estimator instance” in the scikit-learn Glossary, if I’m not mistaken.

Question 5 & 6. “a)” is missing.


Module 1.3

The option drop="if_binary" is used in 03_categorical_pipeline.ipynb but explained only in the following notebook 03_categorical_pipeline_column_transformer.ipynb.
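
For readers who hit this option before it is explained, here is a minimal sketch of its effect as I understand it. The toy data is made up, with one binary column and one column with three categories:

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: column 0 is binary, column 1 has three categories.
X = [["Male", "Private"], ["Female", "State-gov"], ["Male", "Self-emp"]]

# Default behaviour: one output column per category (2 + 3 = 5 columns).
full = OneHotEncoder().fit_transform(X).toarray()

# drop="if_binary": the binary feature keeps a single column (1 + 3 = 4),
# while features with more than two categories are left untouched.
encoder = OneHotEncoder(drop="if_binary")
encoded = encoder.fit_transform(X).toarray()

print(full.shape)
print(encoded.shape)
```

A one-line remark of this kind in the first notebook would be enough.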

file: 03_categorical_pipeline.ipynb

set the parameter handle_unknown="ignore"

It would be clearer to specify that this parameter has to be set on OneHotEncoder(). Otherwise it is not self-evident (one could think it is up to cross_val_score() or LogisticRegression() to handle this).
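
A minimal sketch of where the parameter goes, using made-up toy data rather than the course dataset. The encoder is the object that decides what happens to a category it did not see during fit:

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the test category was never seen during fit.
train = [["Private"], ["State-gov"]]
test = [["Never-worked"]]

# handle_unknown="ignore" is a parameter of OneHotEncoder: the unknown
# category is encoded as a row of zeros.
encoder = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded_unknown = encoder.transform(test).toarray()
print(encoded_unknown)

# With the default (handle_unknown="error"), transform raises a ValueError.
strict = OneHotEncoder().fit(train)
try:
    strict.transform(test)
    raised = False
except ValueError:
    raised = True
print(raised)
```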

file: 03_categorical_pipeline_ex_02.ipynb

The hint about using sparse=False in OneHotEncoder() is mentioned both at the beginning and at the end of the file.

