Typos, Phrasing & Other formal remarks (Module 1)

TristanH · 17 February 2021 18:39

Hi, I am opening this topic to discuss and report some things I noticed or found unclear in the course documents for the “Module 1. The Predictive Modeling Pipeline” section.

Feel free to participate or to contradict me!

TristanH · 17 February 2021 18:40

Module 1.1

Notebook

file: 01_tabular_data_exploration.ipynb

Note: Data is called tabular when it has a named column.

This definition does not sound crystal clear to me. “A” named column? More like “at least” a named column, or even “named columns” (or “when it has the shape of a table”).

if you are young (less than 25 year-old roughly) or old (more than 70 year-old roughly) you tend to work less.

This is true, but we don’t see it very clearly since it is hard to judge the density of the point cloud. We could also say that people with higher values are between 25 and 75 years old (we see this more clearly).

../figures/simple_decision_tree_adult_census.png is slightly cropped at the bottom.

Quiz 02

Question 1. Saying that we can plot with pandas is a bit troublesome, as 1) it is not demonstrated in the notebook (and we cannot deduce this without external knowledge), and 2) if it requires matplotlib (or seaborn in the same way?), can we really say it can? Maybe I am getting something wrong; it is also okay to learn things from quizzes, but it seems a bit tricky to me here.
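
To make the point about pandas and matplotlib concrete, here is a minimal sketch, assuming matplotlib is installed. The `age` values are hypothetical and not taken from the course dataset. It shows that the pandas plotting accessor returns a matplotlib `Axes` object, i.e. pandas delegates the drawing to matplotlib by default.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import pandas as pd
from matplotlib.axes import Axes

# Hypothetical values, for illustration only.
df = pd.DataFrame({"age": [25, 38, 28, 44, 18, 34]})

# pandas exposes plotting methods, but the object returned is a matplotlib Axes.
ax = df["age"].plot.hist(bins=3)
print(isinstance(ax, Axes))
```

So "we can plot with pandas" is true in the sense that the call is made on a pandas object, while the figure itself is produced by matplotlib.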

TristanH · 17 February 2021 18:41

Module 1.2

Exercise M1.01

file: 02_numerical_pipeline_ex_01.ipynb

So 81% accuracy is significantly better than 76%

“Significantly” made me think about statistical significance, but that is obviously not the question here.

“Preprocessing for numerical features”

file: 02_numerical_pipeline_scaling.ipynb

let’s charge the full adult census dataset

Isn’t “charge” a Gallicism? I would have said “load”.

the predictive performance (accuracy) slightly improved

Well, true, but it is not visible with three significant digits (nor with four; we need five to see a difference): 0.807 is printed for both models.
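
A small sketch of the rounding issue, using two hypothetical scores (not the actual values from the notebook) that differ only in the fifth decimal:

```python
# Hypothetical accuracies, chosen only to illustrate the rounding effect.
score_without_scaling = 0.80712
score_with_scaling = 0.80714

for digits in (3, 4, 5):
    before = f"{score_without_scaling:.{digits}f}"
    after = f"{score_with_scaling:.{digits}f}"
    # The two strings are identical until enough decimals are printed.
    print(f"{digits} decimals: {before} vs {after}")
```

With such values, the improvement only becomes visible once five decimals are printed.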

Quiz 03

Question 1. There is a “d)” but there is no “c)”.

Question 5. “trained estimators”: I don’t remember the course specifying that a fitted model can be called an “estimator”; for people unfamiliar with this notion, it may be unclear. It is called “estimator instance” in the scikit-learn glossary, if I’m not mistaken.

Question 5 & 6. “a)” is missing.

Fixed

TristanH · 17 February 2021 18:42

Module 1.3

The option drop="if_binary" is used in 03_categorical_pipeline.ipynb but explained only in the following notebook 03_categorical_pipeline_column_transformer.ipynb.
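
For readers who meet `drop="if_binary"` before it is explained, here is a minimal sketch of its effect. The toy data below is hypothetical. The encoder keeps a single column for a feature with exactly two categories and one column per category otherwise.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: a binary column and a three-category column.
X = np.array(
    [
        ["Male", "Private"],
        ["Female", "State-gov"],
        ["Female", "Self-emp"],
        ["Male", "Private"],
    ],
    dtype=object,
)

encoder = OneHotEncoder(drop="if_binary")
# The default output is a sparse matrix; convert it to a dense array to inspect it.
X_encoded = encoder.fit_transform(X).toarray()

print(encoder.categories_)
print(X_encoded.shape)  # 1 column for the binary feature + 3 for the other one
```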

file: 03_categorical_pipeline.ipynb

set the parameter handle_unknown="ignore"

It would be clearer to specify that the parameter has to be set on OneHotEncoder(). Otherwise it is not self-evident (one could think it would be up to cross_val_score() or LogisticRegression() to handle this).
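
As an illustration of where the parameter goes, here is a minimal sketch with hypothetical toy data: `handle_unknown="ignore"` is passed to `OneHotEncoder`, not to the classifier, and a category unseen during `fit` is then encoded as a row of zeros instead of raising an error.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; "Never-worked" does not appear in the training set.
X_train = np.array([["Private"], ["State-gov"], ["Private"]], dtype=object)
y_train = np.array([0, 1, 0])
X_test = np.array([["Never-worked"]], dtype=object)

# The parameter is set on the encoder itself.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)
encoded = encoder.transform(X_test).toarray()
print(encoded)  # the unseen category becomes an all-zero row

# The same encoder inside a pipeline: prediction no longer fails on unseen categories.
model = make_pipeline(OneHotEncoder(handle_unknown="ignore"), LogisticRegression())
model.fit(X_train, y_train)
prediction = model.predict(X_test)
```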

file: 03_categorical_pipeline_ex_02.ipynb

The hint about using sparse=False in OneHotEncoder() is mentioned at both the beginning and the end of the file.

MarieCollin · 18 February 2021 08:10

glemaitre · 19 February 2021 10:46

You are right. It is an inconsistency that was introduced during our review process.
I have since proposed the following changes, which might be helpful:

The data are stored in a pandas dataframe. A dataframe is a type of structured
data composed of 2 dimensions. This type of data is also referred to as tabular
data.

Each row represents a record. In the field of machine learning or descriptive
statistics, the terms commonly used to refer to rows are "sample",
"instance", or "observation".

Each column represents a type of information collected. In the field of
machine learning and descriptive statistics, the terms commonly used to
refer to columns are "feature", "variable", "attribute", or "covariate".
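
The wording above could be accompanied by a small sketch like the following. The records are hypothetical and only loosely modelled on the kind of columns found in a census dataset.

```python
import pandas as pd

# Hypothetical records, for illustration only.
df = pd.DataFrame(
    {
        "age": [25, 38, 28],
        "education": ["11th", "HS-grad", "Assoc-acdm"],
        "hours-per-week": [40, 50, 40],
    }
)

# A dataframe has 2 dimensions: rows (samples) and columns (features).
n_samples, n_features = df.shape
print(f"{n_samples} samples, {n_features} features")

print(df.columns.tolist())  # the named columns, i.e. the features
print(df.iloc[0])           # one row, i.e. one sample / record
```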

glemaitre · 19 February 2021 10:54

In some way, we introduce the first module by stating that knowledge of NumPy and pandas is expected:

https://inria.github.io/scikit-learn-mooc/predictive_modeling_pipeline/predictive_modeling_module_intro.html#before-getting-started

glemaitre · 19 February 2021 10:57

For Module 1.2, all those points need to be corrected. Thank you for pointing them out.

khinsen · 8 April 2021 07:04

A few typos in the first notebook (one image per comment, to avoid Discourse complaining):

Fixed

khinsen · 8 April 2021 07:04

Fixed

khinsen · 8 April 2021 07:05

Fixed the latter.
Graph not fixed: the problem seems to come from the PNG.

khinsen · 8 April 2021 07:07

Two typos in the first exercise notebook:

Fixed

khinsen · 8 April 2021 07:08

Fixed in FUN

khinsen · 8 April 2021 14:55

One more:

Fixed

lesteve · 26 April 2021 08:35

A post was split to a new topic: Fix predictor.fit diagram

lesteve · 8 April 2021 16:37

I have started to fix some of them and edit each post when I do (or when there is still something left to do on FUN).

For now this seems OK enough, but we can certainly look into fancier Discourse features to do it in the future (e.g. post splitting or something like this, i.e. taking a post and moving it to its own discussion).

MarieCollin · 9 April 2021 12:51

Done in FUN

lesteve · 26 April 2021 08:40

All of this has been fixed. I moved the diagram fix to Fix predictor.fit diagram

lfarhi · 10 May 2021 16:29
