Typos, Phrasing & Other formal remarks (Module 1)

Hi, I am opening this topic to discuss and report a few things I noticed or found unclear in the course documents for the “Module 1. The Predictive Modeling Pipeline” section.

Feel free to participate or to contradict me!


Module 1.1

Notebook

file: 01_tabular_data_exploration.ipynb

  • Note: Data is called tabular when it has a named column.

    This definition is not crystal clear to me. “A” named column? More like “at least one” named column, or even “named columns” (or “when it has the shape of a table”).

  • if you are young (less than 25 year-old roughly) or old (more than 70 year-old roughly) you tend to work less.

    This is not very clear from the plot, since it is hard to judge the density of the point cloud, even though the statement is true. We could also say that people with higher values are between 25 and 75 years old (this is easier to see).

  • ../figures/simple_decision_tree_adult_census.png is a little bit cropped on the bottom.

Quiz 02

Question 1. Saying that we can plot with pandas is a bit troublesome, as 1) it is not demonstrated in the notebook (and we cannot deduce this without external knowledge), and 2) if it requires matplotlib (or seaborn, likewise?), can we really say that pandas can plot? Maybe I am getting something wrong; it is also fine to learn things from quizzes, but this one seems a bit tricky to me.

Module 1.2

Exercise M1.01

file: 02_numerical_pipeline_ex_01.ipynb

So 81% accuracy is significantly better than 76%

“Significantly” made me think about statistical significance, but that is obviously not the point here.

“Preprocessing for numerical features”

file: 02_numerical_pipeline_scaling.ipynb

  • let’s charge the full adult census dataset

    Isn’t “charge” a Gallicism? I would have said “load”.

  • the predictive performance (accuracy) slightly improved

    Well, true, but it is not visible with three significant digits (nor with four; we need five to see a difference): 0.807 is printed for both models.
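
    To illustrate the rounding point, here is a minimal sketch. The two accuracy values are made up for the example (they are not the notebook's actual scores) and the helper name `shown` is hypothetical:

```python
# Hypothetical accuracies, chosen only to illustrate the rounding issue;
# these are NOT the actual scores printed in the notebook.
score_without_scaling = 0.80712
score_with_scaling = 0.80714


def shown(score, digits):
    """Return the score as it would be printed with `digits` decimals."""
    return f"{score:.{digits}f}"


for digits in (3, 4, 5):
    a = shown(score_without_scaling, digits)
    b = shown(score_with_scaling, digits)
    print(f"{digits} decimals: {a} vs {b} (distinguishable: {a != b})")
```

    With values like these, the two scores only become distinguishable once five decimals are printed, so the text could either print more digits or avoid claiming a visible improvement.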

Quiz 03

Question 1. There is a “d)” but there is no “c)”.

Question 5. “trained estimators”: I don’t remember the course specifying that a fitted model can be called an “estimator”; for people unfamiliar with this notion, it may be unclear. It is called “estimator instance” in the scikit-learn glossary, if I’m not mistaken.

Question 5 & 6. “a)” is missing.

Fixed

Module 1.3

The option drop="if_binary" is used in 03_categorical_pipeline.ipynb but explained only in the following notebook 03_categorical_pipeline_column_transformer.ipynb.
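
For readers who meet the option before it is explained, here is a minimal sketch of what drop="if_binary" does, assuming a small made-up dataset (the category values below are only illustrative): a column with exactly two categories is encoded as a single column, while columns with more categories keep one column per category.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): the first column is binary,
# the second column has three categories.
X = np.array(
    [
        ["Male", "Private"],
        ["Female", "State-gov"],
        ["Female", "Self-emp"],
    ]
)

encoder = OneHotEncoder(drop="if_binary")
# The encoder returns a sparse matrix by default; convert it for display.
encoded = encoder.fit_transform(X).toarray()

print(encoder.categories_)
print(encoded.shape)  # 1 column for the binary feature + 3 for the other
print(encoded)
```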


file: 03_categorical_pipeline.ipynb

set the parameter handle_unknown="ignore"

It would be clearer to specify that the parameter has to be set on OneHotEncoder(). Otherwise it is not self-evident (one could think it is up to cross_val_score() or LogisticRegression() to handle this).
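
A minimal sketch of where the parameter goes, assuming a toy single-column dataset (the data and variable names are made up for illustration): handle_unknown="ignore" is an argument of OneHotEncoder, and it controls what happens when transform meets a category that was not seen during fit.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): one categorical column.
X_train = np.array([["Private"], ["State-gov"], ["Private"]])
# "Never-worked" does not appear in the training data.
X_test = np.array([["Never-worked"], ["Private"]])

# Default behaviour: an unknown category raises an error at transform time.
strict_encoder = OneHotEncoder().fit(X_train)
try:
    strict_encoder.transform(X_test)
    default_raised = False
except ValueError:
    default_raised = True
print("default encoder raised ValueError:", default_raised)

# With handle_unknown="ignore", the unknown category is encoded as all zeros.
tolerant_encoder = OneHotEncoder(handle_unknown="ignore").fit(X_train)
encoded = tolerant_encoder.transform(X_test).toarray()
print(encoded)
```

When the encoder is one step of a pipeline that is passed to cross_val_score(), the parameter is still set on the encoder itself.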


file: 03_categorical_pipeline_ex_02.ipynb

The hint about using sparse=False in OneHotEncoder() is mentioned both at the beginning and at the end of the file.

You are right. It is an inconsistency that was introduced during our review process.
I have since proposed the following changes, which might be helpful:

The data are stored in a pandas dataframe. A dataframe is a type of structured
data composed of 2 dimensions. This type of data is also referred to as tabular
data.

Each row represents a record. In the field of machine learning or descriptive
statistics, the terms commonly used to refer to rows are "sample",
"instance", or "observation".

Each column represents a type of information collected. In the field of
machine learning and descriptive statistics, the terms commonly used to
refer to columns are "feature", "variable", "attribute", or "covariate".

In a way, we introduce the first module by stating that knowledge of NumPy and pandas is expected:

https://inria.github.io/scikit-learn-mooc/predictive_modeling_pipeline/predictive_modeling_module_intro.html#before-getting-started

For Module 1.2, all those points need to be corrected. Thank you for pointing them out.

A few typos in the first notebook (one image per comment, to avoid Discourse complaining):

[image]

Fixed

[image]

Fixed

[image]

Fixed latter
Graph not fixed: the problem seems to come from the PNG.

Two typos in the first exercise notebook:

[image]

Fixed

[image]

Fixed in FUN

One more:
[image]

Fixed

A post was split to a new topic: Fix predictor.fit diagram

I have started to fix some of them and edit each post when I do (or when there is still something left to do on FUN).

For now this seems OK, but we could certainly look into fancier Discourse features to do this in the future (e.g. post splitting, i.e. taking a post and moving it to its own discussion).


Done in FUN

All of this has been fixed. I moved the diagram fix to Fix predictor.fit diagram