Hi, I open up this topic to discuss and report some things I noticed or found unclear in the course documents for the “Module 1. The Predictive Modeling Pipeline” section.
Feel free to participate or to contradict me!
Hi, I open up this topic to discuss and report some things I noticed or found unclear in the course documents for the “Module 1. The Predictive Modeling Pipeline” section.
Feel free to participate or to contradict me!
file: 01_tabular_data_exploration.ipynb
Note: Data is called tabular when it has a named column.
This definition does not sound crystal clear to me. “A” named column? More like “at least” a named column, or even “named columns” (or “when it has the shape of a table”).
if you are young (less than 25 year-old roughly) or old (more than 70 year-old roughly) you tend to work less.
We don’t see it very clearly since it is uneasy to appreciate the density of the points’ cloud, even if yes. We could also say that people with higher values are between 25 and 75 years (we see this more clearly).
../figures/simple_decision_tree_adult_census.png
is a little bit cropped on the bottom.
Question 1. Saying that we can plot with pandas
is a bit troublesome, as 1) it is not demonstrated in the notebook (and we cannot deduce this without external knowledge), 2) if it requires matplotlib
(or seaborn
in the same way?), can we really say it can? Maybe am I getting something wrong – it is also okay to learn things with quizzes but it seems to me a bit tricky here.
file: 02_numerical_pipeline_ex_01.ipynb
So 81% accuracy is significantly better than 76%
“Significantly” made me think about statistical signifiance but it’s obviously not the question here.
file: 02_numerical_pipeline_scaling.ipynb
let’s charge the full adult census dataset
Isn’t “charge” a Gallicism? I would have say “load”.
the predictive performance (accuracy) slightly improved
Well, true but it is not visible with tree significant digits (neither four, we need five to see a difference), for both models 0.807
is printed.
Question 1. There is a “d)” but there is no “c)”.
Question 5. “trained estimators” I don’t remember that the course precise that a fitted model can be called an “estimator” – for people unfamiliar with this notion, I may be unclear. It is called “estimator instance” in the SciKit Learn Glossary if I’m not mistaken.
Question 5 & 6. “a)
” is missing.
Fixed
The option drop="if_binary"
is used in 03_categorical_pipeline.ipynb
but explained only in the following notebook 03_categorical_pipeline_column_transformer.ipynb
.
file: 03_categorical_pipeline.ipynb
set the parameter
handle_unknown="ignore"
I would be clearer to specify we have to set the parameter for the OneHotEncoder()
function. Otherwise it is not self-evident (one could think I would be up to cross_val_score()
or LogisticRegression()
to handle this).
file: 03_categorical_pipeline_ex_02.ipynb
Hint about using sparse=False
in OneHotEncoder()
is mentionned both on the beginning and the end of the file.
You are right. It is an inconsistency that has been introduced during our review process.
Since I proposed the following changes that might be helpful:
The data are stored in a pandas dataframe. A dataframe is type of structured
data composed of 2 dimensions. This type of data are also referred as tabular
data.
The rows represents a record. In the field of machine learning or descriptive
statistics, the terms commonly used to refer to rows are "sample",
"instance", or "observation".
The columns represents a type of information collected. In the field of
machined learning and descriptive statistics, the terms commonly used to
refer to columns are "feature", "variable", "attribute", or "covariate".
In some way, we introduce the first module by stating that knowledge in NumPy and Pandas are expected:
For Module 1.2, all those points need to be corrected. Thank you for pointing out.
A few typos in the first notebook (one image per comment, to avoid Discourse complaining):
Fixed
Fixed
Fixed latter
Not Fixed graph the problem seems to come from the png.
Two typos in the first exercice notebook:
Fixed
Fixed in FUN
One more:
Fixed
I have started to fix some of them and edit each post when I do (or when there is still something left to do on FUN).
For now this seems OK enough but we can certainly look more fancy Discourse feature to do it in the future (e.g. post splitting or something like this, i.e. take a post and move it to its own discussion)
Done in FUN