First model with scikit-learn

In this part, I thought the training and testing sets were not clearly defined. Perhaps it would be better to:

  • start with the same dataset as in the previous part (containing numerical and categorical features)
  • select only the numerical features
  • train and test a first model on the full dataset
  • explain the limits of testing the model on the data it was trained on
  • split the original dataset into two parts: a train set and a test set
  • train the model again, and compare the results
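
To make the suggestion concrete, here is a minimal sketch of those steps. Everything in it is an assumption for illustration: the synthetic data, the column names (`age`, `hours_per_week`, `workclass`), the noisy target and the choice of a 1-nearest-neighbour classifier are mine and are not taken from the course notebooks.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical dataset mixing numerical and categorical features.
rng = np.random.default_rng(0)
n_samples = 500
data = pd.DataFrame(
    {
        "age": rng.uniform(18, 80, size=n_samples),
        "hours_per_week": rng.uniform(10, 60, size=n_samples),
        "workclass": rng.choice(["private", "public"], size=n_samples),
    }
)
# Hypothetical noisy binary target, so the model cannot generalize perfectly.
noise = rng.normal(0, 15, size=n_samples)
target = ((data["age"] + data["hours_per_week"] + noise) > 85).astype(int)

# Keep only the numerical features.
numerical = data.select_dtypes(include="number")

# Train and test a first model on the full dataset.
model = KNeighborsClassifier(n_neighbors=1)
model.fit(numerical, target)
score_on_training_data = model.score(numerical, target)

# Split into a train set and a test set, train again, and compare.
X_train, X_test, y_train, y_test = train_test_split(
    numerical, target, random_state=0
)
model.fit(X_train, y_train)
score_on_test_data = model.score(X_test, y_test)

print(f"Accuracy on the data used for training: {score_on_training_data:.3f}")
print(f"Accuracy on held-out test data: {score_on_test_data:.3f}")
```

With a single neighbour, each training sample is its own nearest neighbour, so the first score only shows that the model memorized the data. The second score is the one that says something about generalization, which is the limit the fourth step is meant to expose.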

Same thought for exercise M1.02: it could be clearer that we train the model on the first dataset, then test it on the second one.

The choice to have two different CSVs here with already extracted numerical features is for simplicity (we don’t have to select the numerical columns with pandas and we don’t have to talk about train_test_split).

I am going to say this is good enough and mark this as solved. The launch date is approaching and we have to make choices about what to prioritise :wink:.

Fair enough… I agree this is not a very pressing issue.