First model with scikit-learn

pasquet_syl · 22 April 2021 08:17

In this part, I thought the training and testing sets were not clearly defined. Perhaps it would be better to:

start with the same dataset as in the previous part (containing numerical and categorical features)
select only the numerical features
train and test a first model on the full dataset
explain the limits of testing the model on the data it was trained on
split the original dataset into two parts: a train set and a test set
train the model again on the train set, test it on the test set, and compare the results
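
The steps above could be sketched roughly as follows. This is only an illustration, not the course's notebook: the data is synthetic, the column names (`age`, `hours_per_week`, `education`) and the target are made up, and `n_neighbors=1` is chosen on purpose so that the score on the training data is trivially perfect.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a dataset mixing numerical and categorical features
# (hypothetical columns, not the course's CSV files).
rng = np.random.default_rng(0)
n_samples = 200
data = pd.DataFrame({
    "age": rng.uniform(18, 80, size=n_samples),
    "hours_per_week": rng.uniform(10, 60, size=n_samples),
    "education": rng.choice(["HS", "Bachelors", "Masters"], size=n_samples),
})
# Made-up noisy binary target, loosely related to the numerical columns.
noise = rng.normal(0, 15, size=n_samples)
target = (data["age"] + data["hours_per_week"] + noise > 85).astype(int)

# Keep only the numerical features.
data_numeric = data.select_dtypes(include="number")

# First model: train and test on the full dataset.
# With a single neighbor, each sample is its own nearest neighbor, so this
# score says nothing about how the model behaves on new data.
model = KNeighborsClassifier(n_neighbors=1)
model.fit(data_numeric, target)
score_on_training_data = model.score(data_numeric, target)

# Split the original dataset into a train set and a test set.
data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, target, test_size=0.25, random_state=42
)

# Model again: fit on the train set only, evaluate on the held-out test set.
model = KNeighborsClassifier(n_neighbors=1)
model.fit(data_train, target_train)
score_on_test_set = model.score(data_test, target_test)

print(f"Accuracy on the data used for training: {score_on_training_data:.3f}")
print(f"Accuracy on the held-out test set:      {score_on_test_set:.3f}")
```

Comparing the two printed numbers is the point of the last step: the first one is optimistic by construction, while the second is measured on samples the model never saw.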

pasquet_syl · 22 April 2021 08:54

Same thought for exercise M1.02: it could be made clearer that we train the model on the first dataset, then test it on the second one.

lesteve · 29 April 2021 13:33

The choice to have two different CSVs here with already extracted numerical features is for simplicity (we don’t have to select the numerical columns with pandas and we don’t have to talk about train_test_split).

I am going to say this is good enough and mark this as solved. The launch date is approaching and we have to make choices about what to prioritise.

pasquet_syl · 29 April 2021 13:47

Fair enough… I agree this is not a very pressing issue.

lfarhi · 10 May 2021 15:39