M7-Conclusion - Best Workflow

Hi again,

I am done with this course… thanks a lot.

I find this last module very interesting, but for beginners like me it raises some basic questions again…

We are focusing on model evaluation by trying different CV schemes. Nevertheless, the end objective is to use the model on fresh data, so: what is the best workflow to adopt?

A) I perform model selection & tuning on the full dataset. When I believe my model is OK and the parameters are defined (grid search, cross-validation), I then split the same full data into train / test, fit the model on train, and evaluate on the test samples.

or

B) I split the data into train / test sets. I perform model selection & tuning on the train set (which will then be split again by CV & grid search). When the model is defined, do I take the best model and predict directly on the test set for the final evaluation? In that case, do I need to retrain the model on the full train set, or do I just take the best model and predict on the test set directly? It is still unclear to me how to use the “best model” directly to predict on the test samples (something like GridSearchCV.best_model.predict(TestSet)…?).

Thanks again!

Thank you for your kind words and sorry for taking so long to answer. I think the answer is closer to B):

  • Split data on train / test set
  • Perform model selection & tuning on a sub-sample of the training set (namely the validation set) which can be obtained with CV, for instance
  • Score your model on the test set
  • When you want to make predictions beyond your present dataset, you may need to use your full dataset for training with the model that performed the best (see the sketch below)
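
To make this concrete, here is a minimal sketch of that workflow. It assumes scikit-learn’s built-in breast-cancer dataset and a logistic regression pipeline, which are just illustrative choices on my part, not part of the MOOC exercise:

```python
# Minimal sketch of workflow B (illustrative dataset, model and grid).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# 1) Split once into train / test; keep the test set aside until the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2) Model selection & tuning on the training set only: GridSearchCV runs the
#    inner cross-validation (the "validation" splits) for us.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(model, param_grid, cv=5)
search.fit(X_train, y_train)

# 3) Final evaluation on the untouched test set.
print("test accuracy:", search.score(X_test, y_test))

# 4) For predictions on truly new data, refit the selected model on ALL the data.
final_model = search.best_estimator_.fit(X, y)
```

Note that `GridSearchCV` refits the best hyper-parameters on the whole training set by default (`refit=True`), so you can call `search.predict(X_test)` or `search.score(X_test, y_test)` directly; the underlying attribute is `search.best_estimator_` rather than `best_model`.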

I hope that makes things clearer!


@ArturoAmorQ,

I am glad to see another student/professional has the same question as me. Can you provide us with an example for the last part (“When you want to make predictions beyond your present dataset, you may need to use your full dataset for training with the model that performed the best”)? I got your explanation, but this last part is still tricky for me; in this kind of situation I understand better with examples.

I am sorry if I was misleading. The last step applies to real-life cases (beyond the didactic goal of the MOOC) where you actually deploy your model to forecast new data.

Example: you have a dataset of 1000 cancer patients and you want to build a classifier to find the features that are most relevant for predicting whether a tumor is benign or malignant. Then the steps to follow are:

  1. Split the data into a train set (let’s say 800 samples) and a test set (200 samples)

  2. Perform model selection & tuning on a sub-sample of the training set (ignore the test set for the moment). Using 10-fold CV, that means dividing your train set into 10 subsets of 80 samples to score your parameters.

  3. Score your model with the selected parameters on the test set. Change the classifier if needed (try KNN or SVC, for instance) and repeat step 2.

  4. Once you have found the best-performing model, you will want to train it again, but using the whole 1000 samples, so you can determine whether a new patient arriving at the hospital (this would be the 1001st data point) is more likely to have a malignant tumor or not (see the sketch below).
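
Here is a rough sketch of those four steps, using scikit-learn’s breast-cancer dataset as a stand-in for the 1000-patient example; the candidate models and parameter grids are only illustrative:

```python
# Rough sketch of steps 1-4 (illustrative dataset, candidates and grids).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Step 1: train / test split (roughly the 800 / 200 idea).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Steps 2-3: for each candidate classifier, tune on the train set with 10-fold
# CV, then score the tuned model on the held-out test set.
candidates = {
    "knn": (
        make_pipeline(StandardScaler(), KNeighborsClassifier()),
        {"kneighborsclassifier__n_neighbors": [3, 5, 11]},
    ),
    "svc": (
        make_pipeline(StandardScaler(), SVC()),
        {"svc__C": [0.1, 1, 10]},
    ),
}

best_search, best_score = None, -1.0
for name, (pipe, grid) in candidates.items():
    search = GridSearchCV(pipe, grid, cv=10).fit(X_train, y_train)
    test_score = search.score(X_test, y_test)
    print(f"{name}: test accuracy = {test_score:.3f}")
    if test_score > best_score:
        best_search, best_score = search, test_score

# Step 4: retrain the winning model on ALL available samples before deployment,
# so predictions for the "1001st patient" use every data point we have.
final_model = best_search.best_estimator_.fit(X, y)
# final_model.predict(new_patient_features) would then classify a new patient.
```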

Remember that the more data you have, the more accurate your predictions will be. So for small and mid-sized datasets you would rather not waste those 200 data points used just for scoring, just as you would not throw away the validation set at the end of step 2.

Let me know if I was clear enough this time!


Hey man,
Much clearer this time, and I have one more question about it; sorry for asking so many questions in this thread. I never got an explanation this clear before. Anyway, since this is for real life, as you mentioned: when you say score your model on the test set, that means using cross_validate, cross_val_score, or any scoring metric available in sklearn, right?

I am so mad, because it is not that difficult to grasp; yet most of the lectures provided by some trainers, at least here in Brazil, are not clear enough.

By the way, explained this way it is much easier to grasp the concept and application of nested cross-validation.

Thanks again for your time.


Yes, any scoring metric that suits the problem.
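
For instance, something along these lines; it assumes the fitted `search`, `X_test` and `y_test` from the sketches above, and the metrics shown are only examples:

```python
# Illustrative scoring of the final model on the held-out test set.
from sklearn.metrics import balanced_accuracy_score, classification_report

y_pred = search.predict(X_test)
print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```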
