Inquiry about machine learning

Hi, so far I have learnt many concepts and done some programming. However, we have not learned about classification error, nor the four types of true/false positive/negative outcomes (the entries of the confusion matrix). Will this be covered in Module 5: Decision tree?

Lastly, I would like to confirm the steps for building a machine learning model in practice (a rough code sketch follows the list):

  1. Obtain the data and the target column
  2. Identify the numerical & categorical variables using a selector
  3. Build the preprocessor (a column transformer)
  4. Build the model using a pipeline
  5. Split the data into training & testing sets
  6. Use cross_validate
  7. Obtain the test score and the best parameters
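
Here is a minimal sketch of those seven steps. It assumes a pandas DataFrame `df` loaded elsewhere, with a target column named "target"; logistic regression is an arbitrary placeholder for whichever algorithm you actually use:

```python
# Minimal sketch of steps 1-7; `df` is a placeholder DataFrame with a
# target column named "target", and LogisticRegression is an arbitrary
# choice of algorithm.
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_validate

# Step 1: obtain the data and the target column
X, y = df.drop(columns="target"), df["target"]

# Step 2: identify numerical & categorical variables with selectors
numerical = make_column_selector(dtype_include="number")
categorical = make_column_selector(dtype_include=object)

# Step 3: build the preprocessor (a column transformer)
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numerical),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Step 4: build the model using a pipeline
model = make_pipeline(preprocessor, LogisticRegression())

# Step 5: split the data into training & testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 6: cross-validate on the training set
cv_results = cross_validate(model, X_train, y_train, cv=5)

# Step 7: look at the scores
print("mean CV score:", cv_results["test_score"].mean())
```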

However, after step 7, do we take the best parameters, plug them back into the algorithm used in our model from step 4, and then fit and predict to get the classification/regression error?

Have I missed any steps, or got any concepts wrong?

An important additional step, step 8, is to analyse the best parameters across the outer cross-validation (see the sketch after the options below). At this stage, we have several choices:

  • one could, as you mentioned, refit a model with the best parameters if they were stable. However, in this case you should be aware that the score obtained during the previous cross-validation might not be the most accurate estimate of the performance of your model, since you are refitting on the full dataset;
  • use all the best models from the outer cross-validation and make an ensemble of them without any refit. In this case, you can expect the ensemble to perform as well as the models did during the cross-validation.
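
As a sketch of step 8, here is one way to run a nested cross-validation and inspect the best parameters found on each outer fold; `model`, `X_train`, and `y_train` come from the earlier sketch, and the parameter grid is a placeholder:

```python
# Hedged sketch of step 8: nested cross-validation, keeping each outer
# fold's fitted search so the selected hyper-parameters can be compared.
from sklearn.model_selection import GridSearchCV, cross_validate

param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}  # placeholder grid
inner_search = GridSearchCV(model, param_grid, cv=5)

# Outer cross-validation; return_estimator=True keeps the fitted searches
outer_results = cross_validate(
    inner_search, X_train, y_train, cv=5, return_estimator=True
)

# Check how stable the best parameters are across the outer folds
for search in outer_results["estimator"]:
    print(search.best_params_)
print("outer test scores:", outer_results["test_score"])
```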

Regarding your two suggestions:
Option 1 will lead to overfitting.
Option 2: when making an ensemble model without any refit, what does "refit" mean here? Does it mean fitting using X_train and y_train?

Considering overfitting, and if we are not building an ensemble model, at which step should I compute the prediction error (classification/regression)?

Why? There is no reason for the model to overfit more than the models tried during cross-validation did. However, if you want to evaluate the model, you need to have kept a test set somewhere.

The model was fitted on a train set in the inner cross-validation.

I am not sure what you mean by overfitting here. Basically, to get a prediction error, you need to run a new cross-validation.
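
For instance, a brief sketch of such a fresh cross-validation, reusing `model`, `X_train`, and `y_train` from the earlier snippets (accuracy is an arbitrary choice of metric here):

```python
# New cross-validation to estimate the prediction error of the
# (classification) model; 1 - accuracy serves as the error rate.
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, X_train, y_train, cv=5, scoring="accuracy")
print("estimated prediction error:", 1 - cv_results["test_score"].mean())
```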

Oh, so we need to hold out a test set if we use option 1.

Actually, after applying cross_validate (step 6), we can use our model to fit and predict, then calculate performance metrics (MAE, F1-score, accuracy, etc.). Is calculating test_score important? I do not see it in articles about building ML models. Just to clarify.
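
As a minimal sketch of that fit/predict/metric step, assuming the split and model from the earlier snippets (accuracy shown; swap in MAE for regression):

```python
# Fit on the training set, predict on the held-out test set,
# and compute a performance metric on the predictions.
from sklearn.metrics import accuracy_score

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("test accuracy:", accuracy_score(y_test, y_pred))
```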