Neg_mean_absolute_error in validation/learning curve

Hi!
When doing the exercise, I have tried to use:
results = learning_curve(
    regressor, data, target, train_sizes=train_sizes, cv=cv,
    scoring="neg_mean_absolute_error", n_jobs=2)

This doesn't work: it returns vectors filled with nan values. If I remove the scoring parameter, it works.
I am sure this is a basic question, but I’d like to know why we cannot use this parameter.

Thank you in advance for the response. I'm enjoying the course very much so far :wink:

Indeed, it is not straightforward. The call that you wrote is correct, so we will need the code that creates each parameter you pass to learning_curve in order to find the cause of the problem.

The only change is the scoring parameter. The following code gives all scores as nan unless we remove the line scoring="neg_mean_absolute_error".

import pandas as pd

blood_transfusion = pd.read_csv("../datasets/blood_transfusion.csv")
data = blood_transfusion.drop(columns="Class")
target = blood_transfusion["Class"]

from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

escalador = StandardScaler()
predictor = SVC(kernel="rbf")
modelo = make_pipeline(escalador, predictor)

from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_validate
import numpy as np

cv = ShuffleSplit(random_state=0)
cv_results = cross_validate(modelo, data, target, cv=cv, n_jobs=-1)

from sklearn.model_selection import validation_curve 

gama_de_gammas = np.logspace(-3, 2, num=30)
train_scores, test_scores = validation_curve(
    modelo, data, target, param_name="svc__gamma", param_range=gama_de_gammas,
    scoring="neg_mean_absolute_error", ### THAT DOESN'T WORK
    cv=cv, n_jobs=-1)
train_scores, test_scores

The way to debug it is to pass error_score="raise" when calling cross_validate or validation_curve. As a minimal sketch, reusing the objects defined above:
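# Same call as before, but with error_score="raise" so that the underlying
# exception is raised instead of being silently turned into nan scores.
train_scores, test_scores = validation_curve(
    modelo, data, target, param_name="svc__gamma", param_range=gama_de_gammas,
    scoring="neg_mean_absolute_error",
    cv=cv, n_jobs=-1, error_score="raise")

It will raise the following error: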

ValueError                                Traceback (most recent call last)
<ipython-input-2-3f47846e2c6a> in <module>
     16 
     17 gama_de_gammas=np.logspace(-3,2,num=30)
---> 18 train_scores, test_scores = validation_curve(
     19     modelo, data, target, param_name="svc__gamma", param_range=gama_de_gammas,
     20     scoring="neg_mean_absolute_error", ### THAT DOESN'T WORK

~/Documents/packages/scikit-learn/sklearn/model_selection/_validation.py in validation_curve(estimator, X, y, param_name, param_range, groups, cv, scoring, n_jobs, pre_dispatch, verbose, error_score, fit_params)
   1640     parallel = Parallel(n_jobs=n_jobs, pre_dispatch=pre_dispatch,
   1641                         verbose=verbose)
-> 1642     results = parallel(delayed(_fit_and_score)(
   1643         clone(estimator), X, y, scorer, train, test, verbose,
   1644         parameters={param_name: v}, fit_params=fit_params,

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
   1040 
   1041             with self._backend.retrieval_context():
-> 1042                 self.retrieve()
   1043             # Make sure that we get a last message telling us we are done
   1044             elapsed_time = time.time() - self._start_time

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/parallel.py in retrieve(self)
    919             try:
    920                 if getattr(self._backend, 'supports_timeout', False):
--> 921                     self._output.extend(job.get(timeout=self.timeout))
    922                 else:
    923                     self._output.extend(job.get())

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    540         AsyncResults.get from multiprocessing."""
    541         try:
--> 542             return future.result(timeout=timeout)
    543         except CfTimeoutError as e:
    544             raise TimeoutError from e

~/miniconda3/envs/dev/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433 
    434             self._condition.wait(timeout)

~/miniconda3/envs/dev/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
    386     def __get_result(self):
    387         if self._exception:
--> 388             raise self._exception
    389         else:
    390             return self._result

ValueError: could not convert string to float: 'not donated'

It might be a bug in scikit-learn; I will need to debug a bit more. However, you can bypass this error by encoding the target after loading the data:

from sklearn.preprocessing import LabelEncoder
...
target = LabelEncoder().fit_transform(target)

The code will then work.

Argh, I did not see it before. You are trying to use the mean absolute error, which is a regression metric, on a classification problem. The error message in scikit-learn is not informative, but I should have been slightly more careful :slight_smile:

Bottom line: you need to use a classification metric for a classification problem. You can check the available metrics here: 3.3. Metrics and scoring: quantifying the quality of predictions — scikit-learn 0.24.2 documentation
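For instance, here is a minimal sketch of the same validation curve with a classification scoring string such as "balanced_accuracy" (the choice of metric is only an illustration; any classification metric from that page would do):

# Same call as before, but with a classification metric; string labels such as
# "not donated" are handled directly, so no LabelEncoder is needed.
train_scores, test_scores = validation_curve(
    modelo, data, target, param_name="svc__gamma", param_range=gama_de_gammas,
    scoring="balanced_accuracy",
    cv=cv, n_jobs=-1)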


Understood. Thank you very much!