Neg_mean_absolute_error in validation/learning curve

Hi!
When doing the exercise, I have tried to use:
results = learning_curve(
    regressor, data, target, train_sizes=train_sizes, cv=cv,
    scoring="neg_mean_absolute_error", n_jobs=2)

This doesn't work: it returns vectors filled with nan values. If I remove the scoring parameter, it works.
I am sure this is a basic question, but I’d like to know why we cannot use this parameter.

Thank you in advance for the response. I'm enjoying the course very much so far :wink:

Indeed, it is not straightforward. The call that you wrote is correct, so we will need the code that creates each parameter you pass to learning_curve in order to find the cause of the problem.

The only change is the scoring parameter. The following code gives all scores as nan unless we remove the line scoring="neg_mean_absolute_error".

import pandas as pd

blood_transfusion = pd.read_csv("../datasets/blood_transfusion.csv")
data = blood_transfusion.drop(columns="Class")
target = blood_transfusion["Class"]

from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

escalador = StandardScaler()
predictor = SVC(kernel="rbf")
modelo = make_pipeline(escalador, predictor)

from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_validate
import numpy as np

cv = ShuffleSplit(random_state=0)
cv_results = cross_validate(modelo, data, target, cv=cv, n_jobs=-1)

from sklearn.model_selection import validation_curve 

gama_de_gammas = np.logspace(-3, 2, num=30)
train_scores, test_scores = validation_curve(
    modelo, data, target, param_name="svc__gamma", param_range=gama_de_gammas,
    scoring="neg_mean_absolute_error", ### THAT DOESN'T WORK
    cv=cv, n_jobs=-1)
train_scores, test_scores

The way to debug it is to pass error_score="raise" when calling cross_validate or validation_curve. As a minimal sketch, reusing the objects defined above:
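# Same call as before, but with error_score="raise" so that the underlying
# exception is raised instead of being silently turned into nan scores.
train_scores, test_scores = validation_curve(
    modelo, data, target, param_name="svc__gamma", param_range=gama_de_gammas,
    scoring="neg_mean_absolute_error",
    cv=cv, n_jobs=-1, error_score="raise")

It will raise the following error: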

ValueError                                Traceback (most recent call last)
<ipython-input-2-3f47846e2c6a> in <module>
     16 
     17 gama_de_gammas=np.logspace(-3,2,num=30)
---> 18 train_scores, test_scores = validation_curve(
     19     modelo, data, target, param_name="svc__gamma", param_range=gama_de_gammas,
     20     scoring="neg_mean_absolute_error", ### THAT DOESN'T WORK

~/Documents/packages/scikit-learn/sklearn/model_selection/_validation.py in validation_curve(estimator, X, y, param_name, param_range, groups, cv, scoring, n_jobs, pre_dispatch, verbose, error_score, fit_params)
   1640     parallel = Parallel(n_jobs=n_jobs, pre_dispatch=pre_dispatch,
   1641                         verbose=verbose)
-> 1642     results = parallel(delayed(_fit_and_score)(
   1643         clone(estimator), X, y, scorer, train, test, verbose,
   1644         parameters={param_name: v}, fit_params=fit_params,

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
   1040 
   1041             with self._backend.retrieval_context():
-> 1042                 self.retrieve()
   1043             # Make sure that we get a last message telling us we are done
   1044             elapsed_time = time.time() - self._start_time

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/parallel.py in retrieve(self)
    919             try:
    920                 if getattr(self._backend, 'supports_timeout', False):
--> 921                     self._output.extend(job.get(timeout=self.timeout))
    922                 else:
    923                     self._output.extend(job.get())

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    540         AsyncResults.get from multiprocessing."""
    541         try:
--> 542             return future.result(timeout=timeout)
    543         except CfTimeoutError as e:
    544             raise TimeoutError from e

~/miniconda3/envs/dev/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433 
    434             self._condition.wait(timeout)

~/miniconda3/envs/dev/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
    386     def __get_result(self):
    387         if self._exception:
--> 388             raise self._exception
    389         else:
    390             return self._result

ValueError: could not convert string to float: 'not donated'

It might be a bug in scikit-learn; I will need to debug a bit more. However, you can bypass this error by encoding the target after loading the data:

from sklearn.preprocessing import LabelEncoder
...
target = LabelEncoder().fit_transform(target)

The code will then work.

Argh, I did not see it before. You are trying to use the mean absolute error, which is a regression metric, on a classification problem. The error message in scikit-learn is not informative, but I should have been slightly more careful :slight_smile:

Bottom line: you need to use a classification metric for a classification problem. You can check the available metrics here: 3.3. Metrics and scoring: quantifying the quality of predictions — scikit-learn 0.24.2 documentation
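For instance, here is a minimal sketch of the same validation curve with a classification scoring string such as "balanced_accuracy" (the choice of metric is only an illustration; any classification metric from that page would do):

# Same call as before, but with a classification metric; string labels such as
# "not donated" are handled directly, so no LabelEncoder is needed.
train_scores, test_scores = validation_curve(
    modelo, data, target, param_name="svc__gamma", param_range=gama_de_gammas,
    scoring="balanced_accuracy",
    cv=cv, n_jobs=-1)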


Understood. Thank you very much!