Q10: fitting the model

Excuse me, I have tried to fit the model with this code:

cv = LeaveOneGroupOut()
train_indices, test_indices = list(cv.split(data, target, groups=groups))[0]

data_linear_model_train = data_linear_model.iloc[train_indices]
data_linear_model_test = data_linear_model.iloc[test_indices]

data_train = data.iloc[train_indices]
data_test = data.iloc[test_indices]

target_train = target.iloc[train_indices]
target_test = target.iloc[test_indices]

cv_results_linear_model = cross_validate(
    linear_model, data_linear_model_train, target_train, groups=groups, cv=cv,
    scoring="neg_mean_absolute_error", return_estimator=True,
    return_train_score=True, n_jobs=2)

cv_results_hgbdt = cross_validate(
    hgbdt, data_train, target_train, groups=groups, cv=cv,
    scoring="neg_mean_absolute_error", return_estimator=True,
    return_train_score=True, n_jobs=2)

but I obtained:

ValueError: Found input variables with inconsistent numbers of samples: [28032, 28032, 38254]

Is it possible I have omitted some information?
Thanks

Hi Idziej,
I think that for this question you should not use cross_validate but model.fit(...) followed by model.predict(...), where model is either your linear model or your HGBT model.
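For reference, a minimal sketch of that fit/predict approach (with synthetic stand-in data, since I don't have your dataset at hand; substitute your own train/test splits and estimators):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Toy stand-ins for data/target; in the exercise these come from the dataset
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Ridge()               # your linear model (or the HGBT model)
model.fit(X_train, y_train)   # fit on the training split
pred = model.predict(X_test)  # predict on the held-out split
print(mean_absolute_error(y_test, pred))
```

No cross-validation loop is involved: a single fit on the training split and a single prediction on the test split.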

You did not include the full traceback, so we do not have the line number of the statement raising the exception, but here is what I suspect is happening:

  • In line 2:
train_indices, test_indices = list(cv.split(data, target, groups=groups))[0]

the groups variable is an array of size 38254, and after the cv split, the variables train_indices and test_indices have 28032 and 10222 (38254 - 28032) elements respectively.

  • then in lines 5 and 6:
data_linear_model_train = data_linear_model.iloc[train_indices]
data_linear_model_test = data_linear_model.iloc[test_indices]

the dataframes data_linear_model_train and data_linear_model_test therefore also have 28032 and 10222 rows respectively.

  • similarly target_train and target_test have 28032 and 10222 elements respectively.

  • then the error is raised when calling:

cv_results_linear_model = cross_validate(
    linear_model, data_linear_model_train, target_train, groups=groups, cv=cv,
    scoring="neg_mean_absolute_error", return_estimator=True,
    return_train_score=True, n_jobs=2)

because groups still has 38254 elements while data_linear_model_train and target_train each have 28032 samples, hence the error message: input variables with inconsistent numbers of samples: [28032, 28032, 38254].

You probably need to define groups_train and groups_test if you want to do nested cross-validation after the main train/test split. But is this really what you want to do here?

Hello, here is the full traceback:

ValueError                                Traceback (most recent call last)
<ipython-input-11-fda564a52924> in <module>
----> 1 cv_results_linear_model = cross_validate(
      2     linear_model, data_linear_model_train, target_train, groups=groups, cv=cv,
      3     scoring="neg_mean_absolute_error", return_estimator=True,
      4     return_train_score=True, n_jobs=2)

/opt/conda/lib/python3.9/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
    233 
    234     """
--> 235     X, y, groups = indexable(X, y, groups)
    236 
    237     cv = check_cv(cv, y, classifier=is_classifier(estimator))

/opt/conda/lib/python3.9/site-packages/sklearn/utils/validation.py in indexable(*iterables)
    354     """
    355     result = [_make_indexable(X) for X in iterables]
--> 356     check_consistent_length(*result)
    357     return result
    358 

/opt/conda/lib/python3.9/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    317     uniques = np.unique(lengths)
    318     if len(uniques) > 1:
--> 319         raise ValueError("Found input variables with inconsistent numbers of"
    320                          " samples: %r" % [int(l) for l in lengths])
    321 

ValueError: Found input variables with inconsistent numbers of samples: [28032, 28032, 38254]

Good morning, I have also tried it:

I noted the different predictions, is that ok? Thanks

The error happens exactly where I predicted it would, which confirms that my analysis above is probably valid.

I noted the different predictions, is that ok?

The output of hgbdt.predict(data_test) is a numpy array without the time-based index, while target_test is a pandas.Series with the time-based index. What matters are the values held in the Series, not its index.

To convert target_test to a numpy array and ignore the time-based index, you can do target_test.to_numpy().
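A small self-contained illustration of that conversion (with a toy Series standing in for target_test):

```python
import pandas as pd

# A pandas Series with a time-based index, like target_test
s = pd.Series([1.0, 2.0, 3.0],
              index=pd.date_range("2021-01-01", periods=3, freq="D"))

values = s.to_numpy()  # plain numpy array, index dropped
print(values)
```

The resulting array holds only the values; the datetime index is discarded.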

If you are not comfortable with the differences between numpy and pandas, I really encourage you to take the time to carefully follow the references that we list as prerequisites of the MOOC:

Hello, fortunately, although I could not manage to run the cross-validation, I reached a result through model.fit() and model.predict() and managed to draw the scatter plots (I have no problems with numpy and pandas):
import seaborn as sns

ax = sns.scatterplot(x=target_test, y=linearpred,
                     color="black", alpha=0.5)
ax.set_title("Predict vs Target")

ax = sns.scatterplot(x=target_test, y=histpred,
                     color="black", alpha=0.5)
ax.set_title("Predict vs Target")

with result:


but I do not know how to include the results for the subset (the time slice from Q11) in the same plot:

So, my doubts are: how can I integrate the two plots into one scatter plot, and what does "smoother outputs" mean in Q11? Thanks

Hi Idziej,
Your results look quite similar to mine.
image
image

About the "smooth" output, it has been discussed here.
I think the question will be reformulated in the next session.
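On combining the two models in one scatter plot: one option is to draw both on the same axes with a label per model. A minimal sketch with synthetic stand-ins for target_test, linearpred, and histpred (here using matplotlib directly; seaborn's scatterplot accepts an ax= argument to the same effect):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the target and the two models' predictions
rng = np.random.RandomState(0)
target_test = rng.randn(100)
linearpred = target_test + rng.randn(100) * 0.5
histpred = target_test + rng.randn(100) * 0.2

fig, ax = plt.subplots()
ax.scatter(target_test, linearpred, alpha=0.5, label="linear model")
ax.scatter(target_test, histpred, alpha=0.5, label="HGBT")
ax.plot([-3, 3], [-3, 3], "k--", label="perfect prediction")
ax.set_xlabel("Target")
ax.set_ylabel("Prediction")
ax.legend()
ax.set_title("Predictions vs target")
```

Drawing both point clouds on the same ax object, each with its own label, produces a single scatter plot with a legend distinguishing the models.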