Q10: fitting the model

Excuse me, I have tried to fit the model with this code:

cv = LeaveOneGroupOut()
train_indices, test_indices = list(cv.split(data, target, groups=groups))[0]

data_linear_model_train = data_linear_model.iloc[train_indices]
data_linear_model_test = data_linear_model.iloc[test_indices]

data_train = data.iloc[train_indices]
data_test = data.iloc[test_indices]

target_train = target.iloc[train_indices]
target_test = target.iloc[test_indices]

cv_results_linear_model = cross_validate(
    linear_model, data_linear_model_train, target_train, groups=groups, cv=cv,
    scoring="neg_mean_absolute_error", return_estimator=True,
    return_train_score=True, n_jobs=2)

cv_results_hgbdt = cross_validate(
    hgbdt, data_train, target_train, groups=groups, cv=cv,
    scoring="neg_mean_absolute_error", return_estimator=True,
    return_train_score=True, n_jobs=2)

but I obtained:

ValueError: Found input variables with inconsistent numbers of samples: [28032, 28032, 38254]

Is it possible I have omitted some information?
Thanks

Hi Idziej,
I think that for this question you should not use cross_validate but model.fit(...) followed by model.predict(...), where model is either your linear model or your HGBT model.
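For reference, a minimal sketch of that fit/predict approach (with synthetic stand-in data, since I don't have your dataset at hand; substitute your own train/test splits and estimators):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Toy stand-ins for data/target; in the exercise these come from the dataset
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Ridge()               # your linear model (or the HGBT model)
model.fit(X_train, y_train)   # fit on the training split
pred = model.predict(X_test)  # predict on the held-out split
print(mean_absolute_error(y_test, pred))
```

No cross-validation loop is involved: a single fit on the training split and a single prediction on the test split.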

You did not include the full traceback, so we do not have the line number of the statement raising the exception, but here is what I suspect is happening:

  • In line 2:
train_indices, test_indices = list(cv.split(data, target, groups=groups))[0]

the groups variable is an array of size 38254, and after the cv split, the variables train_indices and test_indices have 28032 and 10222 (38254 - 28032) elements respectively.

  • then in lines 5 and 6:
data_linear_model_train = data_linear_model.iloc[train_indices]
data_linear_model_test = data_linear_model.iloc[test_indices]

the dataframes data_linear_model_train and data_linear_model_test therefore also have 28032 and 10222 rows respectively.

  • similarly target_train and target_test have 28032 and 10222 elements respectively.

  • then the error is raised when calling:

cv_results_linear_model = cross_validate(
    linear_model, data_linear_model_train, target_train, groups=groups, cv=cv,
    scoring="neg_mean_absolute_error", return_estimator=True,
    return_train_score=True, n_jobs=2)

because groups still has 38254 elements while data_linear_model_train and target_train each have 28032 samples, hence the error message: input variables with inconsistent numbers of samples: [28032, 28032, 38254].

You probably need to define groups_train and groups_test if you want to do nested cross-validation after the main train/test split. But is this really what you want to do here?

Hello, here is the full traceback:

ValueError                                Traceback (most recent call last)
<ipython-input-11-fda564a52924> in <module>
----> 1 cv_results_linear_model = cross_validate(
      2     linear_model, data_linear_model_train, target_train, groups=groups, cv=cv,
      3     scoring="neg_mean_absolute_error", return_estimator=True,
      4     return_train_score=True, n_jobs=2)

/opt/conda/lib/python3.9/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
    233 
    234     """
--> 235     X, y, groups = indexable(X, y, groups)
    236 
    237     cv = check_cv(cv, y, classifier=is_classifier(estimator))

/opt/conda/lib/python3.9/site-packages/sklearn/utils/validation.py in indexable(*iterables)
    354     """
    355     result = [_make_indexable(X) for X in iterables]
--> 356     check_consistent_length(*result)
    357     return result
    358 

/opt/conda/lib/python3.9/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    317     uniques = np.unique(lengths)
    318     if len(uniques) > 1:
--> 319         raise ValueError("Found input variables with inconsistent numbers of"
    320                          " samples: %r" % [int(l) for l in lengths])
    321 

ValueError: Found input variables with inconsistent numbers of samples: [28032, 28032, 38254]

Good morning, I have also tried it:

I noted the different predictions, is that ok? Thanks

The error happens exactly where I predicted it would, which confirms that my analysis above is probably valid.

I noted the different predictions, is that ok?

The output of hgbdt.predict(data_test) is a numpy array without the time-based index, while target_test is a pandas.Series with the time-based index. What matters are the values held in the Series, not its index.

To convert target_test to a numpy array and ignore the time-based index, you can do target_test.to_numpy().
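A small self-contained illustration of that conversion (with a toy Series standing in for target_test):

```python
import pandas as pd

# A pandas Series with a time-based index, like target_test
s = pd.Series([1.0, 2.0, 3.0],
              index=pd.date_range("2021-01-01", periods=3, freq="D"))

values = s.to_numpy()  # plain numpy array, index dropped
print(values)
```

The resulting array holds only the values; the datetime index is discarded.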

If you are not comfortable with the differences between numpy and pandas, I really encourage you to take the time to carefully follow the references that we list as prerequisites of the MOOC:

Hello, fortunately, although I could not manage to run the cross-validation, I reached a result through model.fit() and model.predict() and managed to draw the scatter plots (I have no problems with numpy and pandas):
import seaborn as sns

ax = sns.scatterplot(x=target_test, y=linearpred,
                     color="black", alpha=0.5)
ax.set_title("Predict vs Target")

ax = sns.scatterplot(x=target_test, y=histpred,
                     color="black", alpha=0.5)
ax.set_title("Predict vs Target")

with result:


but I do not know how to include the results for the subset (the time slice from Q11) in the same plot:

So, my doubts are: how can I integrate the two plots into one scatter plot, and what does "smoother outputs" mean in Q11? Thanks

Hi Idziej,
Your results look quite similar to mine.
image
image

About the "smooth" output, it has been discussed here.
I think the question will be reformulated in the next session.
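On combining the two models in one scatter plot: one option is to draw both on the same axes with a label per model. A minimal sketch with synthetic stand-ins for target_test, linearpred, and histpred (here using matplotlib directly; seaborn's scatterplot accepts an ax= argument to the same effect):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the target and the two models' predictions
rng = np.random.RandomState(0)
target_test = rng.randn(100)
linearpred = target_test + rng.randn(100) * 0.5
histpred = target_test + rng.randn(100) * 0.2

fig, ax = plt.subplots()
ax.scatter(target_test, linearpred, alpha=0.5, label="linear model")
ax.scatter(target_test, histpred, alpha=0.5, label="HGBT")
ax.plot([-3, 3], [-3, 3], "k--", label="perfect prediction")
ax.set_xlabel("Target")
ax.set_ylabel("Prediction")
ax.legend()
ax.set_title("Predictions vs target")
```

Drawing both point clouds on the same ax object, each with its own label, produces a single scatter plot with a legend distinguishing the models.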