Q7 How to use LeaveOneGroupOut with cross_validate?

pbedo · 8 July 2021 14:29

Hello, I’m absolutely lost about how to use LeaveOneGroupOut as a cross-validation strategy for cross_validate.

Here’s my code, which ends with an error:

# I assume the result of the previous question is not here for nothing
unique_ride_dates = np.unique(cycling.index.date)

# No idea of what I am doing here but it seems to match with the instructions
group, unique = pd.factorize(unique_ride_dates)

# create cv
from sklearn.model_selection import LeaveOneGroupOut
cv = LeaveOneGroupOut()

# Reusing the previous model with the (proper ?) parameter for LeaveOneGroupOut()
cv_results_linear = cross_validate(
    linear_model, data, target,
    cv=cv, groups=group,
    scoring='neg_mean_absolute_error',
    return_estimator=True,
    return_train_score=True
)

# And here's the error:
ValueError: Found input variables with inconsistent numbers of samples: [38254, 38254, 4]

I understand the error: my data and target have a length of 30k+ samples, whereas the array group has a length of 4, with values ranging from 0 to 3. But that’s what the previous instructions were asking for (I think?). So I guess my error stems from my misunderstanding of LOGO.

I have already checked the documentation of LOGO but I wasn’t able to make sense of it: LOGO doesn’t take any parameter, it just has 2 methods that I tried to use, but my code ends with the same error anyway.
The example in the docs doesn’t feature LOGO as a cross_validate strategy. The only instance where it does is an example in the course, where the value for the groups parameter (an array of 100+ dates featuring the year quarter) does not look like the length-4 group array in Q7 which, as I understand it, is required by the exercise instructions.

echidne · 8 July 2021 15:11

Hi pbedo,
the problem is that you first call np.unique(), which keeps only the distinct values (sorted), so you lose the per-row information.
LeaveOneGroupOut() holds one group out for validation and trains on the others, but groups must have as many entries as the data has rows.
With np.unique() you create an array of only 4 elements (the 4 dates of the rides), but you have many more rows since each ride has many samples.
Get rid of that part and use cycling.index.date directly; that should work.
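For reference, here is a minimal sketch of that fix on a small synthetic stand-in for the cycling dataframe. The column names, the values, and the Ridge model are assumptions made for illustration only; they are not the course's data or the linear_model from the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut, cross_validate

# Hypothetical stand-in for the cycling data: 4 rides on 4 different days,
# 50 timestamped samples per ride.
rng = np.random.default_rng(0)
ride_days = ["2020-08-18", "2020-08-20", "2020-08-26", "2020-09-13"]
index = pd.DatetimeIndex(
    [
        ts
        for day in ride_days
        for ts in pd.date_range(f"{day} 14:00", periods=50, freq="min")
    ]
)
data = pd.DataFrame(
    {
        "speed": rng.normal(size=len(index)),
        "slope": rng.normal(size=len(index)),
    },
    index=index,
)
target = 2.0 * data["speed"] - data["slope"] + rng.normal(scale=0.1, size=len(index))

# One label per sample: the date of each row, encoded as an integer ride id.
groups, unique_dates = pd.factorize(data.index.date)

cv = LeaveOneGroupOut()
cv_results = cross_validate(
    Ridge(),  # assumed model, standing in for linear_model
    data,
    target,
    cv=cv,
    groups=groups,
    scoring="neg_mean_absolute_error",
)

# groups has one entry per row; there is one fold per distinct ride date.
print(len(groups), len(unique_dates), len(cv_results["test_score"]))
```

The point is only that groups has the same length as data, while the number of folds equals the number of distinct rides.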

pbedo · 9 July 2021 13:05

Thanks, no more errors. So groups should always get a kind of categorical array ~~somehow matching the index of the data~~?

I was going to rant about the following paragraph:

create a variable called group that is a 1D numpy array containing the index of each ride present in the dataframe. Therefore, the length of group will be equal to the number of samples in data. If we had 2 bike rides, we would expect the indices 0 and 1 in group to differentiate the bike ride. You can use pd.factorize to encode any Python types into integer indices.

But I finally understood it after reading it 10+ times in order to write my rant:

group should reflect the ride each sample belongs to (not len(group) = number of rides)
len(group) = total number of samples
To do so, pd.factorize should assign a matching ride to each sample. So pd.factorize(unique_ride_dates) or even pd.factorize(cycling.index) was not going to work.
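The three options above can be compared on a tiny made-up index (2 rides with 3 samples each; the timestamps are hypothetical, not taken from the cycling data):

```python
import numpy as np
import pandas as pd

# Hypothetical index: 2 rides on 2 days, 3 timestamped samples each.
index = pd.DatetimeIndex(
    [
        "2020-08-18 14:00:00", "2020-08-18 14:00:01", "2020-08-18 14:00:02",
        "2020-08-20 09:30:00", "2020-08-20 09:30:01", "2020-08-20 09:30:02",
    ]
)

# (a) Unique dates only: one label per ride, so the array is too short.
codes_unique, _ = pd.factorize(np.unique(index.date))

# (b) Full timestamps: right length, but every sample ends up in its own group.
codes_index, _ = pd.factorize(index)

# (c) Per-sample dates: right length and one label per ride.
codes_dates, _ = pd.factorize(index.date)

print(codes_unique)  # [0 1]
print(codes_index)   # [0 1 2 3 4 5]
print(codes_dates)   # [0 0 0 1 1 1]
```

Only (c) gives an array of length len(data) whose values identify the ride each sample belongs to.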

As I understand it now, the groups parameter can take any categorical feature. In this exercise we are using dates, but it could have been the bike brand for example, right?
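A small sketch of that idea, assuming a hypothetical brand label per sample (the labels "A", "B", "C" and the toy X array are made up). It uses the two methods of LeaveOneGroupOut, get_n_splits and split, directly with string labels:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Toy data: 6 samples, 2 features.
X = np.arange(12).reshape(6, 2)

# Hypothetical categorical feature: one brand label per sample.
brand = np.array(["A", "A", "B", "B", "B", "C"])

cv = LeaveOneGroupOut()
n_splits = cv.get_n_splits(groups=brand)

# Collect the set of brands present in each held-out fold.
held_out = []
for train_idx, test_idx in cv.split(X, groups=brand):
    held_out.append(set(brand[test_idx]))

print(n_splits)  # one split per distinct brand
print(held_out)  # each test fold contains exactly one brand
```

The labels do not need to be integers here; what matters is that there is one label per sample.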

So no rant, I grossly misunderstood the instructions. Thanks again for the feedback.