Error while using train_test_split

“ValueError: Found input variables with inconsistent numbers of samples: [4, 48842]”

I get the above error when using train_test_split on the digits data.

How can this be corrected?

Could you provide the code snippet that triggers the error? It should show how you load the data and how you intend to split the arrays.

The function is complaining because the arrays you passed do not contain the same number of samples: according to the message, the first one has 4 samples along its first axis and the second one has 48842. Assuming that you want to split X and y, you need X.shape[0] == y.shape[0].
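As a rough sketch of that check, here is one way such a mismatch can arise with the digits data. I am assuming, without having seen your code, that the array reshaped for plotting was passed to the split. The names `grid` and `flat` are only illustrative and do not come from the slides, and the sample counts differ from the ones in your message.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
target = digits.target[30:70]                            # 40 labels

# Hypothetical cause: the plotting layout has only 4 entries on its first axis.
grid = digits.images[30:70].reshape((4, 10, -1))
# One row per sample, so the first axis matches the number of labels.
flat = digits.images[30:70].reshape((len(target), -1))

try:
    train_test_split(grid, target)
    error_message = None
except ValueError as exc:
    # Raised because grid has 4 "samples" while target has 40.
    error_message = str(exc)
print(error_message)

# Check the first dimensions before splitting.
assert flat.shape[0] == target.shape[0]
X_train, X_test, y_train, y_test = train_test_split(
    flat, target, test_size=0.2, shuffle=False)
print(X_train.shape, X_test.shape)
```

Printing X.shape and y.shape right before the call to train_test_split is usually enough to spot which array has the unexpected first dimension.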

The code provided in the slides is incomplete for the sake of simplicity.
In particular, it does not define the target that is passed to train_test_split.

If you want to reproduce the example from the slides, run

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
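# Arrange 40 images as a 4 x 10 grid, for plotting only.
data = digits.images[30:70].reshape((4, 10, -1))
# The 40 matching labels, used later by train_test_split.
target = digits.target[30:70]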

fig, ax = plt.subplots(5, 10, figsize=(18, 8),
                      subplot_kw=dict(xticks=[], yticks=[])
                      )

for j in range(10):
    ax[4, j].set_visible(False)
    for i in range(4):
        im = ax[i, j].imshow(data[i, j].reshape((8, 8)),
                             cmap=plt.cm.binary,
                             interpolation='nearest')
        im.set_clim(0, 16)

plt.show()

to create the image. Afterwards, run

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

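# Reuses `digits` and `target` from the snippet above.
# Flatten each 8x8 image into one row of 64 features: shape (40, 64).
data = digits.images[30:70]
n_samples = len(data)
data = data.reshape((n_samples, -1))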

X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, shuffle=False)

model = make_pipeline(StandardScaler(), 
                      LogisticRegression())
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print("The test accuracy is "
      f"{accuracy:.3f}")

to evaluate the model on the test set.

As you can see, that would be a lot of code to display in the slides!

Thanks for the whole code!!