For question 5, depending on how the data is preprocessed, the result can fall outside the proposed range of answers. Here is the code I used:
Full code snippet
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_validate

numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)
numerical_columns = numerical_columns_selector(data)
categorical_columns = categorical_columns_selector(data)

categorical_preprocessor = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
# categorical_preprocessor = make_pipeline(
#     SimpleImputer(strategy="constant", fill_value="unknown"),
#     OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
# categorical_preprocessor = make_pipeline(OneHotEncoder(handle_unknown="ignore"))

numerical_preprocessor = SimpleImputer()
preprocessor = ColumnTransformer([
    ('OrdinalEncoder', categorical_preprocessor, categorical_columns),
    ('standard-scaler', numerical_preprocessor, numerical_columns)])
model = make_pipeline(preprocessor, DecisionTreeRegressor(random_state=0))
cv_results = cross_validate(model, data, target, scoring="r2",
                            return_train_score=True, return_estimator=True,
                            cv=10, n_jobs=2, error_score="raise")
scores = cv_results["test_score"]
print("The mean cross-validation R2 score is: "
      f"{scores.mean():.3f} +/- {scores.std():.3f}")
The first variant gives
categorical_preprocessor = make_pipeline(SimpleImputer(strategy="most_frequent"),OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
0.756
The second variant gives
categorical_preprocessor = make_pipeline(SimpleImputer(strategy="constant",fill_value="unknown"),OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
0.742
And the third one gives
categorical_preprocessor = make_pipeline(OneHotEncoder(handle_unknown="ignore"))
0.726
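The difference between the first two variants comes down to what the imputer does with missing categories. Here is a minimal sketch on a hypothetical toy column (the column name and values are my own invention, not from the exercise) contrasting the two strategies:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical column with one missing value (hypothetical data).
X = pd.DataFrame({"color": ["blue", "blue", np.nan, "red"]})

# "most_frequent" replaces NaN with the mode ("blue"),
# silently merging missing entries into an existing category.
most_frequent = SimpleImputer(strategy="most_frequent").fit_transform(X)

# "constant" keeps missingness visible as its own "unknown" category,
# which the OrdinalEncoder then encodes as a separate value.
constant = SimpleImputer(strategy="constant",
                         fill_value="unknown").fit_transform(X)

print(most_frequent.ravel())  # missing entry merged into the mode
print(constant.ravel())       # missing entry kept as "unknown"
```

With many missing values, merging them into the mode can noticeably change what the downstream tree learns, which is presumably why the two variants score differently.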
What I did was try the first solution (0.756), see that this answer was not among the proposed choices, then fall back to the one-hot encoder strategy (which is wrong, as I saw afterwards when reading the forum). But since its result matched one of the choices, I gave that answer (0.726), which was wrong. I only thought of the middle variant afterwards.
Edit:
data.info()
helps to see that the "most_frequent" strategy was probably not the best one. Maybe its use should be emphasized as early as Module 1, "Tabular data exploration", to detect early how many values are missing.
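Besides `data.info()`, which reports non-null counts per column, `isnull().sum()` gives the missing counts directly. A small sketch on a made-up frame (the column names here are hypothetical stand-ins for the course's `data`):

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the course's `data`:
# one categorical column and one numeric column, both with gaps.
data = pd.DataFrame({
    "ocean_proximity": ["INLAND", None, "NEAR BAY", None, "INLAND"],
    "median_income": [3.2, 1.5, np.nan, 4.8, 2.1],
})

# data.info() prints non-null counts; isnull().sum() returns the
# number of missing values per column as a Series.
missing_per_column = data.isnull().sum()
print(missing_per_column)
```

Running this kind of check before choosing an imputation strategy makes it obvious when a column has too many missing values for "most_frequent" to be a safe default.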