Is the code below in line with the “philosophy” of scikit-learn? I am also planning to invest some time in getting better at seaborn/matplotlib. I realize StandardScaler isn’t required here because tree-based models are insensitive to feature scaling, but I’m trying to practice the general pattern. Another question: in your experience, is it better to start from the end (i.e., if you are doing cross-validation, build out the cross-validation first, then the model, then the preprocessors, etc.), or to start from the other side? I appreciate your feedback!
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
numerical_column_selector = make_column_selector(dtype_include=np.number)
numerical_preprocessor = StandardScaler()
preprocessor = ColumnTransformer(
    [("numerical_preprocessor", numerical_preprocessor, numerical_column_selector)],
    n_jobs=-1,
    verbose=True,
)
regressor = RandomForestRegressor(n_estimators=3)
model = make_pipeline(preprocessor, regressor)
model.fit(data_train, target_train)
y_pred = model.predict(data_test)
y_true = target_test.copy()
mean_absolute_error(y_true=y_true, y_pred=y_pred)
start_synthetic_range, end_synthetic_range = 170, 230
synthetic_test = pd.DataFrame(np.arange(start_synthetic_range, end_synthetic_range + 1), columns=data_train.columns)
import matplotlib.pyplot as plt
merged_train = pd.merge(data_train, target_train, left_index=True, right_index=True)
import seaborn as sns
ax = sns.scatterplot(data=merged_train, x=feature_name, y=target_name, color="blue", alpha=0.5 )
ax.set_title(f"{target_name} as a function of {feature_name}")
list_of_estimators = list(model[-1].estimators_)
list_of_predictions = [estimator.predict(model[-2].transform(synthetic_test)) for estimator in list_of_estimators]
synthetic_predictions_df = pd.DataFrame(np.array(list_of_predictions).T, columns=['pred1', 'pred2', 'pred3'])
synthetic_predictions_df[feature_name] = np.arange(start_synthetic_range, end_synthetic_range + 1)
# Label each artist so the legend has entries to show.
ax = sns.scatterplot(data=merged_train, x=feature_name, y=target_name, color="blue", alpha=0.5, label="training data")
sns.lineplot(data=synthetic_predictions_df, x=feature_name, y='pred1', color="red", alpha=0.5, label="tree 1", ax=ax)
sns.lineplot(data=synthetic_predictions_df, x=feature_name, y='pred2', color="green", alpha=0.5, label="tree 2", ax=ax)
sns.lineplot(data=synthetic_predictions_df, x=feature_name, y='pred3', color="purple", alpha=0.5, label="tree 3", ax=ax)
ax.legend()
ax.set_title(f"{target_name} as a function of {feature_name}")
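To make the cross-validation part of my question concrete, here is a rough sketch of what I mean by building the evaluation around the whole pipeline. It is only an illustration under assumptions: the data is made up (`demo_data`, `demo_target` and the column name `feature` are placeholders, not my real dataset), and the choice of 5 folds is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data standing in for data_train / target_train.
rng = np.random.default_rng(0)
demo_data = pd.DataFrame({"feature": rng.uniform(170, 230, size=200)})
demo_target = pd.Series(
    20 * demo_data["feature"] + rng.normal(0, 100, size=200), name="target"
)

# Same pattern as above: preprocessing and regressor live in one pipeline,
# so the scaler is re-fit on the training part of every fold.
preprocessor = ColumnTransformer(
    [("numerical_preprocessor", StandardScaler(),
      make_column_selector(dtype_include=np.number))]
)
cv_model = make_pipeline(
    preprocessor, RandomForestRegressor(n_estimators=3, random_state=0)
)

# scikit-learn scorers follow "higher is better", hence the negated MAE.
cv_results = cross_validate(
    cv_model, demo_data, demo_target, cv=5, scoring="neg_mean_absolute_error"
)
cv_mae = -cv_results["test_score"]
print(f"MAE: {cv_mae.mean():.1f} +/- {cv_mae.std():.1f}")
```

Because the preprocessor sits inside the pipeline, swapping the regressor or adding preprocessing steps later should not require touching the cross-validation call.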