Feedback on code

Is the code below in line with the “philosophy” of scikit-learn? I am planning to invest some time in getting better at seaborn/matplotlib in the future. I realize StandardScaler isn’t required here given the non-parametric nature of tree models, but I’m trying to practice the general pattern. Another question, based on your experience: is it better to start from the end (i.e. if you are doing cross-validation, build out the cross-validation first, then the model, then the preprocessors, etc.), or do you start from the other side? I appreciate your feedback!

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_selector
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
numerical_column_selector = make_column_selector(dtype_include = np.number)
numerical_preprocessor = StandardScaler()
preprocessor = ColumnTransformer([("numerical_preprocessor", numerical_preprocessor, numerical_column_selector), ], n_jobs=-1, verbose=True, )
regressor = RandomForestRegressor(n_estimators=3)
model = make_pipeline(preprocessor, regressor)
model.fit(data_train, target_train)
y_pred = model.predict(data_test)
y_true = target_test.copy()
mean_absolute_error(y_true=y_true, y_pred=y_pred)

start_synthetic_range, end_synthetic_range = 170, 230
synthetic_test = pd.DataFrame(np.arange(start_synthetic_range, end_synthetic_range + 1), columns=data_train.columns)
import matplotlib.pyplot as plt
merged_train = pd.merge(data_train, target_train, left_index=True, right_index=True)
import seaborn as sns
ax = sns.scatterplot(data=merged_train, x=feature_name, y=target_name, color="blue", alpha=0.5 )
ax.set_title(f"{target_name} as a function of {feature_name}")
list_of_estimators = [estimator for estimator in model[-1].estimators_]
list_of_predictions = [estimator.predict(model[-2].transform(synthetic_test)) for estimator in list_of_estimators]
synthetic_predictions_df = pd.DataFrame(np.array(list_of_predictions).T, columns=['pred1', 'pred2', 'pred3'])
synthetic_predictions_df[feature_name] = np.arange(start_synthetic_range, end_synthetic_range + 1)
ax = sns.scatterplot(data=merged_train, x=feature_name, y=target_name, color="blue", alpha=0.5, legend=True )
sns.lineplot(data=synthetic_predictions_df, x=feature_name, y='pred1', color="red", alpha=0.5, ax=ax )
sns.lineplot(data=synthetic_predictions_df, x=feature_name, y='pred2', color="green", alpha=0.5, ax=ax )
sns.lineplot(data=synthetic_predictions_df, x=feature_name, y='pred3', color="purple", alpha=0.5, ax=ax, legend=True )
ax.set_title(f"{target_name} as a function of {feature_name}")

I would say the standard workflow is:

  1. Data preprocessing (this may include cleaning the dataset, scaling, imputing missing values, encoding, etc.).
  2. Define your model/pipeline. This may include step 1 when using pipelines.
  3. Perform hyperparameter tuning on a sub-sample of the dataset (namely the validation set), which can be obtained with nested cross-validation, for instance (see the sketch after this list). If you find your parameters are not stable across folds, you may want to revisit step 1 and look for correlated variables, change the imputation strategy, etc.
  4. Score your model on the test set. Change the model if needed (try linear and non-linear models, for instance) and repeat step 3.
  5. Plot the final results.
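
To make steps 2-4 concrete, here is a minimal sketch reusing the objects from your snippet (data_train, target_train, data_test, target_test); the parameter grid and cv=5 are only illustrative choices, not recommendations:

import numpy as np
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Step 2: keep the preprocessing inside the pipeline so it is refit on the
# training part of every fold (no information leaks from the held-out data).
num_selector = make_column_selector(dtype_include=np.number)
preprocessor = ColumnTransformer([("num", StandardScaler(), num_selector)])
pipe = make_pipeline(preprocessor, RandomForestRegressor(random_state=0))

# Step 3: tune hyperparameters with cross-validation on the training data.
param_grid = {"randomforestregressor__n_estimators": [10, 50, 100]}
search = GridSearchCV(
    pipe, param_grid, cv=5, scoring="neg_mean_absolute_error"
)
search.fit(data_train, target_train)

# Step 4: score the selected model once on the held-out test set.
print(search.best_params_)
print(search.score(data_test, target_test))

Wrapping search itself in cross_validate would give the nested cross-validation mentioned in step 3.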

A side suggestion is to keep your code lines below 80 characters to improve readability :wink:

Thank you, Arturo! I will start using autopep8 before posting code. A related question about this code: when we write pipeline[-2], do we get a copy of the transformer, or does it refer to the same transformer by reference? -Pritam

It refers to the same transformer. In your example, pipeline[-2] is the trained ColumnTransformer itself, meaning that you can access its attributes with the notation pipeline[-2].transformers_. The emphasis on the word trained is because you can cross-validate a pipeline, in which case each transformer is fitted on a different subset of the data in each fold. The attributes of each trained transformer remain available for inspection; for example, StandardScaler will yield a different mean_ in each fold.
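
Here is a small illustration of both points, using your variable model for what I wrote as pipeline, and assuming the fitted pipeline and training data from your snippet (cross_validate's return_estimator=True option exposes the per-fold fits):

from sklearn.model_selection import cross_validate

# Indexing a pipeline does not copy: this is the trained ColumnTransformer.
assert model[-2] is model.named_steps["columntransformer"]
print(model[-2].transformers_)

# When cross-validating, each returned estimator is a clone fitted on a
# different training fold, so the scaler's mean_ differs between folds.
cv_results = cross_validate(
    model, data_train, target_train, cv=5, return_estimator=True
)
for fitted_pipeline in cv_results["estimator"]:
    scaler = fitted_pipeline[-2].named_transformers_["numerical_preprocessor"]
    print(scaler.mean_)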

In your case, wrapping only the StandardScaler in a ColumnTransformer introduces an unnecessary step that makes accessing the scaler's attributes more difficult.
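
For instance, a hypothetical simplification, assuming all the columns you keep are numerical anyway:

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

simple_model = make_pipeline(
    StandardScaler(), RandomForestRegressor(n_estimators=3)
)
simple_model.fit(data_train, target_train)

# The scaler's attributes are now one lookup away, compared with
# model[-2].named_transformers_["numerical_preprocessor"].mean_
print(simple_model[-2].mean_)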