How to access each individual tree of a forest in a pipeline?

The following snippet from the course is easy to understand:

tree_predictions = []
for tree in rf.estimators_:  # 'rf' is a fitted random forest model
    tree_predictions.append(tree.predict(data_test))

However, given

rf = RandomForestRegressor(n_jobs=2, random_state=0)
rf_model = make_pipeline(preprocessor, rf)
# Note: 'preprocessor' is a pipeline that selects the categorical/numerical columns, imputes missing data, and transforms the categorical/numerical columns.

grid_search = GridSearchCV(
    rf_model, param_grid=param_grid,
    scoring="neg_mean_absolute_error", n_jobs=-1,
)
_ = grid_search.fit(data_train, target_train)

How can I get predictions from each individual tree for a set of new data?

So, in the grid search, you can access the best estimator using best_estimator_:

grid_search.best_estimator_

This estimator will be a pipeline whose last step is the random forest. The fitted trees are stored in a Python list in the fitted attribute estimators_, so you can access them as follows:

for idx, tree in enumerate(grid_search.best_estimator_[-1].estimators_, start=1):
    print(f"Tree #{idx}: {tree}")
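
As a side note, since the pipeline was built with make_pipeline, the forest can also be reached through its auto-generated step name. A minimal sketch, assuming the default lowercased class name that make_pipeline assigns:

forest = grid_search.best_estimator_.named_steps["randomforestregressor"]
print(len(forest.estimators_))  # number of fitted trees (n_estimators, 100 by default in recent versions)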

Perfect! Many thanks!

In addition, to make predictions on the test data with each tree, I think it is necessary to re-fit the model on the training data and then use .predict(). Please comment on this.

e.g.,

tree_predictions = []
for idx, tree in enumerate(grid_search.best_estimator_[-1].estimators_):
    #print(tree)
    transformer = make_pipeline(preprocessor, tree)
    _ = transformer.fit(data_train, target_train)
    tree_predictions.append(transformer.predict(data_test))

best_estimator_ was already fitted on the training data, and each tree was therefore fitted on its own bootstrap sample. Hence, you should not need to refit the trees.
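
For instance, here is a minimal sketch reusing the names from your snippet (and assuming pipeline slicing, available in scikit-learn >= 0.21): transform the test data once with the already-fitted preprocessing steps, then query each tree directly, without any refitting.

fitted_preprocessing = grid_search.best_estimator_[:-1]  # every step except the forest, already fitted
data_test_transformed = fitted_preprocessing.transform(data_test)

tree_predictions = []
for tree in grid_search.best_estimator_[-1].estimators_:
    # each tree predicts on the preprocessed test data; no refitting involved
    tree_predictions.append(tree.predict(data_test_transformed))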

However, I am having trouble identifying your use case, or the question that you are trying to answer.

That is what I thought.

The objective of this code is to make predictions for a test data set with each individual tree.

I was asking whether it is necessary to re-fit the model because, when I run:

tree_predictions = []
for idx, tree in enumerate(grid_search.best_estimator_[-1].estimators_):
    transformer = make_pipeline(preprocessor, tree)
    tree_predictions.append(transformer.predict(data_test))

I got an error:

“ValueError: X has 82 features, but DecisionTreeRegressor is expecting 80 features as input.”

Does this mean that the training data used for the trees had fewer features than the total number of usable features during cross-validation?
Note that I used the train_test_split() function to partition the entire data set.

It means that at predict time there are 82 columns, while only 80 were provided during fit. So the number of columns produced by the ColumnTransformer is not consistent.
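
As a quick check, here is a sketch that compares the number of columns produced by the preprocessing steps stored inside the best pipeline with the number produced by the standalone preprocessor object (assuming the latter was fitted separately, which the error suggests):

# columns the trees were trained on (via the preprocessor refitted inside GridSearchCV)
n_fit_columns = grid_search.best_estimator_[:-1].transform(data_train).shape[1]
# columns produced by the standalone 'preprocessor' used in the new pipeline
n_predict_columns = preprocessor.transform(data_test).shape[1]
print(n_fit_columns, n_predict_columns)  # e.g. 80 vs. 82 in the reported error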