Tree split threshold

Could you please help me understand the criterion a decision tree uses to determine a split threshold, specifically in scikit-learn? Thank you so much.

In scikit-learn, the criterion used by the tree is controlled by the criterion parameter (refer to the documentation for more detail: sklearn.tree.DecisionTreeClassifier — scikit-learn 0.24.2 documentation).

In classification, it can be either the entropy or the Gini impurity. These measures, computed on a partition (meaning a set of samples), are minimal when a single class is present in the partition. Thus, we try to minimize this criterion when splitting.
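To make this concrete, here is a minimal sketch of both measures computed on a partition of class labels (hypothetical helper functions, not scikit-learn internals):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a partition: 1 - sum_k p_k^2 (0 when pure)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy of a partition: -sum_k p_k * log2(p_k) (0 when pure)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

pure = np.array([0, 0, 0, 0])
mixed = np.array([0, 0, 1, 1])
print(gini(pure), entropy(pure))    # both 0 for a pure partition
print(gini(mixed), entropy(mixed))  # 0.5 and 1.0 for a 50/50 mix
```

Both measures reach their maximum when the classes are evenly mixed, which is why a good split drives them down.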

In regression, we compute the mean error between the true targets of the samples in the partition and the partition's prediction. This error can be the squared error, the absolute error, etc.
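A minimal sketch of the two regression criteria described above (hypothetical helpers, not scikit-learn internals): with the squared error the partition's prediction is the mean, and with the absolute error it is the median.

```python
import numpy as np

def mse(y):
    """Mean squared error of a partition: predict the mean, measure squared error."""
    return np.mean((y - y.mean()) ** 2)

def mae(y):
    """Mean absolute error of a partition: predict the median, measure absolute error."""
    return np.mean(np.abs(y - np.median(y)))

y = np.array([1.0, 2.0, 3.0, 10.0])
print(mse(y))  # 12.5
print(mae(y))  # 2.5
```

Note how the outlier (10.0) inflates the squared error much more than the absolute error, which is why the absolute-error criterion is considered more robust.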

Finally, to decide whether to pick a split, we have to combine the criterion value of the parent node with those of the two nodes created by the split. This is known as the information gain and is more or less gain = criterion_parent - (criterion_left + criterion_right). In reality, each child term is weighted by the fraction of the parent's samples that falls into that child.
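The combination above can be sketched as follows, using the Gini impurity as the criterion (a toy illustration, not the scikit-learn implementation):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 (0 when pure)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    """Parent impurity minus the sample-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

parent = np.array([0, 0, 0, 1, 1, 1])
# A perfect split separates the classes completely: gain = parent impurity.
print(information_gain(parent, parent[:3], parent[3:]))  # 0.5
# A useless split leaves each child as mixed as the parent: gain = 0.
print(information_gain(parent, parent[[0, 3]], parent[[1, 2, 4, 5]]))  # 0.0
```

The tree greedily picks, among all candidate features and thresholds, the split with the largest gain.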

@ldziej does this answer your question?


For the regression criterion, the median can also be used, right?

Yes, but it comes together with changing the criterion used. Instead of computing the mean squared error to find the best split, the mean absolute error is used, and in this case the terminal nodes (leaves) of the tree will output the median of the training samples at that node.
