Why is the Culmen Depth value in second place in the array passed as an argument to tree.predict or tree.predict_proba?

echidne · 21 June 2021 14:08

Hi,
If I have understood the lesson “Build a classification decision tree” correctly, when you use tree.predict([[a,b]]) or tree.predict_proba([[a,b]]), a is a given value of the “Culmen Length” series used as a threshold and b is a given value of the “Culmen Depth” series used as a threshold.
My question is why? Why not the other way around? Will the most important feature used for the decision always be the last in the array?

glemaitre58 · 21 June 2021 16:35

I think there is some confusion here.

First, let’s have a look at the dataset:

data, target = penguins[culmen_columns], penguins[target_column]

where culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"].
It means that the first column will be the length and the second column will be the depth.

We then call something like tree.fit(data, target). This creates a tree with different thresholds learned on both features.

When calling tree.predict([[a,b]]), a and b are respectively a culmen length value and a culmen depth value. They must be given in the same order as during fit: the first column is the length and the second is the depth. This order does not change, however important or unimportant each feature is to the tree.
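To make this concrete, here is a minimal sketch on made-up numbers (not the real penguins data; the values and the variable names such as tree_swapped are only for illustration). It assumes scikit-learn is installed. Only the depth column separates the two classes, so the tree splits on depth, yet depth still has to be passed in second position because that is where it was during fit.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy measurements, NOT the real penguins dataset.
# Column 0 is culmen length (mm), column 1 is culmen depth (mm).
# Only depth separates the two classes in this toy data.
data = [
    [40.0, 18.0], [45.0, 19.0], [50.0, 18.5],  # Adelie
    [40.0, 14.0], [45.0, 15.0], [50.0, 14.5],  # Gentoo
]
target = ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"]

tree = DecisionTreeClassifier(random_state=0).fit(data, target)

# Index of the column used at the root split: 1, i.e. depth.
# Being the most useful feature does not move it in the input array.
root_feature = tree.tree_.feature[0]

# Same two numbers in different positions: the tree reads them by position.
pred_length_first = tree.predict([[45.0, 14.0]])[0]  # length=45, depth=14
pred_swapped = tree.predict([[14.0, 45.0]])[0]       # length=14, depth=45

# Refit with the columns swapped (depth first, length second).
# Now predict expects [[depth, length]].
data_swapped = [[depth, length] for length, depth in data]
tree_swapped = DecisionTreeClassifier(random_state=0).fit(data_swapped, target)
root_feature_swapped = tree_swapped.tree_.feature[0]
pred_after_refit = tree_swapped.predict([[14.0, 45.0]])[0]  # depth=14, length=45

print(root_feature, pred_length_first, pred_swapped)
print(root_feature_swapped, pred_after_refit)
```

On this toy data the first tree predicts Gentoo for [[45, 14]] and Adelie for [[14, 45]], while the tree refitted on swapped columns predicts Gentoo for [[14, 45]]. The position of a value in the input only depends on the column order used in fit, never on which feature the tree finds most useful.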

echidne · 21 June 2021 17:26

My bad: I did try reversing the order of the features in culmen_columns, but when I typed:

tree.predict([[0,15]])

I still got array(['Gentoo'], dtype=object) as output, so I thought b was always Culmen Depth.

But I did not test enough since

tree.predict([[17,0]])

gives the result: array(['Adelie'], dtype=object)

Sorry, I should have tested more before posting.

glemaitre58 · 21 June 2021 17:36

No problem