Q4 Cat and num preprocessor. Which goes first?

peguerosdc · 5 July 2021 01:13

Hi! While solving Q4 of wrap-up quiz 5, I noticed that I still don’t understand very well how to use Column Transformers.

I found out that creating preprocessor A is not the same as creating preprocessor B (please see the image below), as they yield different results (A gives 0.74 and B gives 0.72).

When displaying the diagrams of the models, the transformers appear to be on the same level, so I assumed that their order wouldn’t make any difference: the numerical preprocessor would apply its selector to the whole input and do its thing, and the categorical preprocessor would do the same. Now I am wondering whether I misunderstood this part and the output of the first preprocessor is then processed by the second one (which is not what I would expect, since no line on the diagrams directs the output of one preprocessor to the input of the other).

Can someone help me understand this? I checked the docs of ColumnTransformer but couldn’t find any information about hierarchy/order.

Thanks!

glemaitre58 · 5 July 2021 10:37

I think that we have a similar post in the forum that discusses exactly this topic. I am too lazy to search for it myself, so I will answer briefly here, but don’t hesitate to search the forum.

The difference in mean score is just random fluctuation. Basically, since the model is different (only the categorical and numerical processing are swapped), the dataset will be split differently and you will obtain a different score. This variation is not important in practice. It just tells you that you should not consider one model better than another based on a fluctuation of 0.02.
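To illustrate what swapping the two preprocessors does, here is a minimal sketch on a toy DataFrame (the columns `age`, `income` and `city` and the helper `make_preprocessor` are made up for the example; this is not the quiz dataset or the quiz code). Each transformer in a `ColumnTransformer` receives its own columns from the original input, and the outputs are concatenated side by side, so the order should only change the column order of the result.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, hypothetical: two numerical columns and one categorical column.
X = pd.DataFrame({
    "age": [25.0, 32.0, 47.0, 51.0],
    "income": [30.0, 42.0, 58.0, 61.0],
    "city": ["Paris", "Lyon", "Paris", "Nice"],
})


def make_preprocessor(order):
    """Build a ColumnTransformer whose transformers follow `order`."""
    steps = {
        "num": ("num", StandardScaler(), ["age", "income"]),
        "cat": ("cat", OneHotEncoder(), ["city"]),
    }
    # sparse_threshold=0 forces a dense array so the outputs are easy to compare.
    return ColumnTransformer([steps[name] for name in order], sparse_threshold=0)


out_a = make_preprocessor(["num", "cat"]).fit_transform(X)  # numerical first
out_b = make_preprocessor(["cat", "num"]).fit_transform(X)  # categorical first

# Same values, different column positions:
# out_a = [scaled age, scaled income | one-hot city]
# out_b = [one-hot city | scaled age, scaled income]
print(np.allclose(out_a[:, :2], out_b[:, 3:]))  # numerical block matches
print(np.allclose(out_a[:, 2:], out_b[:, :3]))  # categorical block matches
```

Both checks compare the same block of columns at its two positions. Neither transformer ever sees the output of the other.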

However, regarding the quiz question, I think that we specify the order of the ColumnTransformer so that you don’t run into this problem when answering. In the next session, we plan to take more care with this type of fluctuation, which might change the answer purely due to randomness (split, algorithms, etc.).
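To get a feeling for how large such fluctuations can be, here is a rough sketch on synthetic data (the dataset, the `LogisticRegression` model and the seeds are assumptions for illustration, not the quiz setup). It repeats a 5-fold cross-validation with different shuffles and compares the spread of the mean scores with the standard deviation across folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic classification problem, hypothetical stand-in for the quiz data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

mean_scores = []
for seed in range(5):
    # Only the way the data is shuffled into folds changes between runs.
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv)
    mean_scores.append(scores.mean())
    print(f"seed={seed}: accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")

mean_scores = np.array(mean_scores)
spread = mean_scores.max() - mean_scores.min()
print(f"spread of the mean accuracy across seeds: {spread:.3f}")
```

If the gap between two models is of the same order as this spread, or as the standard deviation across folds, it is hard to claim that one model is really better than the other.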