.transform and .fit_transform return ndarray instead of DataFrame

andrewjohnlowe · 8 May 2022 18:48

It felt very counterintuitive to me that the .transform and .fit_transform methods take a pandas DataFrame as an argument but return an ndarray, when naively I would have expected a pandas DataFrame to be returned with the numeric values scaled. There’s this extra step of converting the ndarray back to a DataFrame and naming the columns of the new DataFrame with the column names extracted from the old DataFrame, which feels clunky, and the necessity of this step isn’t explained. This step isn’t required when using a pipeline, so can I assume that the former strategy (i.e., calling .transform or .fit_transform and then converting the resultant ndarray to a DataFrame) isn’t something that you’d actually do normally, and that you’d nearly always use pipelines instead? Or, to put it another way, is the former strategy shown purely for didactic purposes, and you’d almost never actually do this in practice?

ArturoAmorQ · 10 May 2022 13:18

Indeed. Working with arrays favors pipelines over manual DataFrame round-trips, since a typical use case involves at least a simple pipeline with some sort of preprocessing.
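To make the two strategies from the question concrete, here is a minimal sketch. The toy data, the column names, and the choice of LogisticRegression as the final step are assumptions made for illustration; they do not come from the course material.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the column names are illustrative only.
X = pd.DataFrame(
    {"age": [25.0, 32.0, 47.0, 51.0], "hours": [40.0, 35.0, 50.0, 20.0]}
)
y = np.array([0, 0, 1, 1])

# Manual route: with default settings the scaler returns a NumPy array,
# so column names and index have to be reattached by hand.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)

# Pipeline route: the array is passed from step to step internally,
# so no conversion back to a DataFrame is needed.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)
predictions = model.predict(X)
```

The manual conversion is mostly useful when you want to inspect the scaled values with their column labels; when the goal is to fit and predict, the pipeline hides the intermediate array entirely.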

glemaitre58 · 10 May 2022 18:48

Just a bit of history: scikit-learn is primarily a machine learning package in which all optimizations are based on numerical methods. Thus, from the start, scikit-learn has depended heavily on NumPy and SciPy.

For a couple of years now, scikit-learn has been moving toward better compatibility with more generic tools such as pandas, but the development is slow. Recently, there has been work on handling column names through the estimator method get_feature_names_out, which tracks column names within a pipeline, and ongoing work aims to support pandas in/out for estimators. However, those changes require technical discussion to get the API right, which is why we have Scikit-Learn Enhancement Proposals such as SLEP 014, "Pandas in, Pandas out", by thomasjpfan (Pull Request #37 in scikit-learn/enhancement_proposals on GitHub).

I hope this sheds a bit more light on the current state of the library and its integration within the PyData ecosystem.