Data selector

Abhaya_simha · 27 November 2022 06:37

What is the difference between the data.columns.difference and column_selector commands?

ArturoAmorQ · 28 November 2022 10:30

The method dataframe.columns.difference() returns the columns that are not in the list you pass as argument, that is, the complement of those values. By default pandas attempts to sort the result. It can be used to create a new dataframe from an existing one while excluding some columns. Let us look at an example:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))
df

          A         B         C         D
0 -1.023134 -0.130241 -0.675639 -0.985182
1  0.270465 -1.099458 -1.114871  3.203371
2 -0.340572  0.913594 -0.387428  0.867702
3 -0.487784  0.465429 -1.344002  1.216967
4  1.433862 -0.172795 -1.656147  0.061359

df_new = df[df.columns.difference(['B', 'D'])]
df_new

          A         C
0 -1.023134 -0.675639
1  0.270465 -1.114871
2 -0.340572 -0.387428
3 -0.487784 -1.344002
4  1.433862 -1.656147
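
One detail is hidden by the example above because its columns are already in alphabetical order. Here is a minimal sketch, assuming a toy dataframe with deliberately unsorted column names (the names and values are made up for illustration), comparing the default behaviour with sort=False and with drop:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe whose columns are NOT in alphabetical order.
df = pd.DataFrame(np.zeros((2, 4)), columns=list("DBCA"))

# Default: the remaining column names come back sorted.
sorted_cols = df.columns.difference(["B"])

# sort=False keeps the remaining columns in their original order.
ordered_cols = df.columns.difference(["B"], sort=False)

# drop(columns=...) also excludes columns and keeps the original order.
dropped = df.drop(columns=["B"])

print(list(sorted_cols))
print(list(ordered_cols))
print(list(dropped.columns))
```

If the column order matters for your downstream code, sort=False or drop is probably the safer choice.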

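The function sklearn.compose.make_column_selector creates a callable that, when applied to a dataframe, returns the list of column names matching the given criteria. It relies on pandas.DataFrame.select_dtypes to filter by data type and can also select columns based on a regex pattern in their names: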

import pandas as pd
from sklearn.compose import make_column_selector

data = pd.read_csv("../datasets/house_prices.csv", na_values="?")
# The selector returns a list of column names, not the data itself
garage_columns = make_column_selector(pattern="Garage")(data)
garage_columns

['GarageType',  'GarageYrBlt',  'GarageFinish',  'GarageCars',
 'GarageArea',  'GarageQual',  'GarageCond']
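
The selector can also filter by data type, alone or combined with a name pattern. Here is a minimal sketch, assuming a small hand-made dataframe (its values are invented for illustration and are not taken from house_prices.csv):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy data mixing numerical and string columns.
toy = pd.DataFrame(
    {
        "GarageArea": [548.0, 460.0],
        "GarageType": ["Attchd", "Detchd"],
        "LotArea": [8450.0, 9600.0],
    }
)

# Select by dtype only.
numeric_cols = make_column_selector(dtype_include=np.number)(toy)

# Combine a dtype filter with a regex pattern on the column names.
garage_numeric_cols = make_column_selector(
    pattern="Garage", dtype_include=np.number
)(toy)

# Everything that is not numerical.
non_numeric_cols = make_column_selector(dtype_exclude=np.number)(toy)

print(numeric_cols)
print(garage_numeric_cols)
print(non_numeric_cols)
```

So the main practical difference is that columns.difference works on an explicit list of names to exclude, whereas make_column_selector builds a reusable rule (dtype and/or name pattern). Such a selector can be passed in place of a column list to sklearn.compose.ColumnTransformer, where it is evaluated on the dataframe given to fit.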