What is the difference between the data.columns.difference and column_selector commands?
The method dataframe.columns.difference() returns the complement of the values that you pass as its argument: every column label of the dataframe that is not in that list, sorted by default. It can be used to create a new dataframe from an existing one while excluding some columns. Let us walk through an example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))
df
          A         B         C         D
0 -1.023134 -0.130241 -0.675639 -0.985182
1  0.270465 -1.099458 -1.114871  3.203371
2 -0.340572  0.913594 -0.387428  0.867702
3 -0.487784  0.465429 -1.344002  1.216967
4  1.433862 -0.172795 -1.656147  0.061359
df_new = df[df.columns.difference(['B', 'D'])]
df_new
          A         C
0 -1.023134 -0.675639
1  0.270465 -1.114871
2 -0.340572 -0.387428
3 -0.487784 -1.344002
4  1.433862 -1.656147
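One detail that the example above hides, because its columns are already alphabetical, is that difference() sorts the remaining labels by default. The sketch below uses a small made-up dataframe (its column names and values are assumptions for illustration only) to show the default behaviour next to sort=False, which leaves the remaining labels in their original order:

```python
import pandas as pd

# Hypothetical frame whose columns are deliberately not in alphabetical order.
df = pd.DataFrame({"D": [1, 2], "B": [3, 4], "C": [5, 6], "A": [7, 8]})

# Default behaviour: the remaining labels come back sorted.
sorted_cols = list(df.columns.difference(["B"]))

# sort=False skips the sorting step, so the original column order is kept.
original_order_cols = list(df.columns.difference(["B"], sort=False))

print(sorted_cols)
print(original_order_cols)

# Either result can be used to build the reduced dataframe.
df_new = df[original_order_cols]
```

If the column order of the new dataframe matters, passing sort=False is probably the safer choice.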
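The function sklearn.compose.make_column_selector creates a callable that, when applied to a dataframe, returns a list of the selected column names rather than the data itself. It selects columns by data type, using pandas.DataFrame.select_dtypes internally, and can also select columns based on a regular-expression pattern in their names: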
import pandas as pd
from sklearn.compose import make_column_selector
data = pd.read_csv("../datasets/house_prices.csv", na_values="?")
# The selector returns the matching column names, not the columns' data.
garage_columns = make_column_selector(pattern="Garage")(data)
garage_columns
['GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars',
'GarageArea', 'GarageQual', 'GarageCond']
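Since the CSV file above may not be available to every reader, here is a minimal self-contained sketch of the same idea. The small dataframe is a made-up stand-in for the house prices data (its values are assumptions, not taken from the real file). It shows selection by name pattern, by data type, and by both at once:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical stand-in for the house prices data.
data = pd.DataFrame({
    "GarageType": ["Attchd", "Detchd"],
    "GarageArea": [548.0, 460.0],
    "LotArea": [8450.0, 9600.0],
    "Street": ["Pave", "Pave"],
})

# Select by a regular expression matched against the column names.
garage_columns = make_column_selector(pattern="Garage")(data)

# Select by data type, as pandas.DataFrame.select_dtypes would.
numeric_columns = make_column_selector(dtype_include=np.number)(data)

# Combine both criteria: numeric columns whose name contains "Garage".
numeric_garage_columns = make_column_selector(
    pattern="Garage", dtype_include=np.number
)(data)

print(garage_columns)
print(numeric_columns)
print(numeric_garage_columns)
```

So the practical difference is that columns.difference() excludes labels you name explicitly, while make_column_selector picks columns by a rule (name pattern or data type) and hands back a plain list of names.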