Redundant columns

Can you please provide some examples of problems caused by redundant features for machine learning algorithms?

  • Some estimators, such as sklearn.linear_model.LinearRegression, can be numerically unstable or even have convergence issues on data with multicollinearity (see the first sketch after this list).

  • If you have too many features compared to the number of samples, and those features are correlated and noisy, then your model will likely overfit more than a model trained on a smaller set of features carrying the same information. This is true for any estimator class (see the second sketch after this list).

  • If you try to inspect the values of the coefficients to draw conclusions about which features are statistically associated with the target, then correlated features make this complicated.
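
Here is a minimal sketch of the first and last points, using a made-up synthetic dataset: one feature is duplicated with a tiny amount of noise, and LinearRegression is refit on bootstrap resamples. The individual coefficients of the two near-identical columns swing wildly from one resample to the next, even though their sum (and hence the predictions) stays stable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# Synthetic data: the target only depends on a single informative feature x0.
n_samples = 200
x0 = rng.normal(size=n_samples)
y = 3.0 * x0 + rng.normal(scale=0.5, size=n_samples)

# x1 is a near-exact duplicate of x0: strong multicollinearity.
x1 = x0 + rng.normal(scale=1e-3, size=n_samples)
X = np.column_stack([x0, x1])

# Refit on bootstrap resamples: each individual coefficient is unstable,
# only their sum (the total effect of the shared signal) is well identified.
for seed in range(5):
    idx = np.random.RandomState(seed).randint(0, n_samples, n_samples)
    coef = LinearRegression().fit(X[idx], y[idx]).coef_
    print(f"coef_x0={coef[0]:9.2f}  coef_x1={coef[1]:9.2f}  sum={coef.sum():5.2f}")
```

How the total effect is split between the two redundant columns is essentially arbitrary, which is exactly what makes coefficient-based interpretation fragile on such data.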
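
And a second sketch for the overfitting point, again on synthetic data with arbitrary sizes: the same LinearRegression is trained once on a handful of informative features and once on those features plus many noisy, correlated copies of them, with few samples. The redundant version fits the training set almost perfectly but generalizes much worse.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
n_samples, n_informative, n_redundant = 60, 5, 200

# A handful of informative features and a noisy linear target.
X_info = rng.normal(size=(n_samples, n_informative))
y = X_info @ rng.normal(size=n_informative) + rng.normal(scale=0.5, size=n_samples)

# Many redundant columns: noisy copies of randomly picked informative features.
copies = X_info[:, rng.randint(0, n_informative, size=n_redundant)]
X_big = np.hstack([X_info, copies + rng.normal(scale=0.5, size=copies.shape)])

for name, X in [("informative only", X_info), ("with redundant copies", X_big)]:
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print(
        f"{name:>22}: train R2 = {model.score(X_train, y_train):.2f}, "
        f"test R2 = {model.score(X_test, y_test):.2f}"
    )
```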

For more details on the last point (interpreting the coefficients of a model fit on correlated features), see:

That being said, redundant features are not necessarily that big of a problem in practice if your main goal is to build a pipeline with the highest cross-validation accuracy, and drawing statistical conclusions from the fitted model is not your main concern.
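
As an illustration of that last point, here is a small sketch (the synthetic classification task and the choice of a random forest are arbitrary) where adding a perfectly redundant column barely moves the cross-validated accuracy:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Add a perfectly redundant feature: an exact copy of the first column.
X_redundant = np.hstack([X, X[:, [0]]])

clf = RandomForestClassifier(random_state=0)
print("without duplicate:", cross_val_score(clf, X, y, cv=5).mean())
print("with duplicate   :", cross_val_score(clf, X_redundant, y, cv=5).mean())
```

The two scores are typically within cross-validation noise of each other, although the exact impact depends on the estimator and the dataset.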