Forgot to add standard scaler: Convergence warning

I set up only a SimpleImputer for preprocessing the numerical data by mistake, yet the cross-validation went through and returned test results, but with a convergence warning. Even increasing the maximum number of iterations in the prediction model to 100000 didn’t fix the warning. It went away only when I added a standard scaler.

  1. Can someone explain what’s unique about this dataset that leads to a convergence warning even after increasing the maximum number of iterations?

  2. How should I decide whether or not to use a standard scaler for a numerical dataset? Is there a rule of thumb based on the number of samples and/or the number of numerical features?

Many numerical solvers used to train machine learning models can only converge if the underlying numerical optimization problem is well-behaved (also known as “not ill-conditioned”). Explaining the details of numerical optimization is beyond the scope of this MOOC, but in practice one way to avoid such numerical problems is to make sure that the input features of a machine learning model such as logistic regression are approximately on the same scale. This means that if feature “a” is on a scale of -1 to 1, having feature “b” on a scale of 0 to 10 is perfectly fine, but if it is on a scale of 0 to 100000 then you might run into numerical problems preventing the optimizer from converging.

The message of the scikit-learn convergence warning should suggest using a preprocessor (such as StandardScaler) to scale the features to avoid this problem.
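Here is a minimal sketch of what that looks like in practice (the dataset below is synthetic, generated only to mimic one feature being on a much larger scale than the others; it is not your data):

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (not the original dataset): one feature is put on a much
# larger scale than the others to mimic an ill-conditioned problem.
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X[:, 0] *= 100_000

# Chaining the imputer with a scaler before the linear model brings all
# features to comparable ranges, which typically silences the warning.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(),
)
cv_results = cross_validate(model, X, y)
print(cv_results["test_score"])
```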

Another way to make it easier to converge would be to decrease the value of the parameter C (which will increase regularization), but this can have a strong impact on the cross-validation performance of the model, as will be explained later in the module on linear models.
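For illustration, a hedged sketch (the value of C below is arbitrary, chosen only to show the direction of the change, not a recommendation):

```python
from sklearn.linear_model import LogisticRegression

# Smaller C means stronger regularization, which makes the optimization
# problem better conditioned and easier to solve (the default C is 1.0).
model = LogisticRegression(C=0.01)
```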

If you really want to know the mathematical details and have a background in linear algebra and numerical methods, you can learn more about the numerical optimization problem at hand by following videos 15.1 to 15.7 in the YouTube playlist starting with “(ML 15.1) Newton's method (for optimization) - intuition”, and by reading about ill-conditioned optimization problems.

Numerical scaling is often required, or at least useful, for many machine learning models such as logistic regression, linear regression, ridge regression, support vector machines, neural networks, and k-nearest neighbors. It is, however, unnecessary for tree-based models.
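To see why scaling does not matter for trees, here is a small hedged sketch (synthetic data again, not your dataset): decision trees split on per-feature thresholds, and those thresholds adapt to any monotonic rescaling of a feature, so the fitted tree should make the same predictions with or without scaling.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with one feature on a much larger scale than the others.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X[:, 0] *= 100_000

X_scaled = StandardScaler().fit_transform(X)
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# The split thresholds adapt to the rescaled features, so both trees
# should produce the same predictions.
print(np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled)))
```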

Hello, thank you very much for the quick response. Yes, the warning did suggest using StandardScaler; I forgot to mention that information in my post. Thanks for the detailed explanation and for sharing the link as well. I will have a look at the mathematical details, as I am quite interested in learning about the background calculations.

Best Regards
Ajay