Preprocessing in reality

Is it a good idea to train a model on standardised data when we want to deploy this model in a real-time environment?
I mean, if we train and test the model on standardised data, without outliers in it, I think it will have problems when new 'test' data comes in. Also, the data that is logged sometimes has different noise, and this noise may change from time to time; does that cause problems for the model?
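To make the question concrete, here is roughly the setup I have in mind (a minimal sketch with made-up data; the scaler and estimator choices are just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for whatever we logged at training time.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler is fitted on the training data only; at prediction time the
# same mean/scale learned here are re-applied to incoming samples.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# In production, new samples are scaled with the training-time statistics,
# even if their distribution has shifted in the meantime.
predictions = model.predict(X_test)
```

My worry is what happens when the new samples no longer look like `X_train`.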

2 Likes

Hi,

I think that is a very good question.

Here we are hitting the problem of data drift / concept drift. I believe scikit-learn contributors are working on it (it was on the roadmap, wasn't it? If it is not ready, do we have an expected time for that?), and that would be a very important feature to add to a pipeline before considering deploying a model.

2 Likes

Indeed, if the distribution drifts over time, the performance of the model is expected to degrade in uncontrolled ways.

Several things can be done to address this problem:

  • First, detecting the drift:

    • This can be achieved by training an unsupervised novelty detection model as a companion to the supervised model to be deployed in production, deploying both models together, and then monitoring the volume of novelties detected over time in a dashboard tool, with alerts if it increases significantly (see the sketch after this list).

    • It is also possible to monitor the evolution of histograms of the values of the most predictive input features of the supervised model, and of the predictions of this model, over time windows (e.g. one histogram per variable per day), and to display them on a dashboard with alerts to detect significant changes, for instance by measuring the distance to reference histograms computed via cross-validation on the training set of the model before deployment.

  • Second, mitigating the drift. This is more complex. There is no easy solution in general, and the safest thing to do is probably to continuously collect new labels on new datapoints to enrich the existing labeled dataset with fresher data. Then retrain new versions of the model on newer versions of the data and re-deploy them continuously (for instance once every day or once every week).
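For the first point, a rough sketch of the companion novelty detector idea could look like the following (all names and thresholds here are illustrative, not a prescribed recipe):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest, RandomForestClassifier

# Made-up training data standing in for the real dataset.
X_train, y_train = make_classification(n_samples=2000, random_state=0)

# Supervised model to be deployed in production.
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Companion model: an unsupervised outlier/novelty detector fitted on the
# same training data as the supervised model.
detector = IsolationForest(random_state=0).fit(X_train)

def novelty_rate(X_new):
    """Fraction of incoming samples flagged as novelties (IsolationForest returns -1)."""
    return np.mean(detector.predict(X_new) == -1)

# In production, log this rate for each batch (e.g. per day), plot it in a
# dashboard and raise an alert if it increases significantly over its baseline.
X_batch = X_train[:200]  # stand-in for a batch of freshly logged data
print(f"novelty rate: {novelty_rate(X_batch):.1%}")
```

The absolute value of the novelty rate matters less than how it evolves compared to its baseline measured on held-out training data before deployment.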

Also, keep in mind that sometimes the distribution of the data you collect is itself influenced by the deployed model (for instance, this is typically the case with recommender systems). In this case, it can be important to go beyond traditional supervised learning and consider trying to model causal/interventional effects with randomized controlled experiments, A/B testing, and bandit algorithms, but this is far beyond the scope of this MOOC.

If you want to learn more about practical recommendations from people with practical ML expertise, you might want to read the publications of Daniel Sculley:

and follow the blog and social media accounts of Chip Huyen:

3 Likes

Just a quick digression about tools for drift detection here.

I would say that our model will definitely drift in the real world. The question is how.

There are basically two kinds of drifts:

  • Data drift (sometimes also called feature drift), which is the change in the statistical properties of the input data, and so the trained model is not relevant anymore.

  • Concept drift, which is the change in the relationship between the data/features and the target.

Therefore, it is crucial to be able to detect in time those drifts with a data validation component implemented in an automated pipeline.
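As a very small illustration of such a check (just a sketch, assuming you keep a reference sample from training time; the two-sample Kolmogorov–Smirnov test is only one possible choice of statistic):

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic stand-ins: a reference sample kept from training time and a
# batch of recent production data whose first features have shifted.
rng = np.random.default_rng(0)
X_reference = rng.normal(loc=0.0, scale=1.0, size=(5000, 3))
X_production = rng.normal(loc=0.3, scale=1.0, size=(1000, 3))

# Per-feature two-sample KS test between reference and production data.
for feature_idx in range(X_reference.shape[1]):
    stat, p_value = ks_2samp(X_reference[:, feature_idx],
                             X_production[:, feature_idx])
    drifted = p_value < 0.01  # arbitrary threshold; tune and alert as needed
    print(f"feature {feature_idx}: KS={stat:.3f}, p={p_value:.3g}, drift={drifted}")
```

In an automated pipeline, this kind of check would run on every new batch of logged data and feed an alerting system rather than print to the console.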

The main Cloud providers, e.g. AWS, Azure and GCP, provide drift detection in their ML platforms, but there is some lock-in here.

I was hoping for a component built into scikit-learn for detecting drift. Of course, we don't want to reinvent the wheel, so any advice on good open-source libraries would be a great help.

I have tried the TensorFlow Data Validation library from TensorFlow Extended (an end-to-end MLOps pipeline), but it can look oversized for simple ML models. Otherwise, it's very useful when properly implemented.
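For reference, the basic workflow I used looks roughly like this (written from memory, so please double-check the exact function names against the TFDV documentation; the CSV paths are hypothetical):

```python
import pandas as pd
import tensorflow_data_validation as tfdv

# Hypothetical files: training data and freshly logged serving data.
train_df = pd.read_csv("train.csv")
serving_df = pd.read_csv("serving.csv")

# Compute statistics on the training data and infer a schema from them.
train_stats = tfdv.generate_statistics_from_dataframe(train_df)
schema = tfdv.infer_schema(statistics=train_stats)

# Validate statistics of the new data against the schema to surface anomalies.
serving_stats = tfdv.generate_statistics_from_dataframe(serving_df)
anomalies = tfdv.validate_statistics(statistics=serving_stats, schema=schema)
tfdv.display_anomalies(anomalies)
```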

Looking for more flexibility, I’ve heard about scikit-multiflow but have not tried it. Do you guys have any comment about this library?

There are probably other good tools and libraries for detecting drift.

PS: Chip Huyen is awesome! It's worth following her for the latest updates about MLOps.

2 Likes

Some of us might invest more effort in MLOps tooling and best practices in the future, but as of now, we don't have much to offer as part of the scikit-learn project itself.

:ok_hand: Thanks for updating.