How to simulate real cases where the shape of the data is different from the shape of the training data

faten_l · 28 October 2022 10:43

To the best of my knowledge, when testing ML models, the testing data should have the same shape as the data used to train the model. However, in real cases the shape (i.e. the number of features) may be different, for example with network traffic. How can I use a pre-trained model in a real case where the shape of the data differs from the shape of the training data? Thank you in advance.

ogrisel · 28 October 2022 11:58

Machine learning models can never use features that were never seen at training time.

Note that some machine learning models such as HistGradientBoostingClassifier/Regressor can handle missing values quite well (typically represented as numpy.nan or pandas.NA). But ideally the frequency of occurrence of missing values in a given column should approximately be the same in the training and test data. Otherwise there is no guarantee that the models will perform well.

faten_l · 28 October 2022 12:45

Thank you for your reply!
So how should I deal with real situations where the data may be different? What could the solution be?

ogrisel · 31 October 2022 18:30

If the features are completely different between train and test, then machine learning is useless.

If the feature names and the feature values are meaningful English names, then you might try to preprocess the tabular data to treat it as “natural” English text, with a bit of Python code to generate a string such as:

“feature name 1 = value of feature name 1, feature name 42 = value of feature name 42, …”
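A minimal sketch of that bit of Python code could look like the following. The helper name row_to_text and the example feature names are hypothetical; the point is only that rows with different sets of features all become plain strings.

```python
def row_to_text(row: dict) -> str:
    """Serialize one record as 'name = value, name = value, ...'."""
    return ", ".join(f"{name} = {value}" for name, value in row.items())


# Two records with different features still map to the same kind of object: a string.
train_row = {"protocol": "tcp", "service": "http", "flag": "connection reset"}
test_row = {"service": "dns", "country": "France"}

train_text = row_to_text(train_row)
test_text = row_to_text(test_row)

print(train_text)  # protocol = tcp, service = http, flag = connection reset
print(test_text)   # service = dns, country = France
```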

Then transform that text into numerical features with a pre-trained language model such as BERT or a similar model from Hugging Face, using the transformers package.

You should get a fixed number of numerical dimensions (a few hundred or a few thousand) as a result, which will always be the same on both the training and test sets and hopefully quite meaningful. Then you can train your classifier on those numerical features instead of the original features.
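One possible way to do the embedding step is sketched below. It assumes the transformers and torch packages are installed and uses the bert-base-uncased checkpoint as an example. Averaging the token vectors (mean pooling) is one common choice for turning a sentence into a single vector, not the only one. The function names embed and mean_pool are made up for this sketch.

```python
import numpy as np


def mean_pool(token_vectors: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over real (non-padding) tokens.

    token_vectors: (n_texts, n_tokens, n_dims)
    attention_mask: (n_texts, n_tokens), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_vectors.dtype)
    return (token_vectors * mask).sum(axis=1) / mask.sum(axis=1)


def embed(texts, model_name="bert-base-uncased"):
    """Return one fixed-size vector per text (assumed setup, see lead-in)."""
    # Imported here because these are heavy dependencies and the first call
    # downloads the model weights.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()

    enc = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state  # (n_texts, n_tokens, n_dims)

    return mean_pool(hidden.numpy(), enc["attention_mask"].numpy())
```

The array returned by embed has the same number of columns for any input text, so it can be passed to the fit and predict methods of any scikit-learn classifier, whatever features the original rows contained.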

However, if the feature names and values are meaningless identifiers or continuous values, then this will not work.