Introduction
Model drift refers to the decline in model performance over time due to changes in the data and in the relationships the model has learned. Most drift is caused by factors entirely out of our control, so while we can’t stop it from happening, we can identify and mitigate it.
Feature Drift
Also known as Data Drift, Feature Drift is a change in the distribution of the input features over time. This can take many forms: new values appearing in categorical features, a change in the relationships between features, covariate shift, declining data quality, or natural and external shifts in the data (such as temperature varying across the year). Because it concerns the data the model consumes, Feature Drift is the easiest type of drift to detect and monitor. Analysing summary statistics over time is the most common way to detect it, but there are many measures designed for this very purpose, including the Population Stability Index (PSI), Kullback–Leibler divergence and Jensen–Shannon divergence. Tracking these measures over time allows us to identify drift and even set up automated alerts.
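As an illustration, a minimal PSI calculation with NumPy might look like the sketch below. The bin count, the simulated distributions and the rule-of-thumb thresholds in the comment are assumptions for the example, not a standard implementation.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Compare two numeric samples by binning on the reference
    distribution and summing (a - e) * ln(a / e) over the bins."""
    # Bin edges come from the reference (training-time) sample.
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    # Convert counts to proportions; a small epsilon avoids log(0).
    eps = 1e-6
    e_pct = e_counts / e_counts.sum() + eps
    a_pct = a_counts / a_counts.sum() + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)   # reference feature values at training time
prod = rng.normal(0.5, 1, 10_000)  # shifted values seen in production
psi = population_stability_index(train, prod)
# A common rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
```

Computing this per feature on a schedule, and alerting when the value crosses a threshold, is one straightforward way to automate the monitoring described above.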
Once Feature Drift is detected, it is recommended to investigate the feature generation process to identify the cause. Depending on what we find, we can rectify data quality issues or retrain our model on the new, shifted data.
Label Drift
Label Drift is similar to Feature Drift, except that it is the distribution of the label, rather than of the features, that changes over time. This means it is also caused by outside influences, is relatively easy to detect, and can be investigated and mitigated with the same approach.
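For a categorical label, one simple check is the Jensen–Shannon distance between the training-time and recent label distributions. A sketch using SciPy follows; the class names and proportions are invented for illustration.

```python
from collections import Counter

import numpy as np
from scipy.spatial.distance import jensenshannon

def label_distribution(labels, classes):
    """Turn a list of labels into a probability vector over `classes`."""
    counts = Counter(labels)
    total = len(labels)
    return np.array([counts.get(c, 0) / total for c in classes])

classes = ["churn", "stay"]
train_labels = ["stay"] * 900 + ["churn"] * 100   # 10% churn rate at training time
recent_labels = ["stay"] * 700 + ["churn"] * 300  # 30% churn rate in production
p = label_distribution(train_labels, classes)
q = label_distribution(recent_labels, classes)
# Jensen-Shannon distance: 0 means identical distributions, 1 means disjoint.
distance = jensenshannon(p, q, base=2)
```

Note that `scipy.spatial.distance.jensenshannon` returns the distance (the square root of the divergence), so the same thresholding logic can be reused across labels of any cardinality.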
Prediction Drift
Prediction Drift occurs when the distribution of the model’s predictions deviates over time. It is closely related to Label Drift, but whereas Label Drift is caused by external label data shifting, Prediction Drift is caused by internal factors shifting the model’s outputs. Unlike the previous two examples, Prediction Drift doesn’t necessarily indicate model decay. For example, your predictions may become more skewed in one direction over time while remaining entirely accurate, in which case your model is working as expected.
The best place to investigate the cause of Prediction Drift is the model training process. If the drift is a result of feature or label drift, it’s recommended that we address that and retrain the model. If the Prediction Drift accurately reflects real-life values (i.e. the predictions are still correct), it’s a matter of assessing and addressing the business impact of these changes in predictions.
Concept Drift
The final type of drift is Concept Drift. This is when external factors cause the meaning of the labels to evolve, so that the underlying patterns the model learned from the data no longer hold.
Concept Drift can occur in several ways: suddenly (an event occurs that instantly changes the accuracy of predictions; a good example of this is the Covid-19 pandemic); gradually and incrementally over time; or intermittently and recurring, where the drift is predictable (such as Black Friday sales every year).
Concept Drift doesn’t necessarily occur alongside Feature and Label Drift, which makes it very difficult to detect; however, monitoring Prediction Drift, and model performance where true labels become available, may alert you to it.
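Where true labels do eventually arrive, a sliding-window accuracy monitor is a simple Concept Drift alarm. A minimal sketch follows; the window size, reference accuracy and tolerance are illustrative assumptions, not recommended defaults.

```python
from collections import deque

class AccuracyMonitor:
    """Sliding-window accuracy tracker: a sustained drop below the
    reference accuracy minus a tolerance suggests concept drift."""

    def __init__(self, window=500, reference_accuracy=0.9, tolerance=0.05):
        self.outcomes = deque(maxlen=window)  # stores True/False per prediction
        self.reference = reference_accuracy
        self.tolerance = tolerance

    def update(self, prediction, label):
        self.outcomes.append(prediction == label)

    def drift_suspected(self):
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # wait until the window is full
        accuracy = sum(self.outcomes) / len(self.outcomes)
        return accuracy < self.reference - self.tolerance

monitor = AccuracyMonitor(window=500, reference_accuracy=0.9, tolerance=0.05)
for _ in range(500):
    monitor.update(1, 1)                       # model performing as expected
healthy = monitor.drift_suspected()            # False: windowed accuracy is 1.0
for i in range(500):
    monitor.update(1, 1 if i % 5 else 0)       # accuracy degrades to 0.8
alarm = monitor.drift_suspected()              # True: 0.8 < 0.9 - 0.05
```

The labels often arrive with a delay, so an alarm like this detects Concept Drift later than a prediction-distribution monitor would, but with far fewer false positives.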
Resolving Concept Drift is also more difficult than in the previous examples. The first course of action is to investigate the impact on the underlying patterns in the data; we may decide that introducing new features and additional feature engineering will address the drift. In more serious cases, we might need to reconsider the model entirely, as the new data behaviour might lend itself better to a different model.
Author
Tori Tompkins