Machine learning model drifts and benchmarks
Industries and workplaces have already changed considerably with the advent of data science, data analytics, big data, machine learning and artificial intelligence (AI). Over the past decade, these technologies have helped us largely automate and improve processes such as lending, fraud detection, forecasting customer churn, optimizing staffing and scheduling, and many other more or less complex tasks. They also power recommendation engines based on our past behavior. A prime example is the suggestion of which title to watch next on YouTube. The same applies to other streaming services: wherever we consume movies, songs and other media, suggestions can be derived from the interpreted taste of the user. Besides such rather «simple» modeling tasks, there are also much more complex application areas of machine learning and neural networks, such as computer vision, speech recognition, text comprehension, complex reasoning and various interdisciplinary domains.
While much has been written about the high-level concepts and application areas of data science, data analytics, big data, machine learning and AI, and much research has been done on them, central aspects such as model drift and benchmarks are often only mentioned in passing. The following is therefore a short introduction to these topics.
What are machine learning models?
First and foremost: we generally speak of machine learning when we load a large amount of data into a computer program and select or configure a model that explains the data as accurately as possible, so that the program can make predictions for new inputs that resemble the previous data. How the computer creates such models is determined by different types of algorithms. These range from simple equations, such as the equation of a straight line, to very complex systems of logic and mathematics that guide the computer to the best possible predictions. In this process, features are identified that best explain the data at hand.
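To make the idea concrete, here is a minimal pure-Python sketch of the simplest kind of model mentioned above: fitting a straight line to data by ordinary least squares. The data points are made up for illustration.

```python
# Minimal sketch: fit a straight line y = a*x + b to data points by
# ordinary least squares, then use it to predict new values.
# The data below is invented purely for illustration.

def fit_line(xs, ys):
    """Return slope a and intercept b minimizing the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var            # slope
    b = mean_y - a * mean_x  # intercept
    return a, b

def predict(a, b, x):
    """Apply the fitted model to a new input."""
    return a * x + b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]   # roughly y = 2x, with some noise
a, b = fit_line(xs, ys)           # a ≈ 1.97, b ≈ 0.11
```

In real projects a library such as scikit-learn would do this (and far more complex fits), but the principle is the same: the algorithm determines the parameters that best explain the data, and those parameters are then used for predictions.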
The term machine learning is aptly chosen: once the model has been created and fine-tuned (i.e. improved by adjustments), the machine uses it to learn the patterns in the data, which in turn helps it make ever better predictions.
In general, machine learning distinguishes between three types of problems: supervised learning, where data on a problem and its outcome are known (often referred to as labelled data); unsupervised learning, where this is not the case; and reinforcement learning, which is used in particular for problems where no optimal solution is known in advance. More about these problem domains on another occasion.
What is a model drift?
Model drifts can occur when there is some form of change to feature data or target dependencies. We can roughly divide these drifts into the following three categories: concept drift, data drift and upstream data changes.
Concept drift
When the statistical properties of the target variable change, the concept of what you want to predict also changes. For example, the definition of what constitutes a fraudulent transaction may change over time as new ways of conducting such illegal transactions are developed. This type of change leads to a concept drift.
Data drift
Data drift, as the name suggests, is about the raw data used for model creation. As mentioned above, the features used to train a model are calculated from the original input/training data. If the statistical properties of the incoming data change relative to that training data, the quality of the model will subsequently suffer. For example, changes in the data due to seasonality, shifting personal preferences, trends and so on will cause a drift in the incoming data and thus in the model in use.
Upstream data change
Last but not least, operational changes sometimes occur in the upstream data pipeline, and these can affect model quality too. For example, changes to the feature encoding, such as switching from Fahrenheit to Celsius, or features that are no longer generated and therefore arrive as null or missing values, can lead to unwanted upstream data changes.
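The Fahrenheit-to-Celsius case can be sketched in a few lines. The threshold and readings below are invented for illustration; the point is that the same physical temperature produces the opposite model decision once the upstream unit changes silently.

```python
# Hypothetical illustration: an upstream pipeline switch from Fahrenheit to
# Celsius silently shifts the scale of a feature the deployed model was
# trained on. Threshold and readings are made up for this sketch.

def f_to_c(temp_f):
    """Convert Fahrenheit to Celsius."""
    return (temp_f - 32) * 5 / 9

# Suppose the model learned a decision threshold on Fahrenheit data:
TRAINED_THRESHOLD = 86.0

reading_f = 95.0            # old pipeline emits Fahrenheit
reading_c = f_to_c(95.0)    # new pipeline emits Celsius: 35.0

# Same physical temperature, opposite decision:
old_decision = reading_f > TRAINED_THRESHOLD   # True
new_decision = reading_c > TRAINED_THRESHOLD   # False -- silent failure
```

Nothing in the model itself changed, which is exactly why such upstream changes are hard to spot without monitoring the input distributions.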
How model drifts can be identified and prevented
Given that such changes and drifts can occur after a model is deployed to production (i.e. into normal operations), the best practice is to watch for changes and take action as quickly as possible when they occur. A feedback loop from a monitoring system and regular retraining of models will help to avoid model staleness, or at least significantly reduce the probability of it happening.
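One simple form such monitoring can take is comparing the distribution of a feature at training time against recent production data. The sketch below uses a two-sample Kolmogorov-Smirnov statistic in pure Python; the samples and the alert threshold of 0.2 are assumptions for illustration, not recommended values.

```python
# Sketch of a drift monitor: compare the empirical distribution of a feature
# in the training data against recent production data using a two-sample
# Kolmogorov-Smirnov statistic. Samples and threshold are illustrative only.

def ks_statistic(sample_a, sample_b):
    """Maximum distance between the two empirical CDFs."""
    values = sorted(set(sample_a) | set(sample_b))

    def ecdf(sample, v):
        return sum(1 for x in sample if x <= v) / len(sample)

    return max(abs(ecdf(sample_a, v) - ecdf(sample_b, v)) for v in values)

def drifted(train_sample, live_sample, threshold=0.2):
    """Flag drift when the KS statistic exceeds an (assumed) threshold."""
    return ks_statistic(train_sample, live_sample) > threshold

train        = [10, 11, 12, 11, 10, 12, 11, 10]
live_ok      = [11, 10, 12, 11, 10, 11, 12, 10]   # same distribution
live_shifted = [18, 19, 20, 19, 18, 20, 19, 18]   # clearly shifted
```

In practice a library routine such as `scipy.stats.ks_2samp` would be used instead of a hand-rolled statistic, and the alert would feed the retraining loop described above.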
What are Model Benchmarks?
The creation of a model and its deployment are therefore obviously not the end of the development cycle. During deployment, the company must define relevant metrics and benchmarks for the model – so-called KPIs. If you have ever dealt with statistics, you may remember the concepts of true positives, true negatives, false positives and false negatives. They come into play again here.
In addition to the accuracy KPI, which tells you, for example, what proportion of all examined documents have been correctly classified – either as invoice or as not-invoice – precision and recall are also relevant. Precision differs from accuracy in that it shows how many documents correctly classified as invoices there are in relation to all documents classified as invoices. Recall (also called sensitivity), on the other hand, indicates the proportion of documents correctly classified as invoices in relation to all actual invoices among the documents examined.
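These three metrics follow directly from the four counts. The sketch below uses invented numbers for the invoice example to show how they are computed.

```python
# The three metrics described above, computed from the four confusion-matrix
# counts. The counts for the invoice example are invented for illustration.

def accuracy(tp, tn, fp, fn):
    """Proportion of all documents classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Of everything classified as invoice, how much really is one."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of all actual invoices, how many were found."""
    return tp / (tp + fn)

# Say a classifier examined 100 documents: 40 real invoices, 60 others.
tp, fn = 36, 4    # invoices classified correctly / missed
fp, tn = 6, 54    # others wrongly flagged as invoices / correctly rejected

acc  = accuracy(tp, tn, fp, fn)  # 0.90
prec = precision(tp, fp)         # 36 / 42 ≈ 0.857
rec  = recall(tp, fn)            # 36 / 40 = 0.90
```

Note how the three values differ even on the same predictions: accuracy looks at both classes, while precision and recall each focus on the invoice class from a different direction.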
As you can see, we cannot maximize precision and recall at the same time; there is an interdependence between them, which we can also visualize. One KPI that combines precision and recall and is often used to evaluate models is the F1 score. Another evaluation tool is the ROC curve. This is particularly common for binary classification and is very similar to the precision-recall curve. In contrast, however, it does not illustrate the relationship between precision and sensitivity, but the relationship between the true positive rate and the false positive rate.
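The F1 score and the two rates on a ROC curve are simple formulas over the same counts. The values below reuse the invented invoice numbers (precision 36/42, recall 36/40) purely for illustration.

```python
# F1 score and the two rates plotted on a ROC curve, using the same
# invented invoice counts as before (36 TP, 4 FN, 6 FP, 54 TN).

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def true_positive_rate(tp, fn):
    # Identical to recall / sensitivity; the y-axis of a ROC curve.
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    # Proportion of negatives wrongly flagged; the x-axis of a ROC curve.
    return fp / (fp + tn)

f1  = f1_score(36 / 42, 36 / 40)       # ≈ 0.878
tpr = true_positive_rate(36, 4)        # 0.90
fpr = false_positive_rate(6, 54)       # 0.10
```

A full ROC curve is obtained by sweeping the classifier's decision threshold and plotting (FPR, TPR) at each setting; the single point above corresponds to one fixed threshold.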
The benchmarks used in real life differ depending on the model and its use cases. An example: if you are developing a classification model for detecting cancer on X-rays, you would rather identify all potential cases (high recall) than flag only those where you are very sure (high precision). Elsewhere, the situation is quite different. Our customers, for example, place greater emphasis on accuracy. But if we were to optimize only for this metric, documents would be misclassified or text would be extracted incorrectly in places, which is not in the interests of our customers either. As you can see, the whole thing is quite a balancing act.
What we have not yet addressed are the fundamentals, i.e. the data on which we base these benchmarks for evaluating model performance. Since, as described above, the environment of real-world machine learning applications is often characterized by change, model outputs can change too. Again, an example to illustrate this: an algorithm has been trained to predict the sales price of a property based on various characteristics such as age, number of rooms and bedrooms, living space, population of the region, median income and other attributes. However, all the training data is derived from sales figures for one specific region. Applying this pricing model to the real estate market of another country is likely to run into large differences in local prices and therefore result in model drift. Drift could also occur if your data is not up to date, so that market changes in the meantime have not been priced in. It is therefore extremely important to establish what exactly the benchmark is, or should be, if you want to reliably evaluate the performance of your machine learning models. This is particularly relevant if you want to update a model and adapt it to new circumstances without worsening its performance on data types that are still relevant. A science in itself…