With the evolution of technology, businesses are increasingly adopting machine learning and data analytics to solve business problems. According to Forbes, 50% of all enterprises planned to invest more in ML and AI in 2021.

With this shift towards machine learning, data scientists often have to deal with challenges when models are in production. Last year, 38% of organizations reported that their data scientists spent 50% of their time in model deployment.

One such challenge is drift in model predictions. Model drift typically refers to shifts in the predictions made by the model; what the model forecasts today differs from what it previously predicted. As a result, model predictions can become less reliable in the long run. This occurrence is particularly evident in time series forecast models, where some of the factors that can affect the accuracy of predictions are:

- Shifts in consumer behavior
- Changing monetary policies
- Economic cycle changes
- Competitor activity

So, how can we respond to drifts in forecast models? We can use either a reactive or proactive approach to ensure model reliability. However, before delving deeper into the two different strategies, we have to familiarize ourselves with a couple of terms:

Within the machine learning community, the concept of *reliability* covers closely related ideas of uncertainty in model predictions. It also encompasses failures to generalize to production data. In this article, model reliability refers to how dependable a model's predictions are at a specific point in time: you can trust a reliable model's forecasts up to a certain period, after which you might want to replace it.

Model availability, in turn, means the availability of a model that satisfies production requirements for reliability. With this terminology in place, let's go over the two strategies:

One solution to drift in model predictions is to adopt a reactive approach and establish thresholds based on forecast-related business rules.

This way, a drift event is detected when the difference between the actual time series and the forecasts crosses a particular threshold. Crossing the threshold would then trigger retraining of existing models, or further analysis, in an automated setting.
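As a minimal sketch, a reactive check of this kind might look as follows (the function name, the use of MAPE as the difference measure, and the 10% threshold are illustrative assumptions, not a prescribed configuration):

```python
import numpy as np

def detect_drift(actuals, forecasts, threshold=0.10):
    """Flag a drift event when the mean absolute percentage error
    between actuals and forecasts crosses a business-rule threshold.
    The 10% default here is illustrative only."""
    actuals = np.asarray(actuals, dtype=float)
    forecasts = np.asarray(forecasts, dtype=float)
    mape = np.mean(np.abs((actuals - forecasts) / actuals))
    return mape > threshold

# Forecasts that are off by ~20% on average trigger the event
print(detect_drift([100, 110, 120], [80, 88, 96]))  # True
```

In an automated setting, a `True` result from a check like this would enqueue the model for retraining or route it to an analyst for review.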

Employing a reactive approach to drift events is reasonable if:

- Such events occur rarely

- Retraining expenses are low due to the low number of models in production

- Model availability does not have strict requirements

However, this approach might not be practical in industries where the number of models in production is high. There, a proactive approach is more suitable.

In industries like retail or logistics, the number of time series that need to be modeled can easily cross thousands. Additionally, these time series may be closely related and belong to the same domain, material, or geography.

In this case, responding to drift events at such a vast scale is unfeasible. That's because responding with further analysis or model retraining can take up a lot of analysis and computational time.

We need to ensure that the model is reliable in production, has decent uptime, and does not drift away and trigger frequent retraining. For this, we need a strategy that can foresee model drift in advance, which is where a proactive approach fits. Its objectives are to:

- Be able to meet SLA requirements related to model availability.
- Analyze these cases for root cause analysis in a pre-production setting.
- Calculate computational requirements for retraining in advance (if the frequency of these events across models is known).

A proactive approach can provide a reasonable estimate of these events even while the model is still in a pre-production state.

An excellent approach to estimating reliability in time series forecast models is through survival analysis. You can create a survival model to predict model decay and perform a proactive analysis if:

- Significant history is available on model performance
- Numerous closely related models are available

Proactive analysis helps you quantify both model availability under different scenarios and computational requirements for retraining. Moreover, it will help you pre-plan. Before we demonstrate how to use survival analysis to predict the kind of decay expected from a given set of time series, we'll have a brief overview of how the survival model works.

Models that predict the time to an event are called survival models. Survival models fall under a branch of statistics that deals with the analysis of time to an event, called survival analysis.

For example, the time before a loan default or the time before an equipment failure can be modeled with survival models. (Loan default and equipment failure are the "events" in these scenarios.) You can model the survival probability distribution as follows:

S(t) = P(T > t)

Where,

T: the waiting time until the occurrence of an event

S(t): the probability that the waiting time T is greater than some time t, i.e., the probability of survival until time t

From here, we can move on to the concept of censoring. If the time at which the event occurs is unknown, the observation is said to be censored. If the observation period of an experiment expires before the subject faces an event (it may still face one in the future), the observation is called *right censored*. Right censoring is the type most relevant to our study of forecasting model reliability.

Survival modeling techniques fall broadly into two categories:

Parametric models make assumptions about the form of the survival probability distribution. They can be good choices if their assumptions are backed by domain knowledge.

Non-parametric models empirically estimate the probability distribution from the available data instead of making assumptions. The end goal in both cases is to acquire a survival function that, given a particular point in time, outputs the survival probability up to that point.

Using the M5 forecasting dataset as an example:

Now, we will take an example of a reliability problem in a time series model and walk through its solution, using the M5 forecasting dataset. This dataset was used in the fifth edition of the M-series forecasting competitions, directed by Professor Spyros Makridakis, and is publicly available.

The data includes sales at different levels of the hierarchy, such as item, department, product category, and store. In addition, it covers stores in the US states of California, Texas, and Wisconsin.

To undertake this study, we will take some special considerations in two different pipeline phases:

- The pre-processing stage
- The evaluation metric generation

The experiment design is as follows:

Preprocessing involves filtering the stores and departments that interest us, depending on the configuration. The sales columns are then melted (converted from wide to long format) into a single column. The sales data is presented at the item level, so you can sum it up at each store-department level. As a result, you acquire store-level sales data for each department.
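The preprocessing steps above can be sketched in pandas as follows (the toy table is illustrative, with column names loosely following the M5 layout):

```python
import pandas as pd

# Toy stand-in for the M5 sales table: one row per item,
# day columns d_1, d_2, ... as in the M5 layout
sales = pd.DataFrame({
    "item_id": ["A1", "A2", "B1"],
    "store_id": ["CA_1", "CA_1", "CA_1"],
    "dept_id": ["HOBBIES_1", "HOBBIES_1", "FOODS_1"],
    "d_1": [3, 1, 5],
    "d_2": [2, 0, 4],
})

# Melt: wide day columns -> one long (item, day, sales) column
long = sales.melt(
    id_vars=["item_id", "store_id", "dept_id"],
    var_name="day", value_name="sales",
)

# Sum item-level sales up to the store-department level
store_dept = (
    long.groupby(["store_id", "dept_id", "day"], as_index=False)["sales"].sum()
)
print(store_dept)
```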

A Prophet model, a well-known time series forecasting model built by Facebook research, is created for every store department. These models are developed using default parameters, with a one-year test forecast horizon.

Here, we will walk through all the steps required to perform survival modeling. The two utilities for this are:

- The first utility analyzes the test set and computes MAPE (mean absolute percentage error), a common metric for assessing time series forecasts, over an expanding window from week 4 to week 26 (six months).

By assuming that the time to event for a forecasting model in production is the period before its performance deteriorates beyond a specific threshold, we can use survival analysis to model this waiting period and proactively take measures to ensure model dependability.

We can calculate the time before a model degrades, decays, or drifts by applying a metric like expanding window MAPE, which generates evaluation scores for a model against time. Then, we can model that time in turn.
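A sketch of such an expanding-window MAPE utility (the function name and toy values are illustrative):

```python
import numpy as np
import pandas as pd

def expanding_mape(actual, predicted):
    """Expanding-window MAPE: the score at position t evaluates
    the forecast from the start of the test set up to t (in %)."""
    actual = pd.Series(actual, dtype=float)
    predicted = pd.Series(predicted, dtype=float)
    ape = ((actual - predicted).abs() / actual) * 100
    return ape.expanding().mean()

# Errors that grow over time make the expanding MAPE drift upward
actual = [100, 100, 100, 100]
pred = [99, 95, 90, 80]
print(expanding_mape(actual, pred).round(2).tolist())  # [1.0, 3.0, 5.33, 9.0]
```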

- While the first utility provides a time-based evaluation metric on the test set, the second utility sets a threshold on that metric and records the first week when performance declines beyond it as an event.

The MAPE threshold in the configuration is 10. Simply put, we mark an event at the first week when the expanding MAPE crosses 10. Store-department models that do not cross the threshold by the end of the 26-week study are designated as censored.
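A sketch of the second utility under these assumptions (the function name and synthetic MAPE series are illustrative; the threshold of 10 and the 26-week study length follow the configuration described above):

```python
import pandas as pd

def first_event_week(mape_by_week, threshold=10.0, study_weeks=26):
    """Return (duration, observed) for survival modeling:
    the first week the expanding MAPE exceeds the threshold,
    or (study_weeks, 0) if it never does (right-censored).
    mape_by_week: Series indexed by week number (4..26 here)."""
    crossed = mape_by_week[mape_by_week > threshold]
    if crossed.empty:
        return study_weeks, 0          # censored at end of study
    return int(crossed.index[0]), 1    # event observed

# Synthetic expanding MAPE that rises by 0.5 per week from 6
weeks = range(4, 27)
mape = pd.Series([6 + 0.5 * (w - 4) for w in weeks], index=weeks)
print(first_event_week(mape))  # (13, 1): first week above 10
```

Running this over every store-department model yields the (duration, event/censor) pairs that the survival model consumes.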

The Kaplan-Meier estimator is typically the first baseline model for survival analysis. It is a non-parametric estimator of the survival function. Let's assume that there is no domain knowledge and there are no covariates to inform us here. In this scenario, we are going to use this estimator for illustration. It is defined as:

S(t) = ∏ (1 − dᵢ/nᵢ), taken over all event times tᵢ ≤ t

Where,

dᵢ: the number of events that happened at time tᵢ

nᵢ: the number of subjects that did not face an event up to time tᵢ

Now, we accumulate all event observations and censored observations from the store-department models and build Kaplan-Meier models for every department.

Fitting the Kaplan-Meier estimator gives the survival function plotted below:

From this figure, we can read the survival probability of models belonging to the Hobbies and Food departments for every week in the study.

**Probability of survival at the start of the study:** 1.0

**Median Survival Time (Time where the probability of survival is 0.5):** 4 weeks for both departments.

We can ensure model reliability by taking preemptive measures like scheduling retraining and conducting analysis without waiting for a performance event in production. To do this, we need the survival function estimated above.

An example setup could use the 95th percentile survival time of 2 weeks and retrain all store-level models in a specific department before this limit is reached in production. As a result, it can ensure that model degradation stays below the threshold with the given probability.

Once we take proactive steps like the ones mentioned using survival analysis, we can facilitate model availability. We can also use this method with models from various categories and across modeling approaches for cases other than retail and logistics.

Moreover, as this approach involves new metrics for generating time-to-failure events, it opens up many opportunities for studying model reliability in production. Furthermore, these metrics also serve as individual evaluation metrics, which can be valuable for time series forecasting models.

While this article is specific to issues in forecast modeling, data scientists can apply the same principle to other modeling approaches, given the appropriate evaluation metrics.

Confiz helps you transform and succeed using technology, insights, and innovation.
