reliability_and_calibrated_prediction.ipynb

https://github.com/pymc-devs/pymc-examples/blob/main/examples/case_studies/reliability_and_calibrated_prediction.ipynb

(Reliability Statistics and Predictive Calibration)=

Reliability Statistics and Predictive Calibration

:::{post} January, 2023 :tags: time-to-failure, prediction, calibration, survival analysis, censored :category: intermediate :author: Nathaniel Forde :::

:::{include} ../extra_installs.md :::

Reliability Statistics

When we want to make inferences about likely failures on a production line, we may have large or small sample data set depending on the industry, nature of the goods or specificifty of the question we're seeking to answer. But in all cases there is a question of cost and a quantity of tolerable failures.

A reliability study therefore has to account for the period in which a failure is important to observe, the cost of the failure and cost of running a mis-specified study. The requirements for precision in the definition of the question and the nature of the modelling exercise are paramount.

The key feature of time-to-failure data is the manner in which it is censored and how this biases traditional statistical summaries and estimation techniques. In this notebook we're going to focus on the prediction of failure times and compare the Bayesian notion of a calibrated prediction interval to some frequentist alternatives. We draw on the work in the book Statistical Methods for Reliability Data. For details (see {cite:t}Meeker2021)

Types of Prediction

We might want to know about different views of the failure time distribution such as:

Time to failure of a new item
Time until k failures in a future sample of m units

While there are non-parametric and descriptive methods that can be used to assess these kinds of question we're going to focus on the case where we have a probability model e.g. a lognormal distribution of failure times $F(t: \mathbf{\theta})$ parameterised by an unknown $\mathbf{\theta}$.

Structure of the Presentation

We will

Discuss non-parametric estimates of the cumulative density function CDF for reliability data
Show how a frequentist or MLE of the same function can be derived to inform our prediction interval
Show how Bayesian methods can be used to augment the analysis of the same CDF in cases with sparse information.

Throughout the focus will be how the understanding of the CDF can help us understand risk of failure in a reliability setting. In particular how the CDF can be used to derive a well calibrated prediction interval.

Example Failure Distribution

In the study of reliability statistics there is a focus on location-scale based distributions with long tails. In an ideal world we'd know exactly which distribution described our failure process and the prediction interval for the next failure could be defined exactly.

Estimation of the Failure Distribution from Data

In the real world we rarely have such exact knowledge. Instead we start with altogether less clear data. We will first examine failure data about heat exchanges across three plants and pool the information to quantify the lifetime of the heat-exchanges over the three factories.

The data is small deliberately so we can focus on the descriptive statistics involved in assessing time-to-failure data. In particular we'll estimate the empirical CDF and survival functions. We will then generalise this style of analysis to a larger data set afterwards.

Heat Exchange Data

Note on Censored Data: See below how the failure data flags whether or not an observation has been censored i.e. whether or not we have observed the full course of the life-time of the heat-exchanger. This is a crucial feature of failure time data. Too simple a statistical summary will be biased in its estimation of the prevalance of failure by the fact that our study has not seen out the full-course of every item's life-cycle. The most prevalent form of censoring is so called "Right censored" data where we have not seen the "failure" event for a subset of the observations. Our histories are incomplete due to prematurely ending the data collection.

Left censoring (where we don't observe an item from the beginning of their history) and interval censoring (both left and right censoring) can also occur but are less common.

It's worth taking some time to walk through this example because it establishes estimates of some key quantities in time-to-failure modelling.

First note how we're treating time as a series of discrete intervals. The data format is in discrete period format, since it records aggregate failures over time. We'll see below another format of failure data - the item-period format which records each individual item over all periods and their corresponding status. In this format the key quantities are the set of failed items and the risk_set in each period. Everything else is derived from these facts.

First we've established across all three companies in three consecutive years the number of heat-exchanges that were produced and subsequently failed. This provides an estimate of the probability of failure in the year: p_hat and its inverse 1-p_hat respectively. These are further combined over the course of the year to estimate the survival curve S_hat which can be further transformed to recover estimates of the cumulative hazard CH_hat and the cumulative density function F_hat.

Next we want a quick a dirty way to quantify the extent of the uncertainty in our estimate of the CDF. For this purpose we use greenwood's formula to estimate the variance of our V_hat of our estimate F_hat. This gives us the standard error and the two varieties of confidence interval recommended in the literature.

We' apply the same techniques to a larger dataset and plot some of these quantities below.

The Shock Absorbers Data: A Study in Frequentist Reliability Analysis

The shock absorbers data is in period format but it records a constantly decreasing risk set over time with one item being censored or failing at each time point i.e. removed from testing successfully (approved) or removed due to failure. This is a special case of the period format data.

Maximum Likelihood Fits for Failure Data

In addition to taking descriptive summaries of our data we can use the life-table data to estimate a univariate model fit to our distribution of failure times. Such a fit, if good, would enable us to have a clearer view of the predictive distribution a particular set of predictive intervals. Here we'll use the functions from the lifelines package to estimate the MLE fit on right-censored data.

Although it's tempting to take this model and run with it, we need to be cautious in the case of limited data. For instance in the heat-exchange data we have three years of data with a total of 11 failures. A too simple model can get this quite wrong. For the moment we'll focus on the shock-absorber data - its non-parametric description and a simple univariate fit to this data.

This shows a good fit to the data and implies, as you might expect, that the failing fraction of shock absorbers increases with age as they wear out. But how do we quantify the prediction given an estimated model?

The Plug-in-Procedure for calculating Approximate Statistical Prediction Intervals

Since we've estimated a lognormal fit for the CDF of the shock absorbers data we can plot their approximate prediction interval. The interest here is likely to be in the lower bound of the prediction interval since we as manufacturers might want to be aware of warranty claims and the risk of exposure to refunds if the lower bound is too low.

Bootstrap Calibration and Coverage Estimation

We want now to estimate the coverage implied by this prediction interval, and to do so we will bootstrap estimates for the lower and upper bounds of the 95% confidence interval and ultimately assess their coverage conditional on the MLE fit. We will use the fractional weighted (bayesian) bootstrap. We'll report two methods of estimating the coverage statistic - the first is the empirical coverage based on sampling a random value from within the known range and assess if it falls between the 95% MLE lower bound and upper bounds.

The second method we'll use to assess coverage is to bootstrap estimates of a 95% lower bound and upper bound and then assess how much those bootstrapped values would cover conditional on the MLE fit.

We can use these bootstrapped statistics to further calculate quantities of the predictive distribution. In our case we could use the parametric CDF for our simple parametric model, but we'll adopt the empirical cdf here to show how this technique can be used when we have more complicated models too.

Next we'll plot the bootstrapped data and the two estimates of coverage we achieve conditional on the MLE fit. In other words when we want to assess the coverage of a prediction interval based on our MLE fit we can also bootstrap an estimate for this quantity.

These simulations should be repeated a far larger number of times than we do here. It should be clear to see how we can also vary the MLE interval size to achieve the desired coverage level.

Bearing Cage Data: A Study in Bayesian Reliability Analysis

Next we'll look at a data set which has a slightly less clean parametric fit. The most obvious feature of this data is the small amount of failing records. The data is recorded in the period format with counts showing the extent of the risk set in each period.

We want to spend some time with this example to show how the frequentist techniques which worked well to estimate the shock-absorbers data can be augmented in the case of the Bearing cage data. In particular we'll show how the issues arising can be resolved with a Bayesian approach.

To estimate a univariate or non-parametric CDF we need to disaggregate the period format data into an item-period format.

Item Period Data Format

As we plot the empirical CDF we see that the y-axis only ever reaches as maximum height of 0.05. A naive MLE fit will go dramatically wrong in extrapolating outside the observed range of the data.

Probability Plots: Comparing CDFs in a Restricted Linear Range

In this section we'll use the technique of linearising the MLE fits so that can perform a visual "goodness of fit" check. These types of plots rely on a transformation that can be applied to the location and scale distributions to turn their CDF into a linear space.

For both the Lognormal and Weibull fits we can represent their CDF in a linear space as a relationship between the logged value t and an appropriate $CDF^{-1}$.

We can see here how neither MLE fit covers the range of observed data.

Bayesian Modelling of Reliability Data

We've now seen how to model and visualise the parametric model fits to sparse reliability using a frequentist or MLE framework. We want to now show how the same style of inferences can be achieved in the Bayesian paradigm.

As in the MLE paradigm we need to model the censored likelihood. For most log-location distributions we've seen above the likelihood is expressed as a function of a combination of the distribution pdf $\phi$ and cdf $\Phi$ applied as appropriately depending on whether or not the data point was fully observed in the time window or censored.

$L(\mu, \sigma) = \prod_{i = 1}^{n} \Bigg(\dfrac{1}{\sigma t_{i}} \phi\Bigg[ \dfrac{log(t_{i}) - \mu}{\sigma} \Bigg] \Bigg)^{\delta_{i}} \cdot \Bigg(1 - \Phi \Bigg[ \dfrac{log(t_{i}) - \mu}{\sigma} \Bigg] \Bigg)^{1-\delta}$

where $\delta_{i}$ is an indicator for whether the observation is a failure or a right censored observation. More complicated types of censoring can be included with similar modifications of the CDF depending on the nature of the censored observations.

Direct PYMC implementation of Weibull Survival

We fit two versions of this model with different specifications for the priors, one takes a vague uniform prior over the data, and another specifies priors closer to the MLE fits. We will show the implications of the prior specification has for the calibration of the model with the observed data below.

We can see here how the Bayesian uncertainty estimates driven by our deliberately vague priors encompasses more uncertainty than our MLE fit and the uninformative prior implies a wider predictive distribution for the 5% and 10% failure times. The Bayesian model with uninformative priors seems to do a better job of capturing the uncertainty in the non-parametric estimates of our CDF, but without more information it's hard to tell which is the more appropriate model.

The concrete estimates of failure percentage over time of each model fit are especially crucial in a situation where we have sparse data. It is a meaningful sense check that we can consult with subject matter experts about how plausible the expectation and range for the 10% failure time is for their product.

Predicting the Number of Failures in an Interval

Because our data on observed failures is extremely sparse, we have to be very careful about extrapolating beyond the observed range of time, but we can ask about the predictable number of failures in the lower tail of our cdf. This provides another view on this data which can be helpful for discussing with subject matters experts.

The Plugin Estimate

Imagine we want to know how many bearings will fail between 150 and 600 hours based of service. We can calculate this based on the estimated CDF and number of new future bearings. We first calculate:

$\hat{\rho} = \dfrac{\hat{F}(t_1) - \hat{F}(t_0)}{1 - \hat{F}(t_0)}$

to establish a probability for the failure occurring in the interval and then a point prediction for the number of failures in the interval is given by risk_set$\hat{\rho}$.

Applying the Same Procedure on the Bayesian Posterior

We'll use the posterior predictive distribution of the uniformative model. We show here how to derive the uncertainty in the estimates of the 95% prediction interval for the number of failures in a time interval. As we saw above the MLE alternative to this procedure is to generate a predictive distribution from bootstrap sampling. The bootstrap procedure tends to agree with the plug-in procedure using the MLE estimates and lacks the flexibility of specifying prior information.

The choice of model in such cases is crucial. The decision about which failure profile is apt has to be informed by a subject matter expert because extrapolation from such sparse data is always risky. An understanding of the uncertainty is crucial if real costs attach to the failures and the subject matter expert is usually better placed to tell if you 2 or 7 failures can be plausibly expected within 600 hours of service.

Conclusion

We've seen how to analyse and model reliability from both a frequentist and Bayesian perspective and compare against the non-parametric estimates. We've shown how prediction intervals can be derived for a number of key statistics by both a bootstrapping and a bayesian approach. We've seen approaches to calibrating these prediction intervals through re-sampling methods and informative prior specification. These views on the problem are complementary and the choice of technique which is appropriate should be driven by factors of the questions of interest, not some ideological commitment.

In particular we've seen how the MLE fits to our bearings data provide a decent first guess approach to establishing priors in the Bayesian analysis. We've also seen how subject matter expertise can be elicited by deriving key quantities from the implied models and subjecting these implications to scrutiny. The choice of Bayesian prediction interval is calibrated to our priors expectations, and where we have none - we can supply vague or non-informative priors. The implications of these priors can again be checked and analysed against an appropriate cost function.

Authors

Authored by Nathaniel Forde on 9th of January 2022 (pymc-examples#491)

References

:::{bibliography} :filter: docname in docnames :::

Watermark

:::{include} ../page_footer.md :::