Fault Tree Analysis: How accurate is it?

 

Introduction:

When conducting Fault Tree Analysis (FTA), small probabilities matter. For example, according to ARP 4761 and AC 23.1309-1E (Federal Aviation Administration, U.S. Department of Transportation), a catastrophic event, i.e. “failure conditions which would prevent continued safe flight and landing”, should have a probability smaller than 10^-9 per flight hour. Similar requirements also exist in the rolling stock industry.
FTA calculations are routinely performed using Excel sheets and dedicated FTA software. This raises the question: how accurate are these calculations?
There is reason to believe (Keisner, A. 2003) that different FTA software packages provide different results, which might lead to severe safety events.
The objective of this paper is to compare various probability calculation methods, identify potential problems, and present the solutions.
A known issue with FTA is the question of truncation, i.e. when to terminate the summation of minimal cut-set probabilities. This issue has been treated in the literature (Čepin, M. 2005; Epstein, S. & Rauzy, A. 2005).
In this paper we present several cases where a naïve implementation of the probability equations for a single logical gate results in serious computer-generated errors.
The first case we discuss is the OR gate for which a simple solution exists. The second and third cases are dynamic gates (Standby and AND-priority) which pose a more formidable challenge. We discuss the reasons for the computational errors, as well as the solutions for the problems.

 

OR gate:

According to Standard IEC 61025 (IEC 61025, 2006), the failure probability F(t) up to time t of an OR gate is given by:

$$F(t) = 1 - \prod_{i=1}^{N} \left(1 - F_i(t)\right) \qquad (1)$$

 

where F_i(t) is the occurrence probability of sub-event i up to time t. When values of F_i(t) become small, of order 10^-16, a standard calculation (using Excel, Matlab, or other standard programs and languages) gives incorrect results. The reason is as follows:
Eq. 1 includes terms of the form 1 - F_i(t). These terms mix numbers of order 1 and of order 10^-16, i.e. more than 16 significant digits are required to represent the result accurately. Standard computer calculations use double-precision numbers, which carry slightly fewer than 16 significant decimal digits. Therefore, terms of the form 1 - F_i(t) are not calculated correctly.
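The rounding described above can be observed directly. The following is a minimal sketch in Python (chosen here for illustration; the paper's own results were obtained in Matlab), assuming IEEE 754 double precision. The variable names are illustrative only.

```python
import sys

# Spacing between 1.0 and the next larger double (about 2.2e-16).
# Just below 1.0 the spacing is half of that, i.e. 2**-53 (about 1.1e-16).
eps = sys.float_info.epsilon

F_i = 1e-16                    # a small sub-event probability
complement = 1.0 - F_i         # rounded to the nearest representable double
recovered = 1.0 - complement   # the value of F_i that survives the rounding

print("machine epsilon :", eps)
print("recovered F_i   :", recovered)
print("relative error  :", abs(recovered - F_i) / F_i)
```

The recovered value is a multiple of the double-precision spacing near 1 rather than the original 10^-16, which is the origin of the roughly 11% deviation in the OR gate example.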
One possible solution is to compute the probability with non-standard (extended) precision; however, such methods consume much more memory and time.
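As an illustration of the extended-precision approach, the sketch below evaluates the product form of the OR gate with Python's standard `decimal` module. The choice of 50 significant digits is an assumption made for this example, not a recommendation from the paper.

```python
from decimal import Decimal, getcontext

# Assumption for this sketch: 50 significant decimal digits are enough
# to hold every intermediate value of this three-event example exactly.
getcontext().prec = 50

probs = [Decimal("1e-16")] * 3   # three sub-events, each with probability 1e-16

survive = Decimal(1)
for p in probs:
    survive *= (1 - p)           # product of (1 - F_i), no rounding at this precision

F_exact = 1 - survive
print(F_exact)                   # very close to 3e-16
```

The result agrees with the expected value of about 3·10^-16, at the cost of software-emulated arithmetic instead of hardware doubles.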
Eq. 1 can be computed accurately when the terms in Eq. 1 are written explicitly according to the Sylvester–Poincaré expansion (NUREG-0492 1981). For example, consider a case with three sub-events. Expanding Eq. 1 gives:

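$$F(t) = F_1(t) + F_2(t) + F_3(t) - F_1(t)F_2(t) - F_1(t)F_3(t) - F_2(t)F_3(t) + F_1(t)F_2(t)F_3(t) \qquad (2)$$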

 

 

 

 

Note that in Eq. 2 the contributions of order 1 canceled out. Therefore, the computation error is avoided.
Example: consider three events, each with a probability of 10^-16. A simple implementation of Eq. 1 in Matlab yields F(t) = 3.330669073875470e-016.
An implementation of Eq. 2 gives F(t) = 2.999999999999999e-016. In this case Eq. 1 deviates by 11% from the correct result. Next, more complicated cases are presented.
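The comparison above can be reproduced with a short sketch. It assumes that Eq. 1 is the usual product form, 1 minus the product of (1 - F_i), and that Eq. 2 is its inclusion-exclusion (Sylvester–Poincaré) expansion. The function names `or_gate_naive` and `or_gate_expanded` are hypothetical and written for this example.

```python
from itertools import combinations
from math import prod

def or_gate_naive(probs):
    """Product form (assumed Eq. 1): 1 - prod(1 - F_i). Loses small F_i to rounding."""
    survive = 1.0
    for p in probs:
        survive *= (1.0 - p)
    return 1.0 - survive

def or_gate_expanded(probs):
    """Inclusion-exclusion form (assumed Eq. 2): the order-1 terms never appear."""
    total = 0.0
    for k in range(1, len(probs) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        # Sum of the products of all k-element subsets of the probabilities.
        total += sign * sum(prod(c) for c in combinations(probs, k))
    return total

probs = [1e-16, 1e-16, 1e-16]
print("naive    :", or_gate_naive(probs))
print("expanded :", or_gate_expanded(probs))
```

Note that the explicit expansion has 2^N - 1 terms, so this sketch is only practical for gates with a modest number of inputs.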

 

STANDBY GATE:

The Standby gate is used to describe a system with backup units. Here the case of no-recovery is discussed. Example: Consider a satellite CPU. In order to achieve high reliability for the satellite mission period, several backup CPUs are installed. When the CPU fails, one of the backup units replaces it. Critical failure occurs when the primary and all backup CPUs fail. The failure probability for N identical units (one primary and N-1 backups) is:

(3)

 

 

Where f is the single CPU failure distribution function. When f is exponential with a constant failure rate λ, an analytic solution for Eq. 3 can be found by applying Laplace and inverse Laplace transforms:

 

$$F(t) = 1 - e^{-\lambda t} \sum_{k=0}^{N-1} \frac{(\lambda t)^k}{k!} \qquad (4)$$

 

 

Eq. 4 reveals a potential problem: the 1st term of Eq. 4 is 1. When λ·t<<1, the contribution to Eq. 4 from the 2nd term with k=0 is very close to -1, i.e. the two terms almost cancel each other. In such cases erroneous calculation results can occur.
In order to solve this problem, an alternative form of Eq. 4 is required. Note that the sum in Eq. 4 can be written as:

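$$\sum_{k=0}^{N-1} \frac{(\lambda t)^k}{k!} = e^{\lambda t} - \sum_{k=N}^{\infty} \frac{(\lambda t)^k}{k!} \qquad (5)$$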

 

 

 

Using Eq. 5, a new form for Eq. 4 is found:

$$F(t) = e^{-\lambda t} \sum_{k=N}^{\infty} \frac{(\lambda t)^k}{k!} \qquad (6)$$

 

 

Note that Eq. 6 no longer has the nearly canceling terms of order 1. The lowest-order term in Eq. 6 is of order (λ·t)^N. In order to use Eq. 6, a stopping condition is required for the infinite sum. The stopping condition depends on the required accuracy.
Example: consider a system with 4 identical units in standby configuration. Furthermore, assume that λ·t = 10^-5.

Implementing Eq. 4 in Matlab yields a negative result, -2.220446049250313e-016, whereas Eq. 6 gives 4.166633333472224e-022. A negative result is “good” in the sense that the error is immediately detected. If, however, λ·t = 2·10^-5, then implementing Eq. 4 in Matlab yields 1.110223024625157e-016, whereas Eq. 6 gives 6.666560000888887e-021. In this case the direct use of Eq. 4 gives a result that is wrong by more than 4 orders of magnitude, yet the error is difficult to detect. This could lead to the unnecessary addition of spare units.
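The two forms can be compared with the following sketch. It assumes that Eq. 4 is 1 - exp(-λt)·Σ_{k=0}^{N-1} (λt)^k/k! and that Eq. 6 is exp(-λt)·Σ_{k=N}^{∞} (λt)^k/k!, which is consistent with the values quoted above. The function names, the relative stopping tolerance `rel_tol`, and the term limit `max_terms` are assumptions made for this example. The naive result depends on the platform's rounding, so it may differ from the Matlab figures.

```python
from math import exp, factorial

def standby_naive(n_units, x):
    """Assumed Eq. 4 with x = lambda*t: suffers cancellation when x << 1."""
    return 1.0 - exp(-x) * sum(x**k / factorial(k) for k in range(n_units))

def standby_tail(n_units, x, rel_tol=1e-17, max_terms=1000):
    """Assumed Eq. 6 with x = lambda*t: sums the tail of the exponential series."""
    term = x**n_units / factorial(n_units)   # lowest-order term, (lambda*t)^N / N!
    total = 0.0
    k = n_units
    for _ in range(max_terms):
        total += term
        k += 1
        term *= x / k                        # next term: x^k / k!
        if term <= rel_tol * total:          # stopping condition (illustrative choice)
            break
    return exp(-x) * total

for x in (1e-5, 2e-5):
    print(x, standby_naive(4, x), standby_tail(4, x))
```

For λ·t = 10^-5 and 2·10^-5 the tail form should agree with the Eq. 6 values quoted above, while the naive form returns a value dominated by rounding error.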

 

AND – PRIORITY GATE:

The AND – Priority (ANDp) event occurs when all the sub-events occur in a specific order. ANDp events commonly appear in the FTA of systems that have several protection layers. The ANDp gate was first introduced in 1976 by Fussell et al. (Fussell, J. et al. 1976).
Example: A gun will fire unintentionally only if the following events take place in sequence:
- The magazine is inserted into the gun
- The gun is loaded
- The trigger is pulled
Consider the case of N sub events, each having an occurrence rate λi. The occurrence probability for the ANDp gate is given by a convolution of the form:

(7)

 

 

where fi(t) are exponential failure distributions, each having a failure rate λi. The convolution of Eq. 7 can be solved by using the Laplace transform:
Define an additional failure rate λ_(N+1) = 0.
Define coefficients ui such that:

(8)

 

 

 

Then F(t) is given by:

(9)

 

 

Eq. 9 is an exact solution of the convolution in Eq. 7. However, there are cases in which computation of Eq. 9 gives incorrect results:
When sub-event probabilities are small, significant deviations from the expected value occur. Consider the simple case of two sub events (N=2). In this case, Eq. 9 reduces to:

(10)

 

 

 

 

Eq. 10 includes addition and subtraction of numbers with very close values, which is the cause of the computation problems. This becomes clearer when the Taylor expansion of the exponents in Eq. 10 is taken:

(11)

 

 

 

The 0th and 1st order terms of the Taylor expansion cancel out. While the cancellation is easy to identify analytically, the numerical cancellation (by computer) is not exact, and errors may arise in the computation. Therefore, Eq. 11 gives better calculated accuracy compared to Eq. 10 when small probabilities exist.

In order to generalize the conclusion to the case of N sub events, Eq. 9 is given in a different form (see appendix for proof of equivalence):

(12)

 

 

 

 

Taylor expansion of the exponents in Eq. 12 gives:

(13)

 

 

 

 

From Eq. 13 it is clear that the contributions for m = 0, 1, ..., N-1 cancel out (due to the linear dependence of rows in the determinant). Therefore, an accurate calculation of F(t) when small probabilities exist requires a Taylor expansion which starts at order N.
Two more improvements can be applied to Eq. 13:
Notice that the 2nd row of the matrix in Eq. 13 is a row of ones. Furthermore, the last column of the matrix in Eq. 13 is full of zeros (with one exception). This allows for a dimensional reduction of the matrix:

(14)

 

 

 

 

The denominator of Eq. 14 includes terms of the form (u_j - u_k), so care should be taken when the values of u_j and u_k are similar. In that case, use the following equation:

(15)

 

 

 

 

Eq. 15 is much more robust compared to Eq. 9.
Further complications may arise in the special case where columns of the matrix in Eq. 15 are very close in value. The way to overcome these problems is similar in spirit to the methods which were shown in this section.
Example: Consider the simple case of two similar events for which λ1·t = 10^-8 and λ2·t = 10^-8. The occurrence probability of each event is 10^-8; therefore, the probability of both events occurring with event 1 taking place before event 2 is 5·10^-17. Implementing Eq. 10 in Matlab yields 8.271806125530277e-17, i.e. a deviation of 65% from the correct result. Implementation of Eq. 15 gives the correct result.
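The two-event case can be sketched as follows. The formulas here are not copied from Eqs. 10 to 15; they are derived independently for this example from the convolution of two exponential densities, under the assumption that this is what Eq. 7 describes for N=2. With a = λ1·t and b = λ2·t, the closed form is 1 + (b·e^(-a) - a·e^(-b))/(a - b), and its Taylor expansion is a·b·Σ_{m≥2} (-1)^m h_(m-2)(a, b)/m!, where h_n(a, b) = Σ_{j=0}^{n} a^j·b^(n-j). The function names and the fixed number of series terms are hypothetical.

```python
from math import exp

def andp2_closed_form(a, b):
    """Closed form for N=2 (derived for this sketch); requires a != b.

    Adds and subtracts numbers of order 1, so it loses accuracy when a, b << 1.
    """
    return 1.0 + (b * exp(-a) - a * exp(-b)) / (a - b)

def andp2_series(a, b, terms=30):
    """Taylor series starting at order 2; no (a - b) denominator, so a == b is allowed."""
    total = 0.0
    h = 1.0        # h_0(a, b)
    b_pow = 1.0    # b**0
    fact = 2.0     # 2!
    sign = 1.0
    for m in range(2, 2 + terms):
        total += sign * h / fact
        # Advance all running quantities from order m to order m + 1.
        b_pow *= b
        h = a * h + b_pow      # recurrence h_n = a*h_(n-1) + b^n
        fact *= (m + 1)
        sign = -sign
    return a * b * total

print("series, a = b = 1e-8     :", andp2_series(1e-8, 1e-8))
print("series, a=1e-8, b=2e-8   :", andp2_series(1e-8, 2e-8))
print("closed form, same inputs :", andp2_closed_form(1e-8, 2e-8))
```

For a = b = 10^-8 the series gives approximately 5·10^-17, matching the expected value in the example. The closed form in this sketch is shown with slightly different rates because it divides by (a - b).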

Fig. 1 presents the fault tree diagram of the case described above using BQR FTA software (BQR FTA user manual):

 

Figure 1. Fault tree diagram of an AND-priority gate with two child events.

 

DISCUSSION AND CONCLUSIONS:

Probability calculations of OR, Standby, and AND – priority gates were examined. It was found that naïve implementation of probability equations can lead to computation errors, and even negative results.
It was shown that accurate computation can be achieved by using alternative representations of the probability equations.

BQR’s FTA software offers high accuracy, flexibility and calculation speed.

 

APPENDIX

Proof of equivalence between Eqs. 9 and 12:
Beginning with Eq. 12 and expanding the determinant according to the Leibniz formula one obtains:

(A1)

 

 

 

 

The determinants in Eq. A1 are of the form of Vandermonde determinants, therefore:

(A2)

 

 

 

 

Eq. A2 is simplified by noticing the similar terms in the numerator and denominator:

(A3)

 

 

 

 

 

Further simplification is obtained by considering the cases where j=i and k=i:

(A4)

 

 

 

 

Replacing (uj-ui) with (ui-uj) and accounting for the sign changes yields:

(A5)

 

 

 

 

 

And finally, Eq. 9 is recovered:

(A6)

 

 

 

 

REFERENCES

AC 23.1309-1E, Federal Aviation Administration, U.S. Department of Transportation

 

ARP 4761, Guidelines and Methods for Conducting the Safety Assessment Process on Civil Airborne Systems and Equipment, SAE International

 

BQR FTA user manual, www.bqr.com/products/care/fta-fault-tree-analysis/

 

Čepin, M. (2005), Analysis of truncation limit in probabilistic safety assessment. Reliability Engineering and System Safety 87, 395-403

 

Epstein, S. & Rauzy, A. (2005). Can we trust PRA? Reliability Engineering and System Safety 88, 195-205

 

Fussell, J., Aber, E. and Rahl, R. (1976). On the quantitative analysis of priority-AND failure logic. IEEE Transactions on Reliability R-25(5), 324–326

 

IEC 61025, 2006, Fault Tree Analysis (FTA), International Electrotechnical Commission

 

Keisner A., 2003, Reliability Analysis Technique Comparison, as Applied to the Space Shuttle, Space Systems Design Laboratory, Georgia Tech, 35

 

NUREG-0492, 1981, Fault Tree Handbook, Systems and Reliability Research, Office of Nuclear Regulatory Research, U.S. Nuclear Regulatory Commission, VI-4