Common Cause Failures, more common than you may think
Introduction
Critical systems are usually designed with high redundancy and fault tolerance in order to prevent critical failures. The biggest enemy of redundancy is Common Cause Failure (CCF).
CCF is defined as failures of multiple items, which would otherwise be considered independent of one another, resulting from a single cause [1].
CCF events are usually rare, but their effect may be severe. Common cause analysis is therefore an important part of safety analysis, and is required by certain standards, e.g. the railway RAMS standard EN 50126 [2].
CCFs are more common than you may think. The following common cause events appear in many systems:
- A power failure may cause shutdown of many electrical sub-systems. Although the sub-systems did not fail themselves, they are unable to fulfill their required functionality, and therefore should be considered as failed for the analysis.
- A failure of a network communication switch may prevent many sub-systems from sending or receiving critical information. This may render those sub-systems useless.
Analysis
In simple cases it is possible to account for CCF using standard Fault Tree Analysis (FTA) gates, but in other cases, more complex analysis is required.
Simple Case
Consider a power supply that feeds a server and a network communication switch. Both the server and the switch are required for system operation, so failure of either one causes a system failure. Clearly, failure of the power supply also causes a system failure. Therefore, a simple fault tree can be used: the top event, system failure, is the output of an OR gate whose inputs are power supply failure, server failure, and switch failure.
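As a minimal sketch of how such an OR gate is evaluated: if the three basic events are assumed to be statistically independent, the system fails unless all three items survive. The function name or_gate and the failure probabilities below are hypothetical and chosen only for illustration.

```python
def or_gate(*probabilities):
    """Probability that at least one of several independent events occurs."""
    all_survive = 1.0
    for p in probabilities:
        all_survive *= 1.0 - p  # probability that this item does NOT fail
    return 1.0 - all_survive


# Hypothetical failure probabilities over some mission time (illustration only).
p_power = 0.01
p_server = 0.02
p_switch = 0.005

p_system = or_gate(p_power, p_server, p_switch)
print(f"System failure probability: {p_system:.6f}")
```

Because the power supply is modeled as its own basic event that appears only once in the tree, no special treatment is needed here.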
Not So Simple Case
Consider the case of two servers and two data storage devices at two separate sites (one server and one storage device at each site).
Communication exists between the two sites, and they mirror each other:
The system can function in the following cases:
- Server 1 and Storage 1 are active
- Server 2 and Storage 2 are active
- Server 1 and Storage 2 are active
- Server 2 and Storage 1 are active
Ignoring the power sources, a simple fault tree can be used in this case as well:
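The four working combinations listed above amount to requiring at least one working server and at least one working storage device. The sketch below assumes that the corresponding fault tree is an OR gate over two AND gates (both servers failed, or both storage devices failed) and checks over all 16 states that this tree agrees with the listed cases. The function names are illustrative and not taken from any tool.

```python
from itertools import product


def works_by_cases(server1, server2, storage1, storage2):
    """System works if any of the four listed server/storage pairs is active."""
    return ((server1 and storage1) or (server2 and storage2)
            or (server1 and storage2) or (server2 and storage1))


def fails_by_tree(server1, server2, storage1, storage2):
    """Assumed fault tree: OR gate over two AND gates."""
    both_servers_failed = (not server1) and (not server2)
    both_storages_failed = (not storage1) and (not storage2)
    return both_servers_failed or both_storages_failed


# True means the block is working, False means it has failed.
states = list(product([True, False], repeat=4))

# The tree must report failure exactly when none of the listed cases holds.
tree_matches_cases = all(
    fails_by_tree(*state) != bool(works_by_cases(*state)) for state in states
)
failure_count = sum(fails_by_tree(*state) for state in states)
print(tree_matches_cases, len(states), failure_count)
```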
However, a failure of Power 1 causes both Server 1 and Storage 1 to fail, and a failure of Power 2 causes both Server 2 and Storage 2 to fail. The fault tree that accounts for the power sources is as follows:
*Fault Tree images taken from BQR’s Fault Tree Analysis software.
Note that the event “Power 1 failure” appears twice in the diagram. Usually each end node in a fault tree represents an independent event, but in this case the two “Power 1 failure” nodes represent the same event. Similarly, “Power 2 failure” appears twice in the diagram.
There are six blocks in the system (two power sources, two servers and two storage devices). Each block is either working or failed, so there are 2^6 = 64 possible system states. Of these 64 states, 47 are system failure states.
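The count can be reproduced by brute force. The sketch below assumes that each block is either up or down, that a server or storage device is available only when the power source at its own site is up, and that the system works when at least one server and at least one storage device are available. All names are illustrative.

```python
from itertools import product

BLOCKS = ("power1", "power2", "server1", "server2", "storage1", "storage2")


def system_fails(up):
    """up maps each block name to True (working) or False (failed)."""
    # A device is available only if it works AND its site's power works.
    server1 = up["server1"] and up["power1"]
    storage1 = up["storage1"] and up["power1"]
    server2 = up["server2"] and up["power2"]
    storage2 = up["storage2"] and up["power2"]
    return not ((server1 or server2) and (storage1 or storage2))


# Enumerate every combination of working/failed blocks.
states = [dict(zip(BLOCKS, bits))
          for bits in product([True, False], repeat=len(BLOCKS))]
failure_states = [state for state in states if system_fails(state)]

print(len(states), len(failure_states))  # 64 47
```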
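Because the repeated power events make different branches of the tree statistically dependent, the failure probability cannot be obtained by combining gate probabilities as if every end node were independent. To calculate the failure probability in this case, the failure logic must first be rewritten as a set of mutually exclusive (disjoint) terms, a process known as disjointing [3].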
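One way to obtain mutually exclusive terms is to condition on the shared events. This is a sketch under stated assumptions, not necessarily the procedure described in [3] or the one used by any particular tool. It assumes that all six blocks fail independently, that both power sources share one failure probability, and likewise for the servers and the storage devices. All function names and numbers are hypothetical. The result is checked against a direct sum over all 64 states and compared with a naive evaluation that wrongly treats each repeated power node as a separate event.

```python
from itertools import product


def exact_failure_probability(q_power, q_server, q_storage):
    """Condition on the power sources; the four scenarios are disjoint."""
    up = 1.0 - q_power
    # Both sites powered: fail if both servers or both storages fail.
    fail_both_up = 1.0 - (1.0 - q_server ** 2) * (1.0 - q_storage ** 2)
    # One site powered: its server and its storage must both work.
    fail_one_up = 1.0 - (1.0 - q_server) * (1.0 - q_storage)
    return (up * up * fail_both_up
            + 2.0 * up * q_power * fail_one_up
            + q_power * q_power)  # no power at all: certain failure


def naive_failure_probability(q_power, q_server, q_storage):
    """Incorrect: treats every power node in the tree as independent."""
    server_down = 1.0 - (1.0 - q_server) * (1.0 - q_power)
    storage_down = 1.0 - (1.0 - q_storage) * (1.0 - q_power)
    return 1.0 - (1.0 - server_down ** 2) * (1.0 - storage_down ** 2)


def brute_force_failure_probability(q_power, q_server, q_storage):
    """Sum the probabilities of all failed states among the 64."""
    q = (q_power, q_power, q_server, q_server, q_storage, q_storage)
    total = 0.0
    for state in product([True, False], repeat=6):
        p1, p2, s1, s2, d1, d2 = state
        works = ((s1 and p1) or (s2 and p2)) and ((d1 and p1) or (d2 and p2))
        if not works:
            weight = 1.0
            for is_up, q_i in zip(state, q):
                weight *= (1.0 - q_i) if is_up else q_i
            total += weight
    return total


# Hypothetical failure probabilities, for illustration only.
args = (0.01, 0.02, 0.03)
exact = exact_failure_probability(*args)
naive = naive_failure_probability(*args)
brute = brute_force_failure_probability(*args)
print(f"exact={exact:.6e} brute force={brute:.6e} naive={naive:.6e}")
```

The conditioned result matches the state enumeration, while the naive evaluation gives a different number. That difference is the error introduced by ignoring the common cause.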
BQR’s FTA software accounts for CCFs as well as nested CCFs (common cause that appears inside another common cause).
For more information regarding BQR software and/or professional services, please contact info@bqr.com.
Bibliography
[1] IEC 60050, International Electrotechnical Vocabulary.
[2] EN 50126:2017, Railway Applications: The Specification and Demonstration of Reliability, Availability, Maintainability and Safety (RAMS), Generic RAMS Process.
[3] IEC 61025:2007 Fault tree analysis (FTA).