Standby vs. Spare parts: an example of integrated reliability and maintenance design

 

Introduction:

Asset availability depends both on the sub-systems’ reliability and on down-time due to failures. The problem is that system reliability and down-time belong to different disciplines: component reliability is under the responsibility of the reliability engineer while down-time is an issue addressed by maintenance and operations engineers.

A striking example of the inter-connection between reliability and maintenance is the choice between designing a system with standby redundancy, and replacing the redundancy with a spare parts maintenance policy. The similarities and differences between the two approaches will be explored in this paper in terms of availability and cost.

Example – pump:

We begin with a simple example. Consider an oil pump with a mean time between failures (MTBF) of 3 years (26,280 hours). When the pump fails, the mean time to repair (MTTR) is one week (168 hours). The pump availability is therefore MTBF / (MTBF + MTTR) = 99.365%, which means that the pump is unavailable for 2.3 days per year on average. To improve the situation, one can either design the system with a second pump on standby or keep a second pump nearby as a spare part.
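The single-pump figures can be checked with a few lines of Python. This is a minimal sketch using the standard steady-state availability ratio; the function and variable names are illustrative, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 8760  # assumption: 365-day year


def availability(mtbf_h: float, mttr_h: float) -> float:
    """Steady-state availability of a single repairable unit."""
    return mtbf_h / (mtbf_h + mttr_h)


mtbf_h = 3 * HOURS_PER_YEAR  # 26,280 hours
mttr_h = 7 * 24              # 168 hours

a_single = availability(mtbf_h, mttr_h)
downtime_days = (1 - a_single) * 365  # expected unavailable days per year

print(f"availability    = {a_single:.3%}")
print(f"annual downtime = {downtime_days:.1f} days")
```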

Standby scenario:

Initially the main pump is running and the backup pump is idle (cold standby). When the operating pump fails, the backup pump immediately takes over and the failed pump is sent to the repair shop (hot repair). If the repair finishes before the backup pump fails, the system returns to its initial state; otherwise the system is down until one of the pumps is repaired.

The scenario described above can be modeled as a renewal process for which a simple Markov chain diagram is given in Figure 1:

Figure 1: Markov chain diagram for renewal process. λ is the pump failure rate and μ is the single pump repair rate. Green node denotes the state in which one pump is working and one pump is in standby, the yellow node describes a state in which one pump is working and one pump is being repaired, and the red node describes a failed state in which both pumps are being repaired.

In many cases (including the example above) λ / μ << 1, so the renewal process can be approximated by a Poisson process, for which the steady-state availability is:

For the values presented above, the availability is 99.998% (mean annual downtime of 10.5 minutes), a significant improvement.
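The quoted standby availability can be reproduced by solving the three-state chain of Figure 1 directly. The sketch below is one reading of that diagram, not necessarily the paper's own formula: it assumes that in the failed (red) state both pumps are repaired in parallel, so the transition back to the yellow state occurs at rate 2μ. The function name is illustrative.

```python
def standby_availability(lam: float, mu: float) -> float:
    """Steady-state availability of a three-state standby chain.

    States: 0 = one pump running, one on standby   (green)
            1 = one pump running, one in repair    (yellow)
            2 = both pumps in repair, system down  (red)
    Assumed rates: 0->1 at lam, 1->0 at mu, 1->2 at lam, 2->1 at 2*mu.
    """
    rho = lam / mu
    # Balance equations of the birth-death chain, as unnormalised weights.
    w0 = 1.0
    w1 = rho           # pi1 = (lam / mu) * pi0
    w2 = rho**2 / 2    # pi2 = (lam / (2 * mu)) * pi1
    return (w0 + w1) / (w0 + w1 + w2)


lam = 1 / 26_280  # failure rate [1/h], MTBF of 3 years
mu = 1 / 168      # single-pump repair rate [1/h], MTTR of one week

a_standby = standby_availability(lam, mu)
downtime_min = (1 - a_standby) * 8760 * 60  # expected minutes per year

print(f"availability    = {a_standby:.3%}")
print(f"annual downtime = {downtime_min:.1f} min")
```

Under this assumption the availability rounds to 99.998%. The unrounded value corresponds to roughly 10.7 minutes of downtime per year; the 10.5 minutes quoted in the text is what the rounded 99.998% figure gives.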

 

Single spare scenario:

When the pump in the field fails, it is immediately replaced by the spare pump and the failed pump is sent to the repair shop. If the repair finishes before the spare pump fails, the system returns to its initial state; otherwise the system is down until one of the pumps is repaired.

 

Availability:

The two processes described above are almost identical. The only difference is that in the single spare scenario the backup pump waits in the storage room, whereas in the standby scenario it waits in the field.

In order to account for the difference between the cases, the pump replacement time should be added to the model. This is done as follows:

First define an effective repair rate: μ* such that the availability in Eq. 1 is

μ* is found by using Eqs. 1 and 2:

μ* represents the inverse of the mean down time following a pump failure. Next, define the pump replacement time t; the new availability A* is then:

The coefficient of t in Eq. 4 depends on the details of the pump replacement procedure and on the correspondingly extended Markov chain. Eq. 4 gives the expected availability of the pump system as a function of the pump replacement time. t is usually larger for the spare part case than for the standby case because of transportation, removal and assembly times. In terms of availability alone, a standby pump is therefore the better choice. However, one element has not been considered so far: cost.
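The effect of the replacement time on availability can be illustrated numerically. The sketch below is an assumption-laden reading of the idea, not the paper's Eqs. 2 to 4: it assumes the effective repair rate is defined through A = μ* / (λ + μ*), and that each pump failure adds `coeff * t` hours of downtime, where `coeff` is a hypothetical stand-in for the model-dependent coefficient mentioned above (set to 1 here).

```python
lam = 1 / 26_280     # pump failure rate [1/h], from the example
a_standby = 0.99998  # standby availability quoted in the text


def effective_repair_rate(lam: float, a: float) -> float:
    """Solve the assumed relation a = mu_eff / (lam + mu_eff) for mu_eff."""
    return lam * a / (1 - a)


def availability_with_replacement(lam: float, mu_eff: float, t: float,
                                  coeff: float = 1.0) -> float:
    """Availability when each failure costs 1/mu_eff + coeff * t hours.

    coeff is a hypothetical placeholder for the coefficient of t.
    """
    mean_down_h = 1 / mu_eff + coeff * t
    return 1 / (1 + lam * mean_down_h)


mu_eff = effective_repair_rate(lam, a_standby)
results = {t: availability_with_replacement(lam, mu_eff, t)
           for t in (0.0, 0.5, 2.0)}  # replacement times in hours

for t, a_star in results.items():
    print(f"t = {t:3.1f} h  ->  A* = {a_star:.4%}")
```

With t = 0 the sketch returns the original availability, and availability falls as t grows, which is the qualitative behaviour the text describes.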

Cost:

A high cost is usually incurred for every hour of system down-time. The total down time over the life cycle is:

where t_down is the down time and t_life is the lifecycle period. Other cost factors for the standby scenario arise from the need for parallel piping, power supplies and additional floor space, while the spare part scenario incurs storage and packaging expenses.
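The down-time relation itself is not reproduced in this text. A form consistent with the definitions given, offered here as an assumption rather than as the original equation, is that the expected down time equals the unavailability multiplied by the lifecycle period, and that the down-time cost scales linearly with it. A is the steady-state availability from the previous sections; c_down (cost per hour of down-time) and C_down (total down-time cost) are symbols introduced only for this illustration.

```latex
t_{\mathrm{down}} = (1 - A)\, t_{\mathrm{life}},
\qquad
C_{\mathrm{down}} = c_{\mathrm{down}}\, t_{\mathrm{down}}
```

For the single pump with A = 99.365% and a 20-year lifecycle (175,200 hours), this would give roughly 1,113 hours of down time.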

When down-time is very costly, a standby solution is usually preferred. Indeed, a standby design is used in many oil refineries, remote water supply stations and other critical systems.

The advantage of using spare pumps instead of standby units becomes apparent when many identical systems use a shared stock. Then fewer pump units have to be purchased. This gives a substantial financial saving.

Example – 10 pumps:

Consider a line with 10 pumps in series. In order to maintain a high availability, two possibilities are considered:

 

Standby scenario:

Assume that a standby pump is added to each pump, for a total of 20 pumps. Furthermore, assume that the pump switching time upon failure is negligible.

The main costs are: single pump cost of 500,000$, single pump repair cost of 5,000$, and downtime damage of 20,000$ per hour. The total cost for a lifecycle of 20 years was calculated using the apmOptimizer software to be: 11,373,720$ with a line availability of 99.979%.
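As a rough cross-check of the line availability, the ten standby pairs can be treated as independent units in series, so that the line availability is the product of the pair availabilities. This is only a sketch: it reuses a three-state Markov model with parallel repair at rate 2μ in the failed state (an assumption), and it is not the calculation performed by the apmOptimizer software, whose model is not described here.

```python
lam = 1 / 26_280  # pump failure rate [1/h]
mu = 1 / 168      # single-pump repair rate [1/h]
n_pairs = 10      # pump positions in series, each with a standby unit

rho = lam / mu
# Unavailability of one standby pair (assumed chain: 0 <-> 1 <-> 2,
# with repair from the failed state at rate 2 * mu).
u_pair = (rho**2 / 2) / (1 + rho + rho**2 / 2)

# Series system: the line is up only when every pair is up.
a_line = (1 - u_pair) ** n_pairs

print(f"pair unavailability = {u_pair:.3e}")
print(f"line availability   = {a_line:.3%}")
```

The estimate comes out at about 99.98%, in line with the reported 99.979%; exact agreement is not expected from such a simplified model.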

 

Spares scenario:

Instead of the 10 standby pumps, 2 spare pumps are put in storage (total of 12 pumps). The stock of two spare pumps is shared by all the pumps in the field. The pump switching time is assumed to be 2 hours.

The main costs are: single pump cost of 500,000$, single pump repair cost of 5,000$, and downtime damage of 20,000$ per hour. The total cost for a lifecycle of 20 years was calculated using the apmOptimizer software to be: 8,997,945$ with an availability of 99.924%.
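The spares figures can be approximated with a back-of-envelope calculation. The sketch below is not the apmOptimizer model. It assumes that stock-outs are negligible with two spares, that every pump failure stops the line for exactly the 2-hour switching time, that each pump fails once per MTBF of calendar time, and that total cost is simply purchase plus repair plus down-time cost. All variable names are illustrative.

```python
HOURS_PER_YEAR = 8760
life_h = 20 * HOURS_PER_YEAR  # 20-year lifecycle

n_pumps, n_spares = 10, 2
mtbf_h = 26_280               # 3 years
switch_h = 2.0                # pump switching time [h]
pump_cost = 500_000
repair_cost = 5_000
downtime_cost_per_h = 20_000

# Expected number of failures on the line over the lifecycle.
failures = n_pumps * life_h / mtbf_h
# Assumption: each failure stops the line for the switching time only.
downtime_h = failures * switch_h
a_line = 1 - downtime_h / life_h

total_cost = ((n_pumps + n_spares) * pump_cost
              + failures * repair_cost
              + downtime_h * downtime_cost_per_h)

print(f"line availability = {a_line:.3%}")
print(f"lifecycle cost    = {total_cost:,.0f}")
```

Under these assumptions the estimate lands at about 99.924% availability and a lifecycle cost of roughly 9.0 million, close to the reported figures. This suggests that switching time, rather than stock-outs, dominates the down time in this scenario.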

 

Comparison:

The standby design gives higher availability than the spares design, but its lifecycle cost is higher by more than 2,375,000$. The optimal number of spares for the spares scenario is 2: fewer spares incur a high penalty due to low availability, while additional spares (3 or more) give a negligible availability improvement.
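The claim that two spares is the optimum can be illustrated with a standard spare-parts approximation, used here in place of the software's unpublished model. Assuming Poisson failures, the number of pumps in the repair shop is taken to be Poisson distributed with mean (number of pumps × MTTR / MTBF), as in Palm's theorem. The expected number of backorders is then used as a rough proxy for the fraction of time the line waits for a spare. Only the costs that vary with the stock level are included, and all names are illustrative.

```python
import math


def expected_backorders(mean_in_repair: float, spares: int,
                        terms: int = 60) -> float:
    """Expected number of pump positions waiting for a spare, assuming the
    number of pumps in repair is Poisson with the given mean."""
    p = math.exp(-mean_in_repair)  # P(N = 0)
    ebo = 0.0
    for n in range(1, spares + terms):
        p *= mean_in_repair / n    # P(N = n) from P(N = n - 1)
        if n > spares:
            ebo += (n - spares) * p
    return ebo


n_pumps, mtbf_h, mttr_h = 10, 26_280, 168
life_h = 20 * 8760
pump_cost, downtime_cost_per_h = 500_000, 20_000

mean_in_repair = n_pumps * mttr_h / mtbf_h  # average pumps in the repair shop


def stock_cost(spares: int) -> float:
    """Spare purchase cost plus expected stock-out down-time cost."""
    stockout_h = expected_backorders(mean_in_repair, spares) * life_h
    return spares * pump_cost + stockout_h * downtime_cost_per_h


costs = {s: stock_cost(s) for s in range(5)}
best = min(costs, key=costs.get)

for s, c in costs.items():
    print(f"{s} spares: {c:,.0f}")
print("lowest-cost stock level:", best)
```

In this sketch the cost is lowest at two spares: a third spare costs more to buy than the stock-out down time it avoids, while dropping to one spare makes stock-outs expensive.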

 

Conclusion:

In this paper we discussed the similarities and differences between using standby units and keeping spare units. This is an example of the connection between redundancy design, which is usually the job of the reliability engineer, and maintenance policy, which is classically set by the maintenance engineer.

The examples above demonstrate the need for both reliability and maintenance to be considered as early as the asset design stage. New design tools such as BQR’s CARE and apmOptimizer software suites can greatly assist in such a process.