Reliability analysis is an important branch of engineering. Poor product reliability can lead to a variety of problems, including customer dissatisfaction and excessive repair and service costs. These costs may of course fall to the manufacturer under various warranty provisions. On the other hand, having unnecessarily high reliability may also be unattractive. For example, consider a system that needs only one operable power supply in order to function correctly. Such a system may be provided with a secondary or backup power supply that can then be utilised should the original (primary) power supply fail. However, what if the secondary power supply were also to fail? One possibility is to include several backup power supplies, in case of multiple failures. However, the likelihood of such a multiple failure may be exceedingly small, so the provision of more than one backup power supply may not be economically worthwhile. In other words, the increased cost of the additional backup components may not be justified by the marginal increase in reliability thereby obtained.
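The diminishing return from additional backup power supplies can be illustrated with a short calculation. The following is a minimal sketch only: it assumes that each supply fails independently with the same probability, and both the function names and the failure probability of 0.01 are hypothetical values chosen for illustration, not figures taken from any real system.

```python
def system_reliability(unit_failure_prob: float, n_units: int) -> float:
    """Probability that at least one of n_units supplies is still working.

    Assumes the supplies fail independently and with equal probability,
    which is a simplification (common-cause failures are ignored).
    """
    if n_units < 1:
        raise ValueError("at least one power supply is required")
    return 1.0 - unit_failure_prob ** n_units


def marginal_gain(unit_failure_prob: float, n_units: int) -> float:
    """Increase in reliability obtained by adding the n-th supply."""
    if n_units == 1:
        return system_reliability(unit_failure_prob, 1)
    return (system_reliability(unit_failure_prob, n_units)
            - system_reliability(unit_failure_prob, n_units - 1))


if __name__ == "__main__":
    P_FAIL = 0.01  # hypothetical failure probability of a single supply
    for n in range(1, 5):
        print(f"{n} supplies: reliability={system_reliability(P_FAIL, n):.8f}, "
              f"gain from last supply={marginal_gain(P_FAIL, n):.2e}")
```

Under these assumptions each further supply improves reliability by roughly a factor of one hundred less than the one before it, whereas its cost stays about the same.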
Of course, the trade-off between reliability and cost will vary according to the particular circumstances. In safety-critical applications, such as aeroplanes, reliability is of the utmost importance. In contrast, for computing systems, reliability requirements typically vary according to the type of machine. Thus certain server machines may be vital to the operation of a business (such as to take orders, to process accounts, and so on), and are therefore expected to have 24×7 availability. In contrast, an organisation may well be prepared to tolerate the occasional failure of individual desktop machines.
Various methodologies have been developed for analysing and predicting reliability at the design stage. One established approach is Failure Modes, Effects and Criticality Analysis (FMECA), which is the subject of various formal standards, such as British Standard BS 5760, and US military standard US MIL STD 1629. In FMECA, likelihood of occurrence is normally quantified by a failure rate value, and a numerical value is assigned to the severity of each failure. Combining these two values then gives an indication of criticality, i.e. it identifies those components that are both important to the correct operation of the system and most likely to fail. Note that an individual component may have multiple failure modes, all of which need to be taken into consideration (for example, a tyre may burst, or its tread may become worn away).
FMECA includes studying the expected propagation of errors through the system. Following through the tyre example above, continued vehicle operation with a worn tyre may be temporarily tolerated, albeit with reduced safety margin, whereas a burst tyre may render the whole vehicle unusable (i.e. the failure has a high severity). This latter consideration therefore provides motivation for providing most vehicles with a spare tyre.
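The combination of failure rate and severity into a criticality figure can be sketched as follows. This is an illustrative sketch rather than the procedure laid down in any particular standard: it assumes that criticality is taken as the simple product of failure rate and severity, and the class name, the 1 to 10 severity scale and all numerical values are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    component: str
    mode: str
    failure_rate: float  # hypothetical failures per operating hour
    severity: int        # hypothetical scale: 1 (minor) to 10 (catastrophic)

    @property
    def criticality(self) -> float:
        # Assumption: criticality is the product of rate and severity.
        return self.failure_rate * self.severity


def rank_by_criticality(modes):
    """Return the failure modes ordered from most to least critical."""
    return sorted(modes, key=lambda m: m.criticality, reverse=True)


def component_criticality(modes):
    """Sum criticality over all failure modes of each component."""
    totals = defaultdict(float)
    for m in modes:
        totals[m.component] += m.criticality
    return dict(totals)


# Illustrative data only; none of these figures come from a real analysis.
MODES = [
    FailureMode("tyre", "burst", 2e-5, 9),
    FailureMode("tyre", "worn tread", 1e-4, 3),
    FailureMode("headlamp", "bulb failure", 5e-5, 2),
]

if __name__ == "__main__":
    for m in rank_by_criticality(MODES):
        print(f"{m.component:10s} {m.mode:14s} criticality={m.criticality:.2e}")
    print(component_criticality(MODES))
```

Summing over the failure modes of a component reflects the point that every mode has to be taken into consideration. With these particular figures, a frequent but mild mode (worn tread) ranks above a rare but severe one (burst), which shows how sensitive the ranking is to the chosen scales.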
Another design tool that is sometimes used in reliability studies is fault tree analysis. Fault tree analysis generally starts with various system level observations of difficulties (known as consequences or events), and then tries to trace back to the underlying causes, potentially through a whole tree of such causes. For example, a failure of a lamp to operate may reflect a broken bulb, or a problem with the power supply to the lamp. The problem with the power supply may perhaps in turn be due to a broken flex, or may instead possibly reflect a human or operator failure, such as nobody having plugged the lamp into an electric power socket.
This sort of analysis allows a fault tree for a given device or system to be constructed. One formal, quantitative approach to this analysis uses Boolean algebra, in which a probability may be assigned to each underlying cause. This data then allows the likelihood of various system failures to be estimated.
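A quantitative fault tree of this kind can be sketched for the lamp example. The sketch below assumes that all underlying causes are statistically independent, so that an OR gate and an AND gate reduce to simple products. The function names and every probability value are hypothetical and serve only to illustrate the calculation.

```python
from math import prod


def or_gate(*probs: float) -> float:
    """Probability that at least one input event occurs (independent events)."""
    return 1.0 - prod(1.0 - p for p in probs)


def and_gate(*probs: float) -> float:
    """Probability that all input events occur together (independent events)."""
    return prod(probs)


def lamp_failure_probability(p_bulb: float, p_flex: float,
                             p_unplugged: float) -> float:
    """Top event: the lamp fails to operate.

    The lamp fails if the bulb is broken OR there is a power supply
    problem; the power supply problem arises from a broken flex OR the
    lamp not having been plugged in.
    """
    p_power = or_gate(p_flex, p_unplugged)
    return or_gate(p_bulb, p_power)


if __name__ == "__main__":
    # Hypothetical probabilities for each underlying cause.
    p_top = lamp_failure_probability(p_bulb=0.02, p_flex=0.005,
                                     p_unplugged=0.01)
    print(f"P(lamp fails to operate) = {p_top:.6f}")

    # An AND gate models redundancy: both a primary and a backup
    # power supply must fail before power is lost.
    print(f"P(both supplies fail) = {and_gate(0.01, 0.01):.6f}")
```

Where causes are not independent, for example when a single power surge damages both the flex and the bulb, these simple products no longer hold and a more careful treatment is needed.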
Fault tree analysis and FMECA are generally regarded as complementing one another. Thus whereas FMECA may be considered as a “bottom-up” approach (starting at the component level and then determining the impact of component failures on the overall system), fault tree analysis is more of a “top-down” approach. Further details about FMECA and fault tree analysis are available in various textbooks such as: “Reliability Analysis for Engineers: An Introduction” by Roger D Leitch, Oxford Science Publications, 1995, ISBN 019 856371 X.
Although FMECA and fault tree analysis are well-established techniques, they are often seen purely as abstract design tools, somewhat disconnected from the real-world development of the product itself. Sometimes reliability analysis is performed merely as a “tick-in-the-box” requirement during the development phase, with only marginal relevance to the actual product. The reliability analysis is often then filed and forgotten about during the subsequent operational lifetime of the product.
Nevertheless, system reliability remains extremely important. This is especially true in the computing field, where system crashes, freezes, bugs and other failures are worryingly common. In the article “Self-Repairing Computers” by Fox and Patterson, Scientific American, June 2003, pages 42–48, various strategies for combating this unreliability are discussed, particularly for software.
One approach discussed in the article is to monitor the components involved in various operations on the system, and to use data mining techniques to determine, on a statistical basis, those (sub)components that may be responsible for any observed failures. This approach eschews the use of any prior knowledge about the system architecture in order to maximise flexibility. On the other hand, such a philosophy also makes the diagnosis problem much harder (if not impossible), and certainly more time-consuming.
Another approach discussed in the article is the provision of an “Undo” command to restore the system to an earlier, presumably correct, status. Unfortunately, this strategy is not effective against (persistent) hardware faults. In many situations, it may be difficult to ascertain whether a particular failure is caused by a software or hardware malfunction.
Thus although there are a variety of known strategies to improve reliability for computer systems, both at run-time and also through the design process, to date these have met with only limited success and application.