1. Field of the Invention
The invention relates to a method for selecting and responding to faults in a fault processing apparatus and, in particular, to a method for determining and responding to faults which can cause the fault processing apparatus to fail if the fault processing of the apparatus is unavailable.
2. Description of the Prior Art
The increasing level of complexity in systems requiring high-level reliability can make the determination of system dependability a daunting task. Reliability, safety, availability, and maintainability are examples of dependability metrics that are typically of interest.
One of the central attributes of a dependable system is the probability of properly handling a fault within the system, given that a fault has occurred. This attribute is commonly referred to as the fault coverage conditional probability or, simply, fault coverage. The reliability, safety, and availability of a system are all heavily influenced by the fault coverage value. For example, if the fault coverage of a system is 1, then the system is defined to always operate in a safe manner. Typically, fault coverage refers to a fault-tolerant system which has some type of active reconfiguration; that is, a system which contains spare or redundant modules or components which can be brought on-line when a fault is detected in a currently active module.
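The definition above can be written compactly as a conditional probability. The symbol C is an assumption introduced here for illustration; the text names the quantity but does not assign it a symbol.

```latex
C = P(\text{fault properly handled} \mid \text{fault occurred}), \qquad 0 \le C \le 1
```

Under this notation, the example in the text corresponds to C = 1.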
In comparison to fault-tolerant systems, fail-safe systems also have a fault coverage probability which determines the overall system safety, but this class of system is typically interested in fault detection only. A fault is considered covered in a fail-safe system if it is detected and the system is configured to shut down properly. Thus, the fault coverage probabilities for fault-tolerant systems and fail-safe systems both include the probability that a fault will be properly detected and processed by the system. For this reason, the term fault processing is used herein to include the concepts of both fault-tolerance and fail-safe.
Previous fault injection techniques can be placed into three general categories: (1) fault injection at the hardware level on an operational prototype; (2) fault injection at the software level on an operational prototype of a fault processing system; and (3) fault injection into a simulation of the fault processing system. Fault injection has been used over the past decade to evaluate various performance characteristics of a fault processing system.
The accepted method for measuring the fault coverage of a fault processing system is the evaluation of system fault processing capability using a stochastic approach, i.e., via the random injection of faults. The general approach of these methods is to inject a fault into the system to see if the fault processing portion of the system correctly handles the fault. The apparatus under test (AUT) can be a system simulation or an operational system. These techniques, while providing insight into the behavior and types of faults that a fault processing system can sustain, lack the statistical foundation for high-reliability dependability validation. One drawback of randomly selected faults is the large probability that the injected fault will remain latent. Fault injection experiments that report a fault latency percentage of 50% or greater generally involve the injection of transient faults. It is well known that 80% or more of all faults which occur in ground-based computer applications are transient in nature. All of these methods view the system as a black box, and little analysis is performed to determine which faults lead to failures for the system without fault processing.
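The stochastic approach described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Fault` fields, the outcome labels "covered", "uncovered", and "latent", the `estimate_coverage` function, and the `toy_inject` stand-in for an AUT are hypothetical and are not taken from any prior art method. The sketch draws the fault location, type, and time at random and shows how latent faults reduce the number of trials that actually contribute to the coverage estimate.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    location: int  # hypothetical index of the faulted site
    kind: str      # hypothetical fault type label
    time: int      # hypothetical injection time step


def estimate_coverage(inject, n_trials, rng, n_locations=64,
                      kinds=("stuck-at-0", "stuck-at-1", "transient"),
                      n_steps=1000):
    """Inject randomly chosen faults and tally the outcomes.

    `inject` is an assumed callable that applies a fault to the AUT and
    returns "covered", "uncovered", or "latent".
    """
    counts = Counter()
    for _ in range(n_trials):
        fault = Fault(rng.randrange(n_locations),
                      rng.choice(kinds),
                      rng.randrange(n_steps))
        counts[inject(fault)] += 1
    # Latent faults never exercise the fault processing, so only activated
    # faults carry information about coverage.
    activated = counts["covered"] + counts["uncovered"]
    return {
        "coverage": counts["covered"] / activated if activated else None,
        "latent_fraction": counts["latent"] / n_trials,
        "counts": dict(counts),
    }


def toy_inject(fault):
    """Stand-in for a real AUT; the rules below are arbitrary illustrations."""
    if fault.kind == "transient" and fault.time % 2 == 0:
        return "latent"
    return "uncovered" if fault.location == 0 else "covered"


result = estimate_coverage(toy_inject, 500, random.Random(0))
print(result)
```

The design point is that the estimate is computed over activated faults only, so a high latent fraction means many injection trials are spent without improving the estimate.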
Generally, prior art fault detection methods determine the fault coverage of a fault processing system via fault injection, using random selection of the fault location, the fault type, and the fault insertion time. A few techniques employ a Failure-Modes-and-Effects Analysis (FMEA) to determine the location of faults. Usually, FMEA is performed at a very high level and relies heavily on engineering intuition. Additionally, FMEA usually does not provide any information on the fault type or fault timing for the fault list of interest.
For fault injection to be effective in measuring the fault coverage of an ultra-reliable, life-critical fault processing system, some method for determining the location, value, and timing associated with the fault to be injected is essential. The purpose of fault injection for determining fault coverage is to inject a fault into the system which has a high probability of producing a failure in the absence of system fault processing. Specifically, the ideal injected fault should always cause a system failure if the system's fault processing is not utilized. It is believed that earlier fault injection methods have not performed this type of deterministic analysis. For validation of a high-reliability system to be feasible utilizing a fault injection technique, a method for determining the faults which lead to system failures in the absence of system fault processing is needed, so that the system can be selectively reconfigured, when possible, to improve system fault coverage.
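The idea of identifying, ahead of injection, the faults that cause a failure when no fault processing is present can be illustrated on a toy example. This is only a sketch under stated assumptions and is not the method of the invention: the three-input circuit `out = (a AND b) OR c`, the node names, the stuck-at fault model, and the exhaustive screen are all hypothetical choices made for illustration. A fault is kept only if, for a given input vector, it changes the output of the unprotected circuit.

```python
from itertools import product

# Hypothetical node names for the toy circuit out = (a AND b) OR c.
NODES = ("a", "b", "c", "n1", "out")


def simulate(a, b, c, stuck=None):
    """Evaluate the circuit; `stuck` is an optional (node, value) stuck-at fault."""
    def force(name, value):
        # Override the node value if the stuck-at fault sits on this node.
        return stuck[1] if stuck and stuck[0] == name else value

    a, b, c = force("a", a), force("b", b), force("c", c)
    n1 = force("n1", a & b)
    return force("out", n1 | c)


def failure_causing_faults(inputs):
    """Return (node, value, input) triples that change the unprotected output."""
    selected = []
    for node, value in product(NODES, (0, 1)):
        for vec in inputs:
            if simulate(*vec, stuck=(node, value)) != simulate(*vec):
                selected.append((node, value, vec))
    return selected


faults = failure_causing_faults([(1, 1, 0)])
print(faults)
```

For the single input vector (1, 1, 0), four of the ten candidate stuck-at faults change the output; the other six would remain latent if injected, which is the waste that random selection incurs.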