A fault-tolerant system is usually designed to handle most of the faults in the system by using concepts such as redundancy. An uncovered component fault may lead to a system or subsystem failure even when adequate redundancy exists. Automatic recovery and reconstruction mechanisms, including fault detection, location, and isolation, play an important role in implementing fault tolerance. The models that consider the effects of imperfect fault coverage are known as imperfect coverage models (IPCM), or simply coverage models (CM).
According to the types of fault-tolerant techniques used in the error-handling mechanism, coverage models are generally classified into component-level fault models and system-level reliability/dependability models. The component-level fault models describe the particular behavior of a system in response to a fault in each component. If the identification and recovery process for a faulty component utilizes its built-in test (BIT) capability, the model is called an element-level coverage model.
In the element-level coverage model, if a fault of a component is not covered, it may lead to a system failure, which is called a single-point failure (or uncovered failure).
An introduction to the conventional imperfect coverage models is provided in NPL 1.
The conventional imperfect coverage model, in particular the element-level coverage model, considers only the identification and isolation of faulty components. A common assumption is that any component may cause a single-point failure of the system if it has not been safely isolated from the system.
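The effect of imperfect coverage on a redundant system can be illustrated with a minimal sketch. It assumes a simplified static model that is not taken from NPL 1: each component is independently operational with probability r, failed and covered with probability (1 - r)c, or failed and uncovered with probability (1 - r)(1 - c), where c is a hypothetical coverage factor. The function name and parameters are illustrative only.

```python
from itertools import product

def parallel_reliability(r, c, n=2):
    """Reliability of n parallel components under a simple coverage model.

    r : probability that a single component is operational (assumed)
    c : coverage factor, i.e. probability that a fault is covered (assumed)
    """
    # Each component is in exactly one of three states.
    probs = {
        "up": r,
        "covered": (1 - r) * c,
        "uncovered": (1 - r) * (1 - c),
    }
    total = 0.0
    # Enumerate every combination of component states.
    for states in product(probs, repeat=n):
        if "uncovered" in states:
            continue  # any uncovered fault is a single-point failure
        if "up" not in states:
            continue  # all components failed: redundancy is exhausted
        p = 1.0
        for s in states:
            p *= probs[s]
        total += p
    return total
```

Under these assumptions, two parallel components with r = 0.9 give a reliability of about 0.99 when coverage is perfect (c = 1.0), and about 0.972 when c = 0.9, which shows how uncovered faults reduce the benefit of redundancy.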
A system may contain some irrelevant components whose status (operational, or failed but covered) does not affect the system if the fault coverage is perfect. A component could be irrelevant in the system from the start, and an initially relevant component could become irrelevant later due to the reconfiguration caused by the failures of other components. However, if the fault coverage is imperfect, an uncovered fault of an irrelevant component may still lead to a system failure and thus become a single-point failure. In such a case, it is important to identify and isolate the irrelevant components in addition to the faulty components, which can significantly improve system reliability by preventing potential future uncovered failures of the irrelevant components.
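The notion of an irrelevant component can be sketched as a check on a Boolean structure function. The following is a minimal illustration, not a method taken from the source: the function is_relevant, the example system (A and B) or C, and the component indices are all hypothetical. A component is treated as relevant if, for at least one combination of states of the remaining components, switching it between operational and failed changes the system state.

```python
from itertools import product

def is_relevant(phi, n, i, failed=frozenset()):
    """Return True if component i can still change the system state.

    phi    : structure function taking a tuple of n booleans (True = working)
    n      : number of components
    i      : index of the component under test (must not be in `failed`)
    failed : indices of components already failed and covered (isolated)
    """
    # Components whose state is still undetermined, excluding i itself.
    free = [j for j in range(n) if j != i and j not in failed]
    for values in product([False, True], repeat=len(free)):
        state = [False] * n  # failed components stay False
        for j, v in zip(free, values):
            state[j] = v
        state[i] = True
        system_with_i = phi(tuple(state))
        state[i] = False
        system_without_i = phi(tuple(state))
        if system_with_i != system_without_i:
            return True  # component i decides the outcome in this state
    return False

# Hypothetical system: (A and B) or C, with A = 0, B = 1, C = 2.
def phi(s):
    a, b, c = s
    return (a and b) or c
```

In this example, B is relevant while A is working, since is_relevant(phi, 3, 1) is True. After a covered failure of A, is_relevant(phi, 3, 1, frozenset({0})) is False: B has become irrelevant, yet under imperfect coverage an uncovered fault of B could still bring the system down unless B is isolated.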