Lin and Siewiorek estimated in “Error Log Analysis: Statistical Modeling and Heuristic Trend Analysis”, IEEE Transactions on Reliability, Vol. 39, No. 4, 1990, that about 90% of the crashes experienced by computing systems are due to intermittent and transient faults. It has also been determined that most permanent faults are preceded by intermittent faults. The rate of occurrence of intermittent faults is expected to increase as transistor and interconnect dimensions shrink (see, for example, C. Constantinescu, “Impact of Deep Submicron Technology on Dependability of VLSI Circuits”, Proc. of the International Conference on Dependable Systems and Networks, Washington, D.C., USA, 2002). Early detection of failure-prone circuits or subsystems, such as processors, memory, interconnects, and input/output channels and devices, significantly improves the availability of computing systems. Isolating a failing component before a crash occurs allows preventive maintenance to be scheduled, a spare to be activated seamlessly, or the system to degrade gracefully (if spares are not available).
Conventional failure prediction mechanisms rely on counting the errors that occur within a component or a subsystem. A failure is considered imminent when the number of errors reaches a predetermined threshold over a given period of time. When that happens, the component is isolated and further action is taken (for instance, a spare is activated, followed by replacement of the failing part). This scheme is also known as a “leaky bucket” and was initially used for traffic control in asynchronous transfer mode networks (see, for example, A. W. Berger et al., “Performance Characteristics of Traffic Monitoring, and Associated Control, Mechanisms for Broadband Packet Networks”, IEEE Global Telecommunications Conference, Vol. 1, 1990). The main problem with this type of approach is that it can easily mispredict failures. For instance, a system crash can occur before the error threshold is reached when the errors arrive in spikes separated by relatively long periods with no errors. Such behavior is common for intermittent faults in VLSI circuits. If the error threshold is instead set to a very low value to avoid this scenario, a good component may be replaced because of a small number of transient errors induced by environmental conditions.
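The threshold scheme and its two failure modes can be illustrated with a minimal sketch. This is not taken from the cited references; the names LeakyBucket, record_error, and first_alarm, the linear drain model, and all numeric values are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class LeakyBucket:
    """Hypothetical leaky-bucket error counter (illustrative only)."""
    threshold: float      # assumed: level at which a failure is predicted
    leak_interval: float  # assumed: seconds needed to drain one error
    level: float = 0.0
    last_time: float = 0.0

    def record_error(self, t: float) -> bool:
        """Register an error at time t; return True if the threshold is reached."""
        # Drain the bucket in proportion to the time since the last error.
        drained = (t - self.last_time) / self.leak_interval
        self.level = max(0.0, self.level - drained)
        self.last_time = t
        # Each error adds one unit to the bucket.
        self.level += 1.0
        return self.level >= self.threshold


def first_alarm(error_times, threshold, leak_interval):
    """Return the time of the first predicted failure, or None if none is raised."""
    bucket = LeakyBucket(threshold=threshold, leak_interval=leak_interval)
    for t in error_times:
        if bucket.record_error(t):
            return t
    return None


# Bursts of three errors separated by long quiet periods: the bucket drains
# between bursts, so a threshold of 5 is never reached.
bursty = [0, 1, 2, 1000, 1001, 1002, 2000, 2001, 2002]
missed = first_alarm(bursty, threshold=5, leak_interval=60)

# A very low threshold flags a component after only a few transient errors.
transient = [0, 1, 2]
false_alarm = first_alarm(transient, threshold=2, leak_interval=60)
```

With these assumed values, the bursty sequence never raises an alarm even though it contains nine errors, while the low threshold isolates a component after three closely spaced transient errors. These correspond to the two misprediction cases described above.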