The natural radiation environment on Earth and in space can cause both short-term and long-term degradation of the semiconductor devices used in computers. This hazard is a problem for computers where fault-free operation is required. In addition to these radiation effects, computer chips are subject to random failures due to undetected defects and weaknesses that develop over time. Trace radioactive materials in semiconductor packages may also cause faults.
When computers must operate for long periods in a remote environment, or where these devices must operate without fault for long periods of time, the need for systems that are protected from faults or failure becomes critical. Systems that operate in Earth orbit and beyond are especially vulnerable to this radiation hazard.
The presence of cosmic rays, and particularly of high-energy particles, in space can produce a disturbance called a single event effect (SEE) or a single event upset (SEU). When high-energy particles penetrate a semiconductor device, they deposit charge within the computer circuit and create transients and/or noise. This phenomenon can “upset” the memory circuits. One type of upset occurs when a single bit of data stored in the chip's memory changes its value due to radiation. In this instance, a logical value of “one” can change to a logical value of “zero” and vice versa. An upset may be generally defined as a misstated output of a component. This output may comprise one or more signal bits.
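The single-bit upset described above can be modeled in software as inverting one bit of a stored word. The following is a minimal sketch for illustration only; the function name `flip_bit` and the 8-bit word width are assumptions and do not come from the original text.

```python
def flip_bit(word: int, bit: int, width: int = 8) -> int:
    """Return `word` with one bit inverted, modeling a single event upset.

    `width` is the assumed size of the storage cell in bits (hypothetical).
    """
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    mask = (1 << width) - 1
    # XOR with a one-hot mask turns a stored "one" into "zero" and vice versa.
    return (word ^ (1 << bit)) & mask


stored = 0b10110010          # value held in memory before the particle strike
upset = flip_bit(stored, 4)  # bit 4 changes from "one" to "zero"
print(f"{stored:08b} -> {upset:08b}")
```

Applying the same flip twice restores the original value, which reflects that an upset of this kind changes stored state without permanently damaging the cell.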
The number of embedded storage elements and their susceptibility to upset drive computer transient fault rates. The upset rate of computer systems is dominated by unprotected main memory. Main memory can be protected against upsets by error correction codes (ECC) stored in added memory components. Once this effective technique is employed, the processors and associated “backside” caches become the predominant source of upsets.
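To illustrate how an error correction code can mask a single upset in memory, the sketch below uses a Hamming(7,4) code, which stores three check bits alongside four data bits. This particular code is chosen only for brevity and is an assumption; the text does not specify which ECC is used, and practical memory systems typically use wider codes. The function names are hypothetical.

```python
def hamming74_encode(data: int) -> list[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7).

    Check bits sit at positions 1, 2 and 4; data bits at 3, 5, 6 and 7.
    """
    d1, d2, d3, d4 = [(data >> i) & 1 for i in range(4)]
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(code: list[int]) -> tuple[int, int]:
    """Correct at most one flipped bit; return (data, syndrome).

    A syndrome of 0 means no error was seen; otherwise it is the
    1-based position of the bit that was corrected.
    """
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 | (s2 << 1) | (s4 << 2)
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    data = c[2] | (c[4] << 1) | (c[5] << 2) | (c[6] << 3)
    return data, syndrome


word = hamming74_encode(0b1011)
word[5] ^= 1                      # simulate an upset at position 6
print(hamming74_correct(word))    # recovers the data and reports the position
```

A code of this kind corrects one flipped bit per codeword; two upsets in the same codeword would be miscorrected unless an additional overall parity bit is stored.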
Traditional approaches to improving system reliability attempt to prevent faults by design improvements, improved component quality and/or component shielding from environmental effects by radiation hardening. Radiation hardened devices, however, tend to be much more expensive and slower than conventional chips. They typically lag the state-of-the-art by several years.
Redundancy at the computer level is often used to improve system reliability as well. These highly redundant systems, however, are also very costly, due to the number of components that are necessarily replicated.
Alternative approaches using redundancy at the processor component level can be very costly due to the added signal propagation delays they introduce. These propagation delays force a reduction in the speed at which the processors can interact with system buses. This results in lower overall computer performance in throughput and I/O bandwidth. The consequence for the overall system is that a greater number of computers is required than would be needed with non-redundant systems. In extreme cases, certain embedded applications cannot be fielded because very low latency computational requirements cannot be met.
For some applications, such as operator critical systems, computer control systems must be able to operate reliably in the presence of multiple faults. These applications are not addressed by traditional voting methods, which determine single signal output values, or sets of single signal output values, independent of the correctness of other related signal values. More sophisticated schemes are required, such as schemes that consider majority agreement among entire processing engines.
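The difference between voting on individual signals and requiring agreement among entire processing engines can be sketched as follows. This is an illustrative model only, assuming each engine's output is a non-negative integer word; the function names `bitwise_vote` and `engine_vote` are hypothetical and do not describe any particular prior-art circuit.

```python
from collections import Counter
from typing import Optional, Sequence


def bitwise_vote(outputs: Sequence[int]) -> int:
    """Vote each bit position independently of every other bit."""
    width = max(outputs).bit_length()
    result = 0
    for bit in range(width):
        ones = sum((o >> bit) & 1 for o in outputs)
        if 2 * ones > len(outputs):
            result |= 1 << bit
    return result


def engine_vote(outputs: Sequence[int]) -> Optional[int]:
    """Return the word a strict majority of engines agree on, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if 2 * count > len(outputs) else None


# Three engines, each disagreeing with the others in different bits.
outputs = [0b1100, 0b1010, 0b0110]
print(f"{bitwise_vote(outputs):04b}")  # a word that no engine produced
print(engine_vote(outputs))            # None: no majority, fault is flagged
```

In this example the per-bit vote silently assembles an output that none of the engines computed, whereas the whole-word vote reports that no majority exists, which gives the system a basis for detecting that it has been overwhelmed by multiple faults.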
In most instances, these applications must also prevent propagation of errors beyond the fault detection and fault masking boundaries and into the main memory and I/O systems when correct operation is overwhelmed by multiple faults. Under these circumstances, computer control systems must be able to reliably halt and preclude perpetuation of faulty operation. Current state-of-the-art processing element voting schemes do not provide reliable operation for this class of systems.
Accordingly, there is a need for a fault tolerant digital system that is capable of detecting faults, preventing their propagation through the system, and restoring proper operation to the faulty component(s), without significantly degrading the computational performance provided by unprotected, equivalent, commercial systems.