The following description is provided to assist the understanding of the reader. None of the information provided or references cited is admitted to be prior art to the present invention.
Certain environments require that computer systems in use be extremely reliable. At the same time, some of these environments may be extremely harsh, exposing computer components to potentially catastrophic elements.
One such environment is the space environment. Computer systems placed in space, such as in Earth orbit, are not available for regular maintenance and must, therefore, be guaranteed to perform for the life of the spacecraft. Thus, a computer system mounted on a spacecraft must be highly reliable and robustly tolerant of faults, whether internal or external.
Further, objects in the space environment are subject to various types of radiation that may be extremely harmful to certain computer components. For example, a single radiation element may cause an upset, referred to as a single-event upset (SEU), of either a processor or a memory in a computer system. A computer in the space environment should desirably be tolerant of such single-event upsets.
Developing computer components that are individually tolerant to such upsets can be extremely expensive and inefficient. First, because of the long development cycles, such components generally lag behind state-of-the-art components in performance. For example, a processor designed to be radiation tolerant may be two years old by the time its development is complete. In those two years, the performance of state-of-the-art processors may have more than doubled. Further, hardening such components against faults may make them less cost-effective.
U.S. Pat. No. 5,903,717 discloses a computer system for detecting and correcting errors from SEUs. The system includes a plurality of processors (CPUs) whose outputs are voted at each clock cycle. Any CPU output signal that does not agree with a majority of the CPU output signals causes an error signal to be produced. The system reacts to the error signals by generating a system management interrupt. When the detected error is caused by a single-event upset, software responds to the system management interrupt by initiating re-synchronization of the plurality of CPUs.
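The per-cycle voting described above can be illustrated with a minimal software sketch. This is not the circuit or method of the cited patent; the function name `vote`, the representation of CPU outputs as integers, and the handling of the no-majority case are all assumptions made for illustration only.

```python
from collections import Counter


def vote(outputs):
    """Vote the CPU outputs for one clock cycle (illustrative sketch).

    Returns (majority_value, error_flags), where error_flags[i] is True
    when CPU i disagrees with the majority value.
    """
    # Find the most common output value and how many CPUs produced it.
    value, count = Counter(outputs).most_common(1)[0]

    # Assumption: if no value is held by a strict majority, there is no
    # trustworthy result, so every CPU is flagged and None is returned.
    if count * 2 <= len(outputs):
        return None, [True] * len(outputs)

    # A CPU whose output differs from the majority raises an error flag.
    return value, [out != value for out in outputs]


# Example cycle: the third of three CPUs has one flipped bit.
majority, errors = vote([0xA5, 0xA5, 0xA4])
needs_interrupt = any(errors)  # stands in for the system management interrupt
```

In this sketch, any raised flag would correspond to the error signal that triggers the system management interrupt; the re-synchronization of the CPUs that follows is outside the scope of the example.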