The present invention relates to fault tolerant computers and more specifically to a method and apparatus for operating in an error free manner when a microprocessor error is induced.
Three basic factors contributing to the functioning of a computer, and more specifically of a microprocessor or microprocessors included in a computer, are power, performance and environment-induced radiation effects. New models or generations of computers seek to achieve higher performance at lower power levels. Additionally, in applications in which microprocessors are exposed to ionizing radiation, it is necessary to provide a mechanism for maintaining reliable operation when it is a virtual certainty that the ionizing radiation will cause processor errors. An example of an application in which sufficient levels of radiation will be encountered to cause errors is spaceborne computers. In applications in which particle or ionizing radiation is not present, errors can be caused by other fault mechanisms such as electrically induced noise pulses.
The most significant error events are Single Event Upset (SEU) and Single Event Functional Interrupt (SEFI). SEU is defined by NASA as “radiation-induced errors in microelectronic circuits caused when charged particles (usually from the radiation belts or from cosmic rays) lose energy by ionizing the medium through which they pass, leaving behind a wake of electron-hole pairs.” In other words, an SEU is a change of state or a transient induced in a device by an energetic particle such as a cosmic ray or proton. SEUs are “soft errors” in that a reset or rewriting of the device restores normal device behavior thereafter. However, the error must be accounted for when it is included in data to be acted upon. An SEFI is a condition in which an SEU in the device's control circuitry places the device into a test mode, halt, or undefined state. The SEFI halts normal operations, and is believed to require a power reset to recover. SEU error rates in a nominal application for commercial microprocessors can range from 0.2 to 9 MeV/mg/cm². This range of rates is reflected in processor performance, depending on the processor and its environment, from a quite acceptable single upset per year to an unacceptable multiple upsets per hour.
Improving SEU performance when designing microprocessor systems commonly results in increased power consumption. However, this technique does not solve the problem of SEUs and SEFIs due to radiation or electrically induced noise pulses. One prior art approach comprises utilizing radiation hardened microprocessors, which are not susceptible to the errors induced by radiation. However, radiation hardened microprocessors are not available in state of the art versions. Over the past ten years they have lagged non-hardened processors by two to three generations. For example, currently available radiation hardened microprocessors include a 0.35 micron SOI (Silicon on Insulator) microprocessor and a 0.25 micron bulk CMOS on EPI (Complementary Metal Oxide Semiconductor on Epitaxial Layer) processor. However, state of the art microprocessors utilize 0.13 and 0.10 micron geometries. Radiation hardened microprocessors also lag the state of the art in terms of MIPS (Million Instructions Per Second) capability.
Another known technique is TMR, triple modular redundancy, applied at the system level, also known as spatial redundancy. Three individual or discrete processors run instructions in parallel and synchronously. The outputs of the processors are sent to a comparator that utilizes voting logic. When an SEU occurs in one processor, the other two processors will still produce matching outputs. The comparator will pass the majority output. SEFI errors are treated as SEUs. However, the processor experiencing the SEFI will remain offline until reset or otherwise corrected. TMR triples the processor power requirements compared to a single processor. Synchronizing the processors is difficult, and operation must be slowed with respect to the speed achievable by a single processor.
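The voting step described above can be illustrated with a minimal sketch. It assumes that each processor output is a single comparable value and that the voter compares whole words; the name tmr_vote and the returned list of dissenting processors are hypothetical and are not taken from any particular prior art system.

```python
from typing import List, Optional, Sequence, Tuple


def tmr_vote(outputs: Sequence[int]) -> Tuple[Optional[int], List[int]]:
    """Pass the majority of three processor outputs.

    Returns the majority value (or None when all three disagree) and the
    indices of the processors whose output differs from the majority.
    """
    a, b, c = outputs
    if a == b or a == c:
        majority = a
    elif b == c:
        majority = b
    else:
        # No two outputs match, so there is no majority to pass.
        return None, [0, 1, 2]
    dissenters = [i for i, value in enumerate(outputs) if value != majority]
    return majority, dissenters


# One processor (index 1) suffers an upset; the other two still match.
print(tmr_vote((7, 5, 7)))
```

A processor that appears repeatedly in the list of dissenters would be a candidate for reset, which corresponds to the handling of an SEFI as a persistent SEU described above.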
Time redundancy has been employed at the system level to provide the advantage of redundancy as described above while permitting the use of a single processor. In this technique, the processor either executes the same instruction three times, or executes it two times, compares the results, and executes it a third time when the results do not agree. Each result, or a checksum indicative of the result, is stored, and the stored outputs are compared. Three matching results indicate the absence of an SEU. If there is an SEU, a voting circuit selects the correct result. When the SEU corrupts data, the time redundancy technique will operate correctly. However, if the SEU causes an instruction to be corrupted, the technique will not operate correctly. A bit instructing a wrong operation will cause the wrong operation to be performed all three times. Such SEUs are not detected, and SEFIs are not corrected. An improved form of time redundancy was developed by the Stanford Advanced Research and Global Observations Satellite Project (ARGOS). This technique is described in Oh, N., P. P. Shirvani and E. J. McCluskey, “Error Detection by Duplicated Instructions In Super-scalar Processors,” IEEE Transactions on Reliability, Vol. 49, No. 7, September 2001, pp. 273-284. Many errors were corrected, but still others were not.
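The two-runs-then-a-third variant of time redundancy, and its blind spot for corrupted instructions, can be sketched as follows. This is an illustrative model only: the names run_time_redundant and make_operation are hypothetical, an SEU is simulated by flipping one bit of a result on a chosen call, and a corrupted instruction is simulated by an operation that subtracts where it should add.

```python
from typing import Callable, Optional


def run_time_redundant(operation: Callable[[], int]) -> Optional[int]:
    """Run an operation twice; run it a third time only on disagreement."""
    first = operation()
    second = operation()
    if first == second:
        return first
    third = operation()
    if third == first:
        return first
    if third == second:
        return second
    return None  # all three results differ; no majority exists


def make_operation(x: int, y: int, upset_on_call: Optional[int] = None):
    """Build an add operation whose result is upset on one chosen call."""
    calls = {"count": 0}

    def operation() -> int:
        calls["count"] += 1
        result = x + y
        if calls["count"] == upset_on_call:
            result ^= 0x4  # simulated transient data upset (one bit flipped)
        return result

    return operation


# A transient data upset on the second run is outvoted by the third run.
print(run_time_redundant(make_operation(2, 3, upset_on_call=2)))

# A corrupted instruction (subtract instead of add) repeats identically on
# every run, so the wrong result is accepted without any disagreement.
print(run_time_redundant(lambda: 2 - 3))
```

The second call shows why repetition alone cannot catch an error in the instruction itself: every run agrees on the same wrong answer.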
Another prior art alternative is to build a processor using a commercial, non-radiation hardened integrated circuit process and to apply known RHBD (radiation hardness by design) techniques to improve radiation hardness. Once again, as in the case of radiation hardened processors, die area is increased and operating speed is compromised. Also, while commercial switching logic utilizes simple flip-flops, RHBD logic requires latches built out of many flip-flops and further logic such as inverters. Performance comparable to that of commercial processors which are not radiation hardened is not provided.
Examples of an improved radiation hardened system and a time redundant system are respectively disclosed in my copending patent application Ser. No. 10/435,626 filed May 6, 2003 entitled Fault Tolerant Computer and Ser. No. 10/656,720 (with a coinventor) filed Sep. 8, 2003 entitled Functional Interrupt Mitigation for Fault Tolerant Computer, the disclosures of which are incorporated by reference herein. It is desirable to provide a system in which a minimal amount of radiation hardening need be done. It is desirable to provide a system in which a time redundant system is also made space redundant, but in an efficient, reliable manner. For example, it is desired to avoid the problem of synchronizing a plurality of processors.
There is little patent literature on SEFIs. Many testing efforts with microprocessors do not report SEFIs, or “hangs.” SEUs may take place in any transistor within a complex microprocessor. When the upset occurs in a memory location, whether a register or memory site, this can be measured and corrected. However, when the upset occurs in more subtle ways, the processor may be placed in a state from which it is not recoverable. An example is the case of an induced error in combinatorial logic or in state-machine transistors. It may be initially impossible to observe an error condition within the processor. However, the error may propagate within combinatorial logic. Other unrecoverable faults could include illegal branching, upset induced exceptions, upsets in the program counter or other unobservable faults. Work by such researchers as Dr. James W. Howard of Jackson and Tull Chartered Engineers of Washington, D.C. has demonstrated that SEFIs will occur in Pentium®, PowerPC and other processors. It is highly probable that all microprocessors, both commercial and radiation hardened, will exhibit SEFIs whether they have been previously observed or not. It is therefore highly desirable to provide a way of detecting SEFIs so that they may be responded to, and also to provide a way of responding to them.