The natural radiation environment on Earth and in space can cause both short-term and long-term degradation of semiconductor devices used in computers. This hazard is a problem for computers where fault-free operation is required. In addition to these radiation effects, computer chips are subject to random failures due to undetected defects and weaknesses that evolve over the course of time. Trace radioactive materials in semiconductor packages may also cause faults. When computers must operate for long periods in a remote environment, or must operate without fault for long periods of time, the need for systems which are protected from faults or failure becomes critical. Remote or vulnerable environments include remote oil platforms, submarines, aircraft and isolated sites such as Antarctica. Systems that operate in Earth orbit and beyond are especially vulnerable to this radiation hazard.
The presence of cosmic rays and particularly high energy particles in space near the Van Allen radiation belt can produce a disturbance called a single event effect (SEE) or a single event upset (SEU). The magnetic field of the Earth deflects particles and changes their energy levels and attributes. The Earth's magnetic field also traps charged particles that travel from the Sun and other stars toward the Earth. Some particles that are not trapped by the Earth's magnetic field are steered by that field into our atmosphere near the poles. These particles can penetrate the electronic devices aboard satellites.
When high energy particles and gamma rays penetrate a semiconductor device, they deposit charge within the computer circuit and create transients and/or noise. This phenomenon can "upset" the memory circuits. One type of upset occurs when a single bit of data stored in the chip's memory changes its value due to radiation. In this instance, a logical value of "one" can change to a logical value of "zero" and vice versa. An upset may be generally defined as a mis-stated output of a component. This output may comprise one or more signal bits.
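The single-bit upset described above can be modeled in a few lines of Python. This is only an illustrative sketch: the function name flip_bit and the example word are assumptions chosen for the illustration, and a real upset occurs in hardware rather than in software.

```python
def flip_bit(word: int, bit: int) -> int:
    """Return `word` with one bit inverted, modeling a single event upset."""
    return word ^ (1 << bit)


# Hypothetical 8-bit memory word used only for illustration.
stored = 0b10110010

# A particle strike on bit 4 changes a logical "one" to a logical "zero".
upset = flip_bit(stored, 4)
print(f"{stored:08b} -> {upset:08b}")  # prints "10110010 -> 10100010"

# A strike on bit 0 changes a logical "zero" to a logical "one".
print(f"{stored:08b} -> {flip_bit(stored, 0):08b}")  # prints "10110010 -> 10110011"
```

Because the exclusive-OR inverts the selected bit whatever its value, the same function models both directions of the upset.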
Radiation can also induce a "latchup" of circuits in a chip. Latchup is an electrical condition of a semiconductor in which the output of the device is driven and held at saturation because of the deposition of charge within a semiconductor circuit by the high energy particles. The cause of the latched condition may be only a temporary upset. If power is removed then reapplied, the component may function normally.
The upset rate of a component depends on the construction features of the chip, including its size, operating voltage, temperature and internal circuit design. The upset rate for a particular part can vary from ten per day for a commercial one megabit random access memory (RAM) chip, to one every 2800 years for a radiation-hardened one megabit RAM. A radiation-hardened component is a device that has been specially designed and built to resist the hazards of radiation. These devices tend to be much more expensive and slower than conventional chips. They typically lag the state-of-the-art by one to three years.
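The spread between the two upset rates quoted above can be worked out directly. The following Python sketch assumes a constant average upset rate, a 365.25-day year and a hypothetical five-year mission; none of these assumptions comes from the cited figures.

```python
DAYS_PER_YEAR = 365.25  # assumption: average length of a year in days


def expected_upsets(rate_per_day: float, mission_days: float) -> float:
    """Expected number of upsets, assuming a constant average rate."""
    return rate_per_day * mission_days


commercial_rate = 10.0                        # upsets per day, commercial RAM
hardened_rate = 1.0 / (2800 * DAYS_PER_YEAR)  # upsets per day, hardened RAM

# How many times more often the commercial part is upset.
ratio = commercial_rate / hardened_rate
print(f"rate ratio: {ratio:.3e}")

# Hypothetical five-year mission, used only to make the numbers concrete.
mission_days = 5 * DAYS_PER_YEAR
print(f"commercial: {expected_upsets(commercial_rate, mission_days):.1f} upsets")
print(f"hardened:   {expected_upsets(hardened_rate, mission_days):.5f} upsets")
```

Under these assumptions the commercial part is upset about ten million times more often than the radiation-hardened part.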
Current computer chips that are utilized in conventional applications on the ground are generally not threatened by cosmic radiation. This immunity is due to the protection offered by the Earth's atmosphere. There are, however, some terrestrial uses of computer chips that are subject to radiation upsets. Trace radioactive material in semiconductor packages can cause an upset. Radiation emitted from diagnostic or therapeutic medical devices can similarly affect semiconductor components. As devices become more complex, secondary and tertiary particles from atmospheric cosmic ray penetration will cause them to suffer upsets.
In their paper entitled Review of Commercial Spacecraft Anomalies and Single-Event-Effect Occurrences, Catherine Barillot et al. describe the upset events that have been observed in space since 1975. The events and their origins are traced and analyzed. Data are presented which show that the number of upsets encountered on the TDRS satellite follows the modulation of cosmic rays with the solar cycle.
L. D. Akers of the University of Colorado published a paper concerning upsets entitled Microprocessor Technology and Single Event Upset Susceptibility. The author points out that current satellites which employ powerful microcircuits to control every aspect of a spacecraft are increasingly vulnerable to heavy ion induced SEU. He predicts that the advent of microdevices having lower power and higher speed, combined with the expected increase of particles from large solar flares, will result in much higher rates of SEUs. He believes that the designers of small satellites will need to implement SEU mitigation techniques to ensure the success of future satellite missions.
Previous attempts to mitigate the radiation hazards that affect computer chips have met with mixed results. Work relating to fault tolerant computers has principally dealt with error detection at a high level, for example, at the register level. In their paper entitled Synchronization and Fault-Masking in Redundant Real-Time Systems, IEEE, 1984, pp. 152-157, C. M. Krishna et al. describe hardware synchronization and software synchronization of a number of phase-locked clocks in the presence of "malicious" failures. The authors describe a simple hardware voting strategy in which the output values of a clock are compared with the incoming signal of a reference clock. Non-faulty clocks are locked in phase. As processors fail, they are replaced by spares if they are available. This method applies to many redundant computers having multiple clocks which operate in close synchrony. Krishna et al. also describe the use of software algorithms to enable a system of many processors with their own clocks to operate in close synchrony.
Software solutions like those utilized by Krishna et al. employ voting procedures at software block levels. These solutions generally involve comparing computer outputs at a high level to see if each separate computer agrees with the others. Such systems pay a heavy price in weight, bulk, cost and power consumed to achieve high levels of redundancy.
Krishna et al. do not address the problem of momentary upset of a system. Nor have the authors addressed the problem of faults limited to within any one component of a processor. The recognition of a fault in a system, such as that described by Krishna et al., means the entire device has failed. But a radiation upset does not necessarily result in a failed device. The upset condition can be temporary.
In a paper entitled Single Event Upset and Latchup Sensitive Devices in Satellite Systems published by The Johns Hopkins University Applied Physics Laboratory, Richard M. Maurer and James D. Kinnison recognize the hazard of single event upset and latchup. They offer a decision tree as an aid to eliminating single event effects sensitive parts from a design, or using SEE sensitive parts "as-is" to provide some measure of protection in the design of circuits in which the parts will function.
In their article on Reliability Modeling and Analysis of General Modular Redundant Systems, published in IEEE Transactions on Reliability, Vol. R-24, No. 5, Dec. 1975, Francis Mather and Paulo T. de Sousa explain that hardware redundancy has been used to design fault-tolerant digital systems. They describe majority voting of redundant modules and quadded logic (replacement of every hardware gate by four gates) as hardware redundant structures.
E. J. McClusky published a paper entitled Hardware Fault Tolerance, in the Sixteenth Annual Institute in Computer Science at the University of California at Santa Cruz, Aug. 25, 1986. McClusky describes the basic concepts and techniques of hardware fault tolerance. One such technique is "error masking," the ability to prevent errors from occurring at system outputs. Error masking is achieved, according to McClusky, with "massive redundancy." System outputs are determined by the voting of signals that are identical when no failures are present. The usual forms of massive redundancy are triple-modular redundancy, quad components, quadded logic and voted logic. McClusky reports that voted logic involves connecting all copies of a module to a voter. The outputs of each module are passed through the voter before being transmitted to other parts of the system. Voting is carried out at a high level in the entire system. Quadded logic is described as replacing every logic gate with four gates. Faults are automatically corrected by the interconnection pattern of the gates. Such a scheme would clearly impose weight, power and cost penalties on the system that is being protected from radiation hazards.
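The voting principle behind triple-modular redundancy can be illustrated with a bitwise two-out-of-three majority function. The Python sketch below is a software model for illustration only, not the circuit of any cited reference; the names majority_vote and disagreeing_modules and the example words are assumptions.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise two-out-of-three majority of three module outputs."""
    return (a & b) | (a & c) | (b & c)


def disagreeing_modules(a: int, b: int, c: int) -> list[int]:
    """Indices (0, 1, 2) of the modules whose output differs from the vote."""
    voted = majority_vote(a, b, c)
    return [i for i, out in enumerate((a, b, c)) if out != voted]


correct = 0b1010  # hypothetical 4-bit output when no failures are present

# One module upset: the voter masks the error and identifies module 2.
print(bin(majority_vote(correct, correct, 0b0010)))   # prints "0b1010"
print(disagreeing_modules(correct, correct, 0b0010))  # prints "[2]"

# Two modules upset in different bit positions are still masked,
# because each bit position is voted independently.
print(bin(majority_vote(0b1011, 0b0010, correct)))    # prints "0b1010"
```

The voter masks an error only while at least two of the three modules agree on each bit; two modules upset in the same bit position would outvote the correct one.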
While McClusky suggests that triple-modular redundancy can be applied to small units of replication as well as an entire computer, he does not describe how such a scheme might be implemented, except for the use of error correcting codes and certain software programs. Error correcting code methods rely on error correcting circuitry to change faulty information bits and are, therefore, only effective when the error correcting circuitry is fault-free. The software methods cited by McClusky require that several versions of a program be written independently. Each program runs on the same data and the outputs are obtained by voting. Such a technique may be effective for temporary faults, but requires a great deal of time and system overhead.
In Critical Issues in the Design of a Reconfigurable Control Computer, IEEE, 1984, pp. 36-41, H. Schmidt et al. discuss the numerous critical issues which must be resolved prior to the detailed design of a reconfigurable computer, such as the computers used for real-time control systems.
In his paper entitled Fault Tolerant Multiprocessor Link and Bus Network Architectures, published in the IEEE Transactions on Computers, Vol. 34, No. 1, Jan. 1985, pp. 33-45, Dhiraj K. Pardha presents a general class of regular networks which provide optimal or near optimal fault tolerance for a large number of computing elements interconnected in an integrated system.
Earlier high performance processors comprised a number of logic chips, a floating point chip and many memory chips used as local caches. Current processors contain all of these functions in a single chip. This centralization of functions within a single chip permits the application of fault-tolerant methods to just a few chips in a processor system at the chip hardware level. As more and more devices are contained on one substrate, the processor chips become more and more dense. These devices, particularly complementary metal oxide, gallium-arsenide and bipolar semiconductor devices, among others, are then increasingly affected by radiation.
In their book entitled Reliable Computer Systems, Second Edition, published by Digital Press in 1992, Daniel P. Siewiorek and Robert S. Swarz discuss error detection, protective redundancy, fault tolerant software and the evaluation criteria involved in reliability techniques. Chapter Three of this text presents a comparison of computer output at the system level, register or transfer level, bus level, module level and gate level. The authors describe triple-redundant modules plus voting that isolates or corrects fault effects before they reach module outputs. They also discuss the use of back-up spares in a hybrid redundant system, that is, a core of N modules operating in parallel, with a voter determining system output and a set of back-up spare modules that can be switched in to replace failed modules in the core. FIG. 3-31 of this text depicts majority voting at the outputs of three modules and/or three voters. Siewiorek and Swarz state that this technique results in signal delay and decreases in performance. FIG. 3-57 shows the fault tolerant computer of Hopkins, Smith and Lala (1978) implemented from a set of processor/cache, memory and input/output modules, all interconnected by redundant, common serial buses. The computations of the computer are performed in triads: three processor/caches and three memories performing the same operation in voting mode and synchronized at the clock level. Because most processing utilizes the cache, voting is not performed at every clock cycle, but whenever data is transferred over the bus. The authors do not describe a system that includes multiple processors coupled by individual buses to a voter, which has a voter output connected to a single memory. Siewiorek and Swarz do not describe a system whose processor outputs and inputs are voted at each clock cycle. The authors do not discuss means for controlling power to dysfunctional processors as part of such a system.
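The hybrid redundant arrangement discussed by Siewiorek and Swarz, a voted core of modules backed by switchable spares, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the design in the cited text: the class name HybridRedundantSystem is hypothetical, modules are modeled as plain functions, and a module is replaced on its first disagreement even though an upset may be only temporary.

```python
from collections import Counter
from typing import Callable, List

Module = Callable[[int], int]  # assumption: a module maps an input to an output


class HybridRedundantSystem:
    """A voted core of modules plus spares that replace disagreeing modules."""

    def __init__(self, core: List[Module], spares: List[Module]) -> None:
        self.core = list(core)
        self.spares = list(spares)

    def step(self, data: int) -> int:
        """Run every core module, vote, and switch in spares where needed."""
        outputs = [module(data) for module in self.core]
        voted, count = Counter(outputs).most_common(1)[0]
        if count <= len(outputs) // 2:
            raise RuntimeError("no majority among core modules")
        for i, out in enumerate(outputs):
            # Simplification: replace on the first disagreement.
            if out != voted and self.spares:
                self.core[i] = self.spares.pop(0)
        return voted


def good(x: int) -> int:
    return x + 1  # hypothetical correct computation


def failed(x: int) -> int:
    return 0      # hypothetical failed module with a stuck output


system = HybridRedundantSystem(core=[good, failed, good], spares=[good])
print(system.step(1))       # prints "2"; the failed module is outvoted
print(len(system.spares))   # prints "0"; the spare has been switched in
```

Replacing a module permanently is the costly response; because a radiation upset can be temporary, a practical system might first attempt to restore the disagreeing module before consuming a spare.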
The development of a fault tolerant computer based on commercially available parts for use in military and commercial space vehicles would offer significant operational and cost advantages. Such an invention would offer higher levels of performance and would cost less to manufacture than existing approaches based on radiation hardened chips. The invention could be used for remotely installed computer systems and other processors that are subjected to random failures or to a radiation environment which produces single event upsets at unacceptably high rates. Such radiation upset protection would discover and correct errors. It would be extremely beneficial if a fault tolerance method could be applied at a very low hardware level, for example, within a processor chip, instead of at the computer register or the output of computer modules. Such a system would fill a long-felt need in the specialized computer and satellite industries.