The natural radiation environment on Earth and in space can often cause short term and long term degradation of the semiconductor devices that are used in computers. This hazard is a problem for computers where fault-free operation is required. In addition to these radiation effects, computer chips are subject to random failures due to undetected defects and weaknesses that evolve over the course of time. Trace radioactive materials in semiconductor packages may also cause faults. When computers must operate for long periods in a remote environment, or where these devices must operate without fault for long periods of time, the need for systems which are protected from faults or failure becomes critical. Remote or vulnerable environments include remote oil platforms, submarines, aircraft and isolated sites such as Antarctica. Systems that operate in Earth orbit and beyond are especially vulnerable to this radiation hazard.
The presence of cosmic rays, and particularly high energy particles in space near the Van Allen radiation belt, can produce a disturbance called a single event effect (SEE) or a single event upset (SEU). The magnetic field of the Earth deflects particles and changes their energy levels and attributes. The Earth's magnetic field also traps charged particles that travel from the Sun and other stars toward the Earth.
Some particles that are not trapped by the Earth's magnetic field are steered by that field into our atmosphere near the poles. These particles can penetrate the electronic devices aboard satellites.
When high energy particles and gamma rays penetrate a semiconductor device, they deposit charge within the computer circuit and create transients, noise, or both. This can upset the memory circuits and induce a "latchup" of circuits on the chip. An upset may be generally defined as a mis-stated output of a component. This output may comprise one or more signal bits. Latchup is an electrical condition of a semiconductor in which the output of the device is driven and held at saturation because of the deposition of charge within a semiconductor circuit by the high energy particles. Devices based on complementary metal oxide semiconductor (CMOS) architectures are some of the most likely to be affected. A CMOS device contains parasitic PNP and NPN bipolar transistors on the same substrate, coupled so that the collector current of each drives the base of the other. Latchup occurs when the stray charge starts a current in one of these parasitic transistors. The current is fed back to the other transistor. If the gain of the circuit is greater than unity as a result of the feedback loop, the device moves to one state continuously and is said to be in latchup. This condition can cause a short between power and ground, local heating, and migration of the semiconductor material, and can eventually destroy the device. The correction of errors caused by device latchup usually involves reduction or removal of power to a processing unit or other component to prevent catastrophic damage that could result from a latched condition. The cause of the latched condition may be only a temporary upset. When power is reapplied, the component may function normally.
The upset rate of a component depends on the construction features of the chip, including size of the chip and internal circuit design. The upset rate for a particular part can vary from ten per day for a commercial, one megabit random access memory chip (RAM), to one every 2,800 years for a radiation-hardened one megabit RAM. A radiation-hardened component is a device that has been specially designed and built to resist the hazards of radiation. These devices tend to be much more expensive and slower than conventional chips. They generally tend to lag the state of the art by several years.
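To put the two quoted rates on a common scale, the arithmetic can be sketched as follows. The rates are the ones given above; the 365.25-day year and the variable names are assumptions made only for this illustration.

```python
# Compare the two upset rates quoted in the text on a per-day basis.
DAYS_PER_YEAR = 365.25  # assumption: Julian year

commercial_upsets_per_day = 10.0                         # commercial 1 Mbit RAM
hardened_upsets_per_day = 1.0 / (2800 * DAYS_PER_YEAR)   # hardened 1 Mbit RAM

ratio = commercial_upsets_per_day / hardened_upsets_per_day

print(f"hardened part: {hardened_upsets_per_day:.2e} upsets per day")
print(f"commercial part upsets about {ratio:.1e} times more often")
```

Under these figures the two parts differ by roughly seven orders of magnitude in upset rate.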
Current computer chips that are utilized in conventional applications on the ground are generally not threatened by cosmic radiation. This immunity is due to the protection offered by the Earth's atmosphere. There are, however, some terrestrial uses of computer chips that are subject to radiation upsets. Radiation emitted from diagnostic or therapeutic medical devices can affect semiconductor components. As devices become more complex, secondary and tertiary particles from atmospheric cosmic ray penetration will cause them to suffer upsets.
In their paper entitled Review of Commercial Spacecraft Anomalies and Single-Event-Effect Occurrences, Catherine Barillot et al. describe the upset events that have been observed in space since 1975. The events and their origins are traced and analyzed. Data are presented which show that the number of upsets encountered on the TDRS satellite follows the modulation of cosmic rays with the solar cycle.
L. D. Akers of the University of Colorado published a paper entitled Microprocessor Technology and Single Event Upset Susceptibility. The author points out that current satellites, which employ powerful microcircuits to control every aspect of a spacecraft, are increasingly vulnerable to heavy ion induced SEU. He predicts that the advent of microdevices having lower power and higher speed combined with the expected increase of particles from large solar flares will result in much higher rates of SEUs. He believes that the designers of small satellites will need to implement SEU mitigation techniques to ensure the success of future satellite missions.
A publication sponsored by NASA, entitled Single Event Criticality Analysis, Feb. 15, 1996, written by Allan Johnston, describes SEUs and related effects such as "latchup" in electronic devices caused by the passage of high energy particles. He points out the difficulty in overcoming the latchup at the system or subsystem level by sensing excess current, which is the telltale signature of a latchup. This difficulty arises because power must be removed from the affected component within milliseconds. Many different latchup paths and current signatures exist in complex circuits.
Johnston reports that high-energy protons and heavy ions found in radiation environments on Earth and in space lose energy as they pass through materials. This effect is primarily caused by ionization processes. The particles deposit a dense charge as they pass through an electronic component's P-N junction. Some of this charge will be collected at the junction contacts. Charge can also be collected from outside the junction. The net effect is a very short duration current pulse at the internal circuit node which is struck by the particle. A large fraction of the total charge collected by the circuit node occurs in about 200 picoseconds. If the charge collected from the particle strike exceeds the minimum charge required for the component to switch states, for example from non-conducting to conducting, then the passage of the particle will upset or otherwise affect the circuit. The minimum or "critical charge" depends on the design of the specific device which is struck. Several effects can be induced in integrated circuits by high-energy ion strikes:
(1) transient effects, such as single-event upsets and multiple-bit upsets, that change the state of internal storage elements, but which can be simply reset to normal operation;
(2) potentially catastrophic events, such as single-event latchup, that may cause destruction of a component unless quickly corrected; and
(3) single-event hard errors, which cause catastrophic failure of a single internal transistor within a complex circuit.

1. The majority voted signal is used by the agreeing CPUs to continue CPU processing operations without interruption; if the CPU disagreement persists, a latchup condition may be indicated and the disagreeing CPU is powered down, then re-powered;
2. The disagreeing CPU is disabled from further participation in voting;
3. A system management interrupt (SMI) is generated to the other CPUs; and
4. At a later time, software initiates a re-synchronization process that recovers the disabled CPU.
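The four numbered steps above can be sketched in software terms. This is a toy model and not the circuit itself: the class name `VotedCpuSet`, the event strings, and the choice of two disagreements in a row as the meaning of "persists" are all assumptions introduced here.

```python
from collections import Counter

PERSIST_LIMIT = 2  # assumption: disagreements in a row that suggest latchup


class VotedCpuSet:
    """Toy software model of majority voting across redundant CPUs."""

    def __init__(self, n_cpus=3):
        self.enabled = [True] * n_cpus
        self.misses = [0] * n_cpus   # consecutive disagreements per CPU
        self.events = []             # record of corrective actions

    def vote(self, outputs):
        """Vote one cycle of outputs and return the majority value."""
        active = [i for i, on in enumerate(self.enabled) if on]
        value, count = Counter(outputs[i] for i in active).most_common(1)[0]
        if 2 * count <= len(active):
            raise RuntimeError("no majority among enabled CPUs")
        for i in active:
            if outputs[i] == value:
                self.misses[i] = 0
                continue
            self.misses[i] += 1
            if self.misses[i] >= PERSIST_LIMIT:
                # Step 1: persistent disagreement may indicate latchup.
                self.events.append(f"power-cycle cpu{i}")
            self.enabled[i] = False                      # step 2
            self.events.append(f"SMI: cpu{i} disabled")  # step 3
        return value

    def resync(self, cpu):
        """Step 4: software later brings a disabled CPU back into voting."""
        self.enabled[cpu] = True
        self.events.append(f"resync cpu{cpu}")


cpus = VotedCpuSet()
print(cpus.vote([0x5A, 0x5A, 0x5B]))  # cpu2 is outvoted and disabled
cpus.resync(2)
print(cpus.vote([0x11, 0x11, 0x10]))  # cpu2 disagrees again after resync
print(cpus.events)
```

The agreeing CPUs receive the voted value on every cycle, so a single disagreement does not interrupt processing; only the bookkeeping for the outvoted CPU changes.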
Most junction-isolated integrated circuits contain parasitic bipolar transistors that can form a four-layer region similar to that of a silicon controlled rectifier. These bipolar structures are not involved in normal operation of a CMOS device. They can be triggered by transient currents. All CMOS designs use special guardbands and clamp circuits at the input/output (I/O) terminals to prevent latchup in standard applications. However, in a radiation environment, transient signals are no longer confined to I/O terminals. It is possible for the current pulses from heavy ions or protons to trigger latchup in the internal region of the CMOS device as well as in I/O circuitry. Once latchup occurs, the four-layer region will be switched into a conducting state. It will remain in that state until the voltage in the latched region is reduced to a very low value. During latchup, currents can be very high. This is a serious problem for space systems. Johnston points out the difficulty in overcoming latchup at the system or subsystem level by sensing excess current, which is the signature of a latchup, because power must be removed from the affected component within milliseconds to avoid possible catastrophic damage. Many different latchup paths and current signatures exist in complex circuits.
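The current-sensing approach that Johnston discusses can be illustrated with a toy monitor. The trip point, the sample trace, and the function name are hypothetical. A real protection circuit would act in hardware, and, as noted above, a single threshold may not catch every latchup path in a complex circuit.

```python
def first_overcurrent(samples_ma, limit_ma):
    """Return the index of the first sample above limit_ma, or None."""
    for index, current in enumerate(samples_ma):
        if current > limit_ma:
            return index
    return None


# Hypothetical supply-current trace in milliamps, one sample per 0.1 ms.
trace = [120, 118, 121, 119, 640, 655, 660]
SAMPLE_PERIOD_MS = 0.1
LIMIT_MA = 300  # assumed trip point, well above the normal draw

hit = first_overcurrent(trace, LIMIT_MA)
if hit is not None:
    print(f"trip at {hit * SAMPLE_PERIOD_MS:.1f} ms: remove power, then reapply")
```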
Previous attempts to mitigate the radiation hazards that affect computer chips have met with mixed results. Work relating to fault tolerant computers has principally dealt with error detection at a high level, for example, at the register level. In their paper entitled Synchronization and Fault-Masking in Redundant Real-Time Systems, IEEE, 1984, pp. 152-157, C. M. Krishna et al. describe hardware synchronization and software synchronization of a number of phase-locked clocks in the presence of "malicious" failures. The authors describe a simple hardware voting strategy in which the output values of a clock are compared with the incoming signal of a reference clock. Non-faulty clocks are locked in phase. As processors fail, they are replaced by spares if they are available. This method applies to many redundant computers having multiple clocks which operate in close synchrony. Krishna et al. also describe the use of software algorithms to enable a system of many processors with their own clocks to operate in close synchrony.
Software solutions like those utilized by Krishna et al. employ voting procedures at software block levels. These solutions generally involve comparing computer outputs at a high level to see if each separate computer agrees with the others. Such systems pay a heavy price in weight, bulk, cost and power consumed to achieve high levels of redundancy.
Krishna et al. do not address the problem of a momentary upset of a system. Nor have the authors addressed the problem of faults limited to within any one component of a processor. The recognition of a fault in a system, such as that described by Krishna et al., means the entire device has failed. However, a radiation upset does not necessarily result in a failed device. The upset condition may be temporary.
In a paper entitled Single Event Upset and Latchup Sensitive Devices in Satellite Systems published by The Johns Hopkins University Applied Physics Laboratory, Richard M. Maurer and James D. Kinnison recognize the hazard of single event upset and latchup. They offer a decision tree as an aid to eliminating parts that are sensitive to single event effects (SEE) from a design, or using SEE-sensitive parts as-is to provide some measure of protection in the design of circuits in which the parts will function. Maurer and Kinnison presume that the latched state will have some distinctly different characteristics from the normal operating state, so that a latchup protection circuit can be designed. While avoiding the use of radiation-hardened devices, their method of hardware protection imposes weight, volume and power penalties. There may also be performance impacts on the device itself, especially with respect to the speed of operation.
In their article on Reliability Modeling and Analysis of General Modular Redundant Systems, published in IEEE Transactions on Reliability, Vol. R-24, No. 5, December 1975, Francis P. Mathur and Paulo T. de Sousa explain that hardware redundancy has been used to design fault-tolerant digital systems. They describe majority voting of redundant modules and quadded logic (replacement of every hardware gate by four gates) as hardware redundant structures.
E. J. McCluskey published a paper entitled Hardware Fault Tolerance, in the Sixteenth Annual Institute in Computer Science at the University of California at Santa Cruz, Aug. 25, 1986. McCluskey describes the basic concepts and techniques of hardware fault tolerance. One such technique is "error masking," the ability to prevent errors from occurring at system outputs. Error masking is achieved, according to McCluskey, with "massive redundancy." System outputs are determined by the voting of signals that are identical when no failures are present. The usual forms of massive redundancy are triple-modular redundancy, quad components, and quadded and voted logic. McCluskey reports that voted logic involves connecting all copies of a module to a voter. The outputs of each module are passed through the voter before being transmitted to other parts of the system. Voting is carried on at a high level in the entire system. Quadded logic is described as replacing every logic gate with four gates. Faults are automatically corrected by the interconnection pattern of the gates. Such a system would clearly incur weight, power and cost penalties on the system that is being protected from radiation hazards.
While McCluskey suggests that triple-modular redundancy can be applied to small units of replication as well as an entire computer, he does not describe how such a scheme might be implemented, except for the use of error correcting codes and certain software programs. Error correcting code methods rely on error correcting circuitry to change faulty information bits and are, therefore, only effective when the error correcting circuitry is fault-free. The software methods cited by McCluskey require that several versions of a program be written independently. Each program runs on the same data and the outputs are obtained by voting. Such a technique may be effective for temporary faults, but requires a great deal of time and system overhead.
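A voter of the kind used in triple-modular redundancy can be expressed as a bitwise majority function. The sketch below is the generic textbook form and is not taken from the cited paper; the function name and the sample words are illustrative.

```python
def majority3(a, b, c):
    """Bitwise 2-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


good = 0b1011_0010
upset = good ^ 0b0000_1000   # one module suffers a single flipped bit

voted = majority3(good, upset, good)
print(f"{voted:08b}")
# Bits where a copy differs from the voted word show which module disagreed.
print(f"{upset ^ voted:08b}")
```

Because the vote is taken bit by bit, the output stays correct even when different modules are wrong in different bit positions, provided no single bit is wrong in two modules at once.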
H. Schmidt et al. discuss the numerous critical issues which must be resolved prior to a detailed design of a reconfigurable computer, such as computers used for real time control systems, in Critical Issues in the Design of a Reconfigurable Control Computer published by the IEEE, 1984, pp. 36-41.
In his paper entitled Fault Tolerant Multiprocessor Link and Bus Network Architectures, published in the IEEE Transactions on Computers, Vol. 34, No. 1, January 1985, pp. 33-45, Dhiraj K. Pradhan presents a general class of regular networks which provide optimal or near optimal fault tolerance for a large number of computing elements interconnected in an integrated system.
Earlier high performance processors comprised a number of logic chips, a floating point chip and many memory chips used as local caches. Current processors contain all of these functions in a single chip. This centralization of functions within a single chip permits the application of fault-tolerant methods to just a few chips in a processor system at the chip hardware level. As more and more devices are contained on one substrate, the processor chips become more and more dense. These devices, particularly complementary metal oxide, gallium-arsenide, and bipolar semiconductor devices and others, are then increasingly affected by radiation.
In their book entitled Reliable Computer Systems, Second Edition, published by Digital Press in 1992, Daniel P. Siewiorek and Robert S. Swarz discuss error detection, protective redundancy, fault tolerant software and the evaluation criteria involved in reliability techniques. Chapter Three of this text presents a comparison of computer output at the system level, register or transfer level, bus level, module level and gate level. The authors describe triple-redundant modules plus voting that isolates or corrects fault effects before they reach module outputs. They also discuss the use of back-up spares in a hybrid redundant system, that is, a core of N modules operating in parallel, with a voter determining system output and with a set of back-up spare modules that can be switched in to replace failed modules in the core. FIG. 3-31 of this text depicts majority voting at the outputs of three modules and/or three voters. Siewiorek and Swarz aver that this technique results in signal delay and decreases in performance. FIG. 3-57 shows the fault tolerant computer of Hopkins, Smith and Lala (1978) implemented from a set of processor/cache, memory and input/output modules, all interconnected by redundant, common serial buses. The computations of the computer are performed in triads: three processor/caches and three memories performing the same operation in voting mode and synchronized at the clock level. Because most processing utilizes the cache, voting is not performed at every clock cycle, but whenever data is transferred over the bus. The authors do not describe a system that includes multiple processors coupled by individual buses to a voter, which has a voter output connected to a single memory. Siewiorek and Swarz do not describe a system whose processor outputs and inputs are voted at each clock cycle. The authors do not discuss means for controlling power to dysfunctional processors as part of such a system.
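The hybrid redundant arrangement described above, a voted core backed by spares, can be sketched as follows. Representing modules as Python callables, the class name `HybridCore`, and the policy of replacing a module on its first disagreement are assumptions made for illustration, not details from the book.

```python
from collections import Counter


class HybridCore:
    """Voted core of modules, with spares switched in for disagreeing ones."""

    def __init__(self, core, spares):
        self.core = list(core)      # modules currently voting
        self.spares = list(spares)  # idle back-up modules

    def step(self, x):
        """Run every core module on x and return the majority output."""
        outputs = [module(x) for module in self.core]
        value, count = Counter(outputs).most_common(1)[0]
        if 2 * count <= len(self.core):
            raise RuntimeError("no majority in core")
        for slot, out in enumerate(outputs):
            if out != value and self.spares:
                # Replace the disagreeing module with the next spare.
                self.core[slot] = self.spares.pop(0)
        return value


def healthy(x):
    return x + 1


def stuck(x):  # hypothetical failed module whose output is stuck at 0
    return 0


system = HybridCore(core=[healthy, stuck, healthy], spares=[healthy])
print(system.step(41))
print(system.core[1] is healthy)
```

This replacement policy treats any disagreement as a permanent failure and consumes a spare, even when the disagreement came from a temporary upset.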
The development of a fault tolerant computer based on commercially available parts, for use in military and commercial space vehicles, that would prevent permanent damage from latchup would offer significant operational and cost advantages. Such an invention would offer higher levels of performance and would cost less to manufacture than existing approaches based on radiation hardened chips. The invention could be used for remotely installed computer systems and other processors that are subject to random failures or to a radiation environment which produces single event upsets at unacceptably high rates. Such radiation upset protection would discover and correct errors. This fault protection system would provide a means to power off or power down affected processors without interfering with a running software application. It would be extremely beneficial if a fault tolerance method could be applied at a very low hardware level, for example, within a processor chip, instead of at the computer register or the output of computer modules. Such a system would satisfy a long felt need in specialized computer and satellite industries.