1. Technical Field
This invention relates to real-time computing systems and more specifically to a method for recovering a real-time computer system after a transient radiation event such as that caused by a cosmic ray striking a trace within a semiconductor device.
2. Background Art
Several types of radiation are known to induce adverse effects within microelectronic devices. It is well known that the energetic particles in space radiation, including protons and heavy ions, can cause anomalies in electronic equipment, such as flight critical computers, onboard satellites, spacecraft and aerial vehicles flying at high altitudes. A single energetic particle can deposit sufficient charge in an integrated circuit to change the state of internal storage elements and may also cause more complex internal behavior. The observable changes of state can include bit-flips, latched power conditions, and short circuits. A bit-flip, such as from logic ‘0’ to logic ‘1’, is an example of a transient fault mode known as a single event upset (SEU). A latched power condition is an example of a potentially catastrophic fault mode known as a single event latch-up (SEL). A short circuit in an integrated circuit typically results in a hard failure, which is typically mitigated by redundant circuitry. In order to protect a spacecraft against a single flight critical computer failure, redundant computers are typically employed.
FIG. 1 shows a triple redundant flight critical computer architecture that might typically be used onboard a spacecraft. Redundant sensors 15 provide inputs to a first computer 11, a second computer 12, and a third computer 13. Each computer includes inputs 17, processing 18, and outputs 19. The outputs from each of the redundant computers 11-13 are routed to redundant actuators 16. In addition, the redundant computers 11-13 are interconnected with a cross-channel data link (CCDL) 14 which allows the computers to interchange data.
It is important that each flight critical computer, such as a spacecraft navigation computer, is able to detect and recover from both an SEU and an SEL because an undetected transient fault can develop into a hard failure. It is known in the art to recover a flight critical computer via re-initialization schemes, for example by cycling power (on-off-on). Although cycling power to the computer clears SEU or SEL induced errors, it also results in a period of time when the computer is not available for such tasks as spacecraft stabilization.
FIG. 2 depicts a block diagram of a typical flight critical computer that may be found onboard a spacecraft. The computer includes a central processing unit (CPU) 21 connected by an address bus 22 and a data bus 23 to a nonvolatile memory (NVM) 24, a memory-mapped input/output controller 25, and random access memory (RAM) hardware modules 26-28. The input/output controller 25 is connected to inputs 17 using, for example, analog to digital converters, and to outputs 19 using, for example, relay drivers. An operational flight program (OFP) is loaded during an initialization sequence to the random access memory (RAM) modules 26-28 and is then executed.
A known method of recovering a flight critical computer in a radiation environment, without accessing another computer via the CCDL 14 and re-initializing, is to save the state data of the computer, stored in random access memory (RAM) 26-28, to a radiation-hardened temporary storage while executing the operational flight program (OFP). When a single event upset (SEU) is detected, for example by a parity check, data is dumped from the temporary storage back to the random access memory (RAM) 26-28. However, this prior art approach fails to ensure data integrity in the temporary storage and also can cause the redundant computers 11-13 to lose synchronization. In addition, the recovered computer needs to be re-admitted before the next SEU occurs and there can be critical phases of a mission during which re-initialization is not practical, such as initializing a de-orbit burn.
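The save-and-restore scheme described above can be illustrated with a minimal sketch. It is not taken from any particular flight computer: the class name `CheckpointedRam`, the helper `parity`, the use of one even-parity bit per word, and the Python lists standing in for RAM modules 26-28 and the radiation-hardened temporary storage are all assumptions made for illustration.

```python
def parity(word: int) -> int:
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1


class CheckpointedRam:
    """Hypothetical model of RAM with per-word parity and a backup copy."""

    def __init__(self, words):
        self.ram = list(words)                       # stands in for RAM 26-28
        self.ram_parity = [parity(w) for w in self.ram]
        self.backup = list(self.ram)                 # stands in for temporary storage

    def write(self, addr, value):
        """Normal write path: store the word and update its parity bit."""
        self.ram[addr] = value
        self.ram_parity[addr] = parity(value)

    def save_state(self):
        """Copy the current state data to the temporary storage."""
        self.backup = list(self.ram)

    def upset_detected(self):
        """Parity check: True if any word disagrees with its stored parity."""
        return any(parity(w) != p for w, p in zip(self.ram, self.ram_parity))

    def recover(self):
        """Dump the backup into RAM if an upset is detected.

        Returns True when a restore took place, False otherwise.
        """
        if not self.upset_detected():
            return False
        self.ram = list(self.backup)
        self.ram_parity = [parity(w) for w in self.ram]
        return True


if __name__ == "__main__":
    mem = CheckpointedRam([0x0F, 0xA5, 0x00])
    mem.save_state()
    mem.ram[1] ^= 0x04          # simulate a single bit-flip, bypassing write()
    print("restored:", mem.recover())
```

The sketch shares the limitations noted above. Nothing verifies the backup copy before it is restored, and any writes made after the last `save_state` call are lost on recovery. A single parity bit also cannot detect an even number of flipped bits in the same word.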
Recently, much effort has been devoted to the research and design of fault-tolerant control systems. The primary interest of this research is the application of fault detection, identification and reconfiguration (FDIR) to control systems. The common objective of FDIR in flight control systems is to prevent the loss of a vehicle due to faulty sensors, faulty actuators, and damaged control surface installations. A typical FDIR design aims to prevent the flight critical computer 11 from using erroneous data from faulty sensors 15 and from sending out improper control commands to faulty actuators 16. A typical flight critical computer includes both a fault detection and identification algorithm and a fault accommodation/reconfiguration algorithm. The fault detection and identification algorithm, typically embedded in the control laws, is devised to detect faulty components in the various control loops. The reconfiguration algorithm is typically a vehicle-specific adaptive mechanism that cancels the adverse effects caused by faulty components, such as ‘hard-over’ actuators.
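As a rough illustration of the sensor side of FDIR, the sketch below flags redundant readings that disagree with the others and excludes them from the value passed to the control laws. The text above does not specify a detection algorithm; mid-value comparison against a fixed tolerance is a commonly used approach that is assumed here, and the function name `select_sensor_value` and the `tolerance` parameter are hypothetical.

```python
from statistics import median


def select_sensor_value(readings, tolerance):
    """Return (selected value, indices of readings flagged as faulty).

    A reading is flagged when it differs from the median of all readings
    by more than `tolerance`. The selected value is the mean of the
    readings that were not flagged.
    """
    if len(readings) < 3:
        raise ValueError("need at least three redundant readings")
    mid = median(readings)
    faulty = [i for i, r in enumerate(readings) if abs(r - mid) > tolerance]
    healthy = [r for i, r in enumerate(readings) if i not in faulty]
    if not healthy:
        # Possible only with an even number of readings; fall back to the median.
        return mid, faulty
    return sum(healthy) / len(healthy), faulty


if __name__ == "__main__":
    # Third sensor is stuck at an implausible value.
    value, faulty = select_sensor_value([10.0, 10.2, 55.0], tolerance=1.0)
    print(value, faulty)
```

This covers only detection and exclusion of a faulty sensor. Reconfiguration for a faulty actuator, such as a ‘hard-over’ actuator, is vehicle-specific and is not attempted here.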
There is a long-felt need for an autonomous mechanism that rapidly recovers a flight critical computer from a radiation event. Such a mechanism should augment the physical redundancy and fault detection, identification and reconfiguration (FDIR) already available in the art.