I. Field of the Invention
This invention relates generally to a method and apparatus for increasing the reliability of digital computing systems, and more particularly to an improved error recovery technique for complex digital-computer-based control systems.
II. Discussion of the Prior Art
As the reliability and performance of digital computers have steadily improved, they have become the predominant technology for control systems. For example, in avionics systems, digital computers are now considered for such concepts as fly-by-wire or autoland in which a digital computer is interposed between the pilot and the aircraft control surfaces. In such systems, the safety of the aircraft depends on continued faultless performance of the digital control system.
To achieve the requisite reliability in such applications, prior art systems have typically resorted to hardware redundancy with majority voting. A minimum of three identical processing elements are configured to perform identical control computations on identical data inputs. In the event a disparity occurs between the multiple processing elements, any minority result is ignored and one of the majority results is used to effect the control function. Majority-of-three voting allows the system to tolerate a single failure, while majority-of-five voting allows the system to tolerate up to two failures. In more elaborate systems, a persistently failing unit may be voted out of the system and replaced with a so-called "hot spare".
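The voting arrangement described above can be illustrated with a minimal sketch. It assumes each processing element reports a directly comparable result; the function names `majority_vote` and `tolerable_failures` are hypothetical and chosen only for illustration, not taken from any prior art system.

```python
from collections import Counter


def majority_vote(results):
    """Return (majority_value, dissenting_indices) for a list of lane results.

    Raises ValueError when no value is held by a strict majority of lanes.
    """
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(results):
        raise ValueError("no strict majority among processing elements")
    # Lanes whose result differs from the majority are ignored by the voter.
    dissenters = [i for i, r in enumerate(results) if r != value]
    return value, dissenters


def tolerable_failures(n_elements):
    """Number of disagreeing elements a strict majority vote can outvote."""
    return (n_elements - 1) // 2
```

With three elements one failure is outvoted, and with five elements two are, which matches the majority-of-three and majority-of-five cases in the text.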
Various forms of redundancy synchronization, voting and fault monitoring have been proposed or employed in the past. However, while it is recognized that these mechanisms influence the type of recovery technique a particular system may employ, this invention addresses only a manner of recovery and not these other mechanisms.
A fault condition may be induced by a permanent failure of the digital circuitry, in which case it is considered a "hard fault". Alternatively, it may be induced by a transient phenomenon which produces an incorrect result but does not, in fact, damage or alter the subsequent operation of the circuitry. This condition is classified as a "soft fault" or "soft error". Soft errors may be induced by various transient conditions, such as electromagnetic interference (EMI), inherent noise, lightning, electromagnetic pulses (EMP), or high-energy radio frequency (HERF). In this specification, the term "external transient noise condition" is intended to include all of the above sources.
As newer digital technology is introduced, the amount of energy necessary to change the state of a memory element or a logic element tends to decrease, thereby making these elements more susceptible to upset due to EMI, lightning, EMP, or HERF. With careful design, the susceptibility of the control system circuitry to such transient conditions can be greatly reduced. However, it is generally not practical to harden the entire control system against the extremes of any and all interference which may be encountered in an avionics application. It is, therefore, important that safety-critical digital control systems be able to tolerate transient upsets without affecting the performance of the critical application.
Recognizing that transient upsets will occur in certain applications, it is highly desirable to provide an error recovery mechanism whereby a processing element can be restored to a functional condition following a soft error. With an error recovery mechanism, the hardware redundancy can be reserved to cover only the hard error conditions, thereby increasing the overall system reliability. The error recovery mechanism further provides a clear diagnostic distinction between soft and hard errors and, therefore, practically eliminates the incidence of unconfirmed removals, whereby a functionally good processing element is falsely suspected of being faulty due to upsets.
One well-known method of effecting soft error recovery is to design the control algorithm around a computational frame where the time interval of the computational frame generally corresponds to the sampling interval of the control system. When a soft error is detected, the computer is reset to a known state corresponding to the beginning of the current computational frame and the control algorithm is restarted. Soft error recovery in a majority-voting redundant system imposes yet another constraint. The faulty computing element must not only be rolled back to the beginning of the computational frame, but must also be restarted in synchrony with the other computing elements. While it is recognized that the relationships between the redundancy elements within a system impact the operation of the system, this invention defines a method of fault recovery and does not address redundancy management concerns.
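The frame-based rollback described above can be sketched as follows. This is a simplified, single-element illustration under stated assumptions: the state is a small Python dictionary, the fault detector is supplied by the caller, and the class name `FrameRecoverableController` and its method `run_frame` are hypothetical. Resynchronization with other redundant elements is not modeled.

```python
import copy


class FrameRecoverableController:
    """Toy controller that can roll back to the start of the current frame."""

    def __init__(self, initial_state):
        self.state = initial_state
        self._checkpoint = copy.deepcopy(initial_state)

    def run_frame(self, inputs, step, fault_detected):
        """Run one computational frame; return True if a rollback occurred.

        step(state, inputs) mutates the state in place.
        fault_detected(state) returns True when a soft error is suspected.
        """
        # Save the known-good state at the beginning of the frame.
        self._checkpoint = copy.deepcopy(self.state)
        step(self.state, inputs)
        if fault_detected(self.state):
            # Reset to the start-of-frame state and restart the algorithm.
            self.state = copy.deepcopy(self._checkpoint)
            step(self.state, inputs)
            return True
        return False
```

In this sketch the frame is simply re-executed once with the same inputs; a real system would also have to complete the restart within the sampling interval.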
Prior art redundant digital systems have provided recovery from transient disturbances to the digital circuitry with a "transfusion" of data. Following the detection of a digital circuit upset, "transfused" data is transmitted from the unaffected redundant digital circuits to the upset digital circuits in order to restore the upset circuitry to the same state as the unaffected circuits. For example, in a system comprising three digital processing lanes, if one lane is determined to be faulty by the voting logic, it is isolated. The remaining two lanes then transmit the current state variables necessary to complete the recovery. The isolated lane then resumes processing in synchrony with the active lanes and is readmitted to the system after performing a given number of computation frames without another detected fault. It is possible, however, that a single transient disturbance might cause an upset in each of the redundant processing elements. In this instance, the redundancy is defeated and the system crashes, since there is no source of valid data available for transfusion.
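The transfusion scheme, and its failure when every lane is upset, can be illustrated with a minimal sketch. It assumes each lane's state variables fit in a flat dictionary and that the voting logic has already identified the faulty lanes; the function name `transfuse` is hypothetical, and the probationary readmission period is not modeled.

```python
def transfuse(lane_states, faulty):
    """Restore faulty lanes from an unaffected lane.

    lane_states: list of dicts, one per processing lane.
    faulty: set of lane indices flagged by the voting logic.
    Returns a new list in which each faulty lane holds a copy of a healthy
    lane's state. Raises RuntimeError if no healthy lane remains.
    """
    healthy = [s for i, s in enumerate(lane_states) if i not in faulty]
    if not healthy:
        # Every lane was upset: there is no valid source for transfusion.
        raise RuntimeError("no valid lane available for transfusion")
    donor = healthy[0]
    return [dict(donor) if i in faulty else s
            for i, s in enumerate(lane_states)]
```

The `RuntimeError` branch corresponds to the case in which a single disturbance upsets all redundant elements and the redundancy is defeated.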
A further limitation of prior art approaches is that the software required to perform the data transfusion in the event of a fault is specific to the application being performed, to the level of redundancy, and to the redundant architecture employed in the system.
It is highly desirable, and in some instances imperative, that fault-tolerant techniques such as soft-error recovery be totally transparent to the software and independent of the context of the control application.