The present invention relates to digital computer systems, and more particularly, to a computer system having an apparatus for monitoring redundant processors of the computer system to effectuate a recovery from correctable faults of the multi-lane digital computer system.
Digital computers are utilized to implement complex banking and business systems as well as in the control of industrial processes. The digital computer is also finding widespread usage in the control of vehicles such as aircraft, spacecraft, marine and land vehicles. For example, in present day automatic flight control systems for commercial and military transports, the digital computer is supplanting the analog computer of earlier technology.
Automatic flight control systems are constrained by Federal Air Regulations to provide safe control of the aircraft throughout the regimes in which the automatic flight control system is utilized. Any failure condition which prevents continued safe flight and landing must be extremely improbable. Present day regulations require a probability of less than 10^-9 failures per hour for flight critical components. A flight critical portion of an automatic flight control system is one whose failure will endanger the lives of the persons aboard the aircraft. For example, components of an automatic flight control system utilized in automatically landing the aircraft may be designated as flight critical, whereas certain components utilized during cruise control may be designated as non-critical.
In the present day technology of digital automatic flight control systems, it is generally recognized that a digital computer, including the hardware and extensive software required for a flight control system application program, is of such complexity that the analysis for certification in accordance with Federal Air Regulations is far more time-consuming, expensive and difficult than with the analog computer. The level of complexity and sophistication of the digital technology is increasing such that analysis and proof for certification to the stringent safety requirements are becoming more difficult. It is also becoming more difficult to identify all possible data paths in such systems, and conventional failure mode and effects analysis is therefore essentially ineffective.
Present day digital computers include hundreds of thousands of discrete semi-conductor or integrated circuit elements generically denoted as latches. A latch, well known to those skilled in the art, is a high speed electronic device that can rapidly switch between stable states in response to relatively low amplitude, high speed signals. Latch circuits are utilized to construct most of the internal hardware of a digital computer such as logic arrays, memories, registers, control circuits, counters, arithmetic and logic units, and the like.
As a consequence, digital computers are subject to disturbances which upset the digital circuitry but do not cause permanent physical damage. For instance, since present day digital computers operate at nanosecond and subnanosecond speeds, rapidly changing electronic signals normally flow through the computer circuits, and such signals radiate electromagnetic fields that couple to circuits in the vicinity thereof. These signals can not only set desired latches into desired states, but can also set other latches into undesired states. An erroneously set latch can unacceptably compromise the data processed by the computer or can completely disrupt the data processing flow thereof. Functional error modes without component damage in digital computer based systems are denoted as digital system upset.
Digital system upset can also result from spurious electromagnetic signals, such as those caused by lightning, that can be induced on the internal electrical cables throughout the aircraft. Such transient spurious signals can propagate to internal digital circuitry, setting latches into erroneous states. Additionally, power surges, radar pulses, static discharges and radiation from nuclear weapon detonation may also result in digital system upset. Under such conditions, electrical transients are induced on system lines and data buses, resulting in logic state changes that prevent the system from performing as intended after the transient. Such electromagnetic transients can also penetrate into the memory of the computer and scramble the data stored therein. Since such transients can be induced on wiring throughout an aerospace vehicle, reliability functions based on the use of redundant electronic equipment can also be compromised.
A digital computer is susceptible to complete disruption if an incorrect result is stored in any of the memory elements associated with the computer. These upsets increase the number of unconfirmed removals and adversely affect the ratio of mean time between failures (MTBF) to mean time between unconfirmed removals (MTBUR) of the computer. Safety-critical digital avionic computer applications such as fly-by-wire or autoland achieve their very high reliability numbers through the use of redundancy and cannot tolerate system upset due to transient conditions such as electromagnetic interference (EMI), inherent noise, lightning, electromagnetic pulses (EMP), high energy radio frequency (HERF) or transient radiation effects on electronics (TREE). Safety-critical digital computers must be able to tolerate such transient upsets without affecting the performance of the critical application. As newer digital technologies are introduced, the amount of energy necessary to change the state of a memory element is rapidly dropping, thereby making these elements more susceptible to upset due to EMI, lightning, EMP, HERF or TREE.
Prior redundant digital systems have provided recovery from transient disturbances to digital circuitry with a "transfusion" of data. Following the detection of a digital circuit upset, "transfused" data was transmitted from unaffected redundant digital circuits to the upset digital circuits in order to restore the upset circuitry to the correct processing state. One such prior redundant system comprised three digital processing elements. If one element was determined to be faulty by the other two, it was isolated. The two remaining elements then transmitted the current long term state variables, which are specific to the digital application, to the isolated element. The isolated element then resumed processing and was readmitted to the system if the processing was performed for a given number of computation frames without another detected fault.
Thus, prior systems do not provide for the recovery of digital processing elements if all processing elements are upset by a transient disturbance. Prior systems depend on the interaction between remaining undisturbed redundant components and the upset components for error identification and correction.
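The prior "transfusion" scheme and its limitation can be illustrated with a minimal sketch. It is not taken from any actual prior system: the function name `vote_and_transfuse` and the representation of each lane's long term state variables as a dictionary are assumptions made only for illustration, and the readmission step after a given number of fault-free computation frames is omitted.

```python
from collections import Counter


def vote_and_transfuse(lanes):
    """Illustrative only: repair one upset lane out of three from the other two.

    `lanes` is a list of three dicts holding each lane's long term state
    variables (a hypothetical representation). Returns the index of the lane
    that was repaired, or None if all lanes already agree.
    """
    # Take a hashable snapshot of each lane so the lanes can be compared.
    snapshots = [tuple(sorted(lane.items())) for lane in lanes]
    majority, count = Counter(snapshots).most_common(1)[0]

    if count == len(lanes):
        return None  # no disagreement detected

    if count < 2:
        # Every lane disagrees with every other lane: no healthy donor
        # exists, so transfusion cannot restore a correct state.
        raise RuntimeError("no two lanes agree; transfusion has no healthy donor")

    # Exactly one lane differs from the majority; overwrite it with donor data.
    faulty = next(i for i, snap in enumerate(snapshots) if snap != majority)
    lanes[faulty].clear()
    lanes[faulty].update(dict(majority))
    return faulty
```

When a single lane is upset, the two agreeing lanes act as donors. When a transient upsets all three lanes differently, the sketch can only raise an error, which mirrors the limitation of prior systems described above.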
The present invention thus provides a system which allows the digital processing system to recover from disruption of all redundant elements. Each digital processing element is able to recover from "soft" hardware faults or other "transient" type disturbances by restoring its own previous non-faulty state without interaction with other redundant elements within the digital system. Because this technique is transparent to the software function being executed on a digital processor and is independent of the level of redundancy of the digital components, system design flexibility is increased over present day systems.
The disadvantages of prior systems are overcome by maintaining copies of present and past values of system state variables which are protected from the effects of electromagnetic environments or other external disturbances (e.g., nuclear particles). This capability provides a needed level of immunity for critical digital avionics (such as autoland, displays or fly-by-wire) to such external environments. Specifically, following disruption of a digital processing element, an appropriate copy of each state variable provided by that processing element is restored from the protected memory associated with that element. With the state variables re-established at their legitimate values, data processing by the disrupted processing element is restored. In addition to protected memory, the present invention includes various monitoring options.
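The restoration of state variables from protected memory can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the class `ProtectedMemory`, the function `run_frame`, the `depth` of retained frames, and the `is_valid` monitor are all hypothetical names, and ordinary program memory stands in for storage that would in practice be hardened against upset.

```python
import copy


class ProtectedMemory:
    """Hypothetical stand-in for upset-protected storage of state variables.

    Keeps the present copy plus a limited number of past copies.
    """

    def __init__(self, depth=2):
        self._depth = depth   # number of frames retained (assumed value)
        self._frames = []     # oldest copy first, newest copy last

    def save(self, state):
        # Store an independent copy so later changes cannot alter it.
        self._frames.append(copy.deepcopy(state))
        if len(self._frames) > self._depth:
            self._frames.pop(0)

    def restore(self, frames_back=0):
        # frames_back=0 returns the newest copy, 1 the one before it, etc.
        return copy.deepcopy(self._frames[-1 - frames_back])


def run_frame(state, memory, step, is_valid):
    """Run one computation frame; roll back to protected state on upset.

    `step` computes the next state and `is_valid` is a monitor that flags
    an upset; both are supplied by the caller (assumptions of this sketch).
    """
    candidate = step(copy.deepcopy(state))
    if is_valid(candidate):
        memory.save(candidate)   # commit the good state
        return candidate
    return memory.restore()      # upset detected: restore own last good state
```

The protected memory is assumed to be seeded with a known good state before the first frame. Each processing element recovers from its own copy, so the rollback does not rely on any other redundant element remaining undisturbed.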