Real-time computer systems are often utilized in mission critical control systems where processed data is critical to maintaining human safety, environmental cleanliness, as well as the integrity of the equipment controlled thereby. For example, real-time computer systems are implemented to control the critical processes employed by oil extraction/refining processes, control of nuclear and chemical processing, aircraft control, patient monitoring in hospitals, and the control of military equipment, as well as others. As such, computing systems devoted to safety and critical controls must provide an extremely high degree of safety and reliability, since these computer systems carry out such processes that are susceptible to system faults or failures, which may result in significant economic losses, as well as exposing individuals to a potentially fatal event.
In particular, a system fault or failure is a physical defect, imperfection, or flaw that occurs within the hardware or software of the computer system. Such system faults can be classified as permanent if they cannot be corrected, or classified as transient if the faults appear and disappear within a short period of time. To prevent such system faults from leading to a complete failure of the computing system's operation, many computing systems operating in a mission critical environment provide fault tolerance, thus allowing the computing system to continue to perform its intended function in the presence of faults. For example, computer systems utilized in safety and critical industrial control are usually embodied using a triple modular redundant (TMR) architecture or a dual redundant architecture, such as a two-out-of-two (2oo2D) diagnostic architecture. The TMR systems usually combine self-diagnostics and fault recovery means with two-out-of-three voting to disable a faulty system component in the case of a permanent fault and to mask transient faults occurring in such components. The TMR system disclosed in U.S. Pat. No. 6,449,732 is able to operate properly in the presence of any single permanent or transient fault; however, in some cases this system may become inoperative when two concurrent system faults occur within system components, such as the output voter or power supply, that can tolerate only a single fault.
In many safety-critical and critical-control applications where two or more faults must be tolerated, the TMR and 2oo2D systems are insufficient. The hybrid triple redundant (HTR) computer system, such as that disclosed in U.S. Pat. No. 6,732,300, includes three identical computing channels operating in parallel, and an output data voter that tolerates up to two faults of its components. Specifically, the HTR system remains operational in the presence of any two permanent faults and may continue to operate properly in the presence of certain kinds of three permanent faults. The HTR system is also able to detect and neutralize many transient faults, whereby the HTR disables the faulty channels, thus switching control of the output to an operational channel, while also automatically restoring system configuration after the disappearance of a transient fault.
Unfortunately, as in any TMR system, the HTR may fail in the event that all three of the HTR channels fail at the same time due to a common cause failure. One major source of a common cause failure is hardware and software design errors. For example, generic software design errors are a significant source of a computing system's failure, as the development of complex modern control systems increases the chances of software design errors. Since each computing channel of the TMR or HTR employs the same operating system, and executes the same control program, a single error or bug within the software can lead to the failure of each channel at the same time. In addition, environmental conditions may also be a source of common cause failures, such as in the case when the operating temperature exceeds the rating of the hardware components in each system channel, which leads to system failure. Thus, if redundant channels of the TMR or HTR are identical, they may suffer from the same design errors and react to environmental stress in exactly the same, but erroneous, way that leads to system failure. In such an event, the system becomes inoperable and must perform a false trip, that is, a shutdown that is not demanded by the process controlled by the computing system. Unplanned shutdowns may result in stopping production and therefore cause vast economic losses. Furthermore, such common cause failures may lead to a condition in which the control system will not be able to shut down the controlled process when it is required. Such a condition is dangerous, as failure of the controlled system can cause the loss of human life and/or damage to expensive equipment.
Transient faults are characterized as those faults that appear randomly in hardware and/or software and that affect some computing components, resulting in computation disruption, but do not permanently damage the computing component. For example, transient faults may be caused by electromagnetic interference created by high-frequency signals or light propagating through the computing system into the communication lines and buses, causing memory elements to be set into erroneous logical states. Additionally, high-energy electromagnetic pulses created by power equipment when switched between ON and OFF states may also cause transient faults. Further, high-energy atomic particles, such as those created by cosmic radiation, may also deposit sufficient energy in the semiconductor elements to set electronic components, such as memories, into erroneous states. Thus, if the system channels are not physically isolated, the probability of a system failure increases due to the occurrence of transient faults, which can affect all system channels simultaneously or close together in time.
A known technique for enhancing computer system reliability is the utilization of multiple redundant channels having dissimilar software and hardware with respect to each other. An early example of the use of dissimilar redundancy is the space shuttle system, which uses five redundant computer channels with two dissimilar sets of software to overcome untested software errors or bugs that could lead to common cause system failure, while using three-out-of-five majority voting for masking transient faults. As such, the five computers maintained by the space shuttle are able to withstand any two permanent or transient failures, while still maintaining safe flight and operation. However, for industrial controls, the five computer redundant system requires use of additional input/output modules working with field sensors and actuators, which significantly increases the size and weight of the system, making it very expensive to implement.
Another redundancy technique, which is used in an aircraft control system, is referred to as a distributed control system (DCS) and is described in U.S. Pat. No. 6,860,452. In this system, the controllers are split into first and second groups such that controllers of the first group have dissimilar software and hardware with respect to controllers of the second group, thereby avoiding a common cause failure path. In this redundancy technique, flight control tasks are distributed among two or three controllers that belong to different groups, such that no single controller has exclusive control of the elevator, aileron, or the rudder. Furthermore, if an entire group of controllers fails and one controller in the neighboring group fails concurrently, the system still allows a pilot to manage flight control of the aircraft. As such, two-thirds of the system could fail and acceptable performance would still be achieved.
The deficiency of this type of distributed control system (DCS) is that the DCS does not provide sufficient protection against transient faults. For example, in U.S. Pat. No. 6,860,452, each controller of the system implements a set of control tasks that are similar but not equivalent to tasks implemented by other controllers, so that the controllers are not able to use a conventional majority voting technique for outvoting possible erroneous results of the control tasks. Since each system task is implemented by two or three controllers, the occurrence of two concurrent transient faults can lead to failure of the system if a pilot is not fast enough to make appropriate corrections. Another negative feature of the DCS control architecture with respect to its use in industrial controls is that the DCS is a system built specifically for use in flight control. Industrial redundant and non-redundant control systems are mostly programmable logic controllers (PLC), which are universal devices that allow end-users to download a control program into PLC memory to implement required control tasks. In comparison to the DCS control system that distributes control tasks among controllers, the redundant PLC usually deploys controllers which implement the same control tasks synchronously with the neighboring controllers. Furthermore, the user can change the control program if it is required for the controlled object.
Another flight control system, such as that disclosed in U.S. Pat. No. 4,622,667, utilizes identical subsystems, each of which includes three processing elements providing dissimilar data processing with respect to each other by using dissimilar software and/or dissimilar hardware. Cross-channel monitoring is included in each subsystem to identify disagreements between the outputs of the processing elements. This procedure allows the system to detect a faulty element, which is then disabled. In particular, the considered system may include nine processing elements arranged into three subsystems to allow the subsystems to remain operational in the presence of two faults of any type with respect to the processing elements. The subsystem, however, also includes three logic elements and two switches for disabling the processing element in the event that it fails. Logic “AND” elements maintained in each subsystem provide an output signal that disengages aircraft actuator equipment when two processing elements concurrently fail. However, if two logic “AND” elements related to different subsystems fail, it can lead to system failure since aircraft actuator equipment cannot be disconnected from the source of power. Another serious drawback of the system is that it is relatively complex and expensive to utilize nine computing elements, while three types of dissimilar software further increase the expense of developing software for the system.
Continuing, U.S. Pat. No. 6,367,031 discloses a flight control system that utilizes multiple core processing modules (CPM), each of which is capable of computing and executing aircraft control commands. The CPM includes two central processor units (CPU) connected to each other for comparing computation results in the first layer of the comparison. A second layer comparison is arranged by using Boolean “AND” elements that compare the results generated by the first layer comparison. The CPMs utilize identical software and hardware, but they are physically and electrically isolated from each other in separate computing cabinets so as to reduce the probability of common cause failure. Each comparison function is configured to detect a difference in CPM processing results and to disable a faulty CPM without affecting neighboring CPM operations. In particular, the considered system, which includes three CPMs, remains operational in the event that any two CPMs fail concurrently. The system, however, can fail in the event that two Boolean “AND” comparison elements related to different CPMs fail at the same time. The system can also fail due to a possible software design error since all CPMs utilize the same or similar software.
U.S. Pat. No. 6,813,527 discloses a control system architecture that comprises computing units having multiple processing units that each operate as a separate partitioned processing unit. Specifically, the system collects input data from sensors of the controlled plant as closed-loop feedback signals via a sensor adapter. The processing units then compute output signals, which are monitored by an actuator adapter. The actuator adapter is a processing device that provides an interface between the computing units and the actuators. When the actuator adapter senses that one of the processing units is not supplying signals that lie within certain tolerances, the actuator adapter transmits a signal to the computing unit that initiates a rapid fault recovery cycle for that processing unit. The system utilizes the conventional technique that involves computing a “mid-value” for output signals of the processing units. This mid-value is then compared to each output signal from each of the processing units for detecting a faulty processing unit. To accomplish rapid recovery in the event that a transient memory fault is detected, the faulty processing unit retrieves the necessary control and logic variables from an additional non-volatile random access memory that is immune from electromagnetic transients and other disturbances, which can affect the integrity of the memory. If the actuator adapter senses a permanent fault, then the appropriate processing unit may be shut down or isolated by the actuator adapter. As such, the considered system provides rapid recovery from transient memory faults for each processing unit. However, in the event that the actuator adapter fails, it can lead to system failure since a faulty adapter will likely transfer the wrong output signal to the associated actuator.
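The mid-value comparison described above can be illustrated with a minimal Python sketch. It is not taken from U.S. Pat. No. 6,813,527; the function name `find_faulty_units`, the use of three processing units, and the tolerance value are all assumptions chosen for illustration.

```python
from statistics import median


def find_faulty_units(outputs, tolerance):
    """Return the mid-value and the indices of units that deviate from it.

    `outputs` holds one output signal per processing unit (hypothetical data).
    A unit is flagged when its signal differs from the mid-value by more than
    `tolerance` (an assumed, application-specific threshold).
    """
    mid_value = median(outputs)
    faulty = [
        index
        for index, value in enumerate(outputs)
        if abs(value - mid_value) > tolerance
    ]
    return mid_value, faulty


# Three units, the third one producing an out-of-tolerance signal.
mid, faulty = find_faulty_units([10.1, 10.0, 14.7], tolerance=0.5)
```

In this sketch the flagged indices would be the units for which a recovery cycle is requested; how the recovery itself proceeds is outside what the sketch models.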
U.S. Pat. No. 7,328,235 discloses multiple processing methods for fault tolerant control systems (FCS), which includes a plurality of processor nodes operating synchronously. During operation, the FCS collects input data from a data source, such as the sensors, by an input-processing node, which performs calculations on the basis of input data, and outputs the calculation results as output data to an output-processing node, such as an actuator adaptor. Each processing node includes means that allow it to detect its own fault and to acquire input data and calculation results from at least one normally operating node to recover after a fault. This method is suitable primarily for detecting transient faults; however, if the processing node suffers a permanent fault, it can still fail in many cases since its normal operation cannot be restored. The processing node can also fail due to a transient fault if the self-diagnostic fails at the same time. Another drawback of the considered method is that the input and output processing nodes have no protection against transient and permanent faults, resulting in the failure of the FCS in the event that either the input or the output processing node fails, thus producing erroneous input/output data.
The hybrid triple redundant (HTR) computer system 3, as shown in FIG. 1 and as disclosed in U.S. Pat. No. 6,732,300, includes three identical central processor modules (CPM) 11a, 11b, 11c, an input module 7, and an output module 70. Each input/output (I/O) module houses three I/O circuits, whereby each input circuit 9 on the input module 7 reads the field data and passes that data to its respective central processor module (CPM) 11 over I/O buses 13a, 13b, and 13c. The three central processor modules 11 operate in parallel as the members of a triad, while the system 3 performs control functions on a cyclical basis. The period of an operation cycle is the scan time, which is primarily composed of the time required for I/O polling and the time required to execute the application program. Serial links 29a, 29b, and 29c are used by the CPMs 11a-c to communicate with each neighboring CPM 11 in read-only mode. Once per scan, the central processors 11a-c are synchronized, while each reads the input data and the diagnostic status of its neighboring CPM 11 module.
Each of the CPMs 11a-c calculates the middle value among three sets of analog input data when it operates with an analog input module 7. Each of the CPMs 11a-c performs two-out-of-three (2-out-of-3) software majority voting of the digital input data when it operates with a digital input module 7. These techniques allow the system to mask possible input transient faults that would otherwise propagate into the calculations. The CPM 11 then executes the application program and sends output data generated by this program to the output module 70.
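As a rough illustration of the two input-masking techniques just described, the following Python sketch selects the middle of three analog readings and takes a two-out-of-three vote over three digital readings. The function names and sample values are hypothetical and are not drawn from U.S. Pat. No. 6,732,300.

```python
def mid_value(a, b, c):
    """Return the middle of three analog readings, masking one outlier."""
    return sorted((a, b, c))[1]


def vote_2oo3(a, b, c):
    """Return the value that at least two of three digital readings agree on."""
    return (a and b) or (b and c) or (a and c)


# One analog channel reads far too high; the middle value ignores it.
analog = mid_value(4.0, 20.0, 4.1)

# One digital channel disagrees; the majority value is used.
digital = vote_2oo3(True, False, True)
```

Both functions mask a single deviating input without needing to know which channel is wrong, which is why an undetected transient on one input does not reach the application program.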
Continuing, the output module 70 includes three identical microcontrollers 20a-c, each of which communicates with the associated central processor modules 11a-c over the corresponding buses 13a-c to receive output data from the associated CPMs 11a-c. The output module 70 further includes three identical output circuits 32a, 32b, and 32c, each of which includes a logic circuit 42 and an output voter network 54. Each output voter network 54a-c consists of multiple pairs of electronic valves 53a1-c1 and 55a1-c1 that are connected in series to each controlled point and provide to each point a corresponding output 59. The associated outputs 59 are connected together, providing a system output 65 for the corresponding load 67. Each valve 53, 55 is controlled by the associated microcontroller 20 via the corresponding logic circuit 42. Each output voter network 54 also includes current sensors 57a1-c1, each of which is connected in series with the associated pair of valves 53 and 55. The current sensors 57 generate feedback signals over lines 47 to inform the associated CPM 11 via the microcontroller 20 about the current flowing through the valves 53, 55. Each voter network 54 also includes a fault recovery valve 56 that is normally in the ON state, but can be switched to the OFF state in the event that the associated voter network fails.
The first output circuit 42a is connected to the first microcontroller 20a and to the third microcontroller 20c for producing its output 59a as a logical product of the output data generated by the CPM 11a and the CPM 11c. A second output circuit 42b is connected to the second microcontroller 20b and to the first microcontroller 20a for producing its output 59b as a logical product of the output data generated by the CPM 11b and the CPM 11a. A third output circuit 42c is connected to the third microcontroller 20c and to the second microcontroller 20b for producing its output 59c as a logical product of the output data generated by the CPM 11c and the CPM 11b. The outputs 59a, 59b, and 59c are connected together for generating the system output 65 as a logical sum of the outputs produced by the output circuits 42a, 42b, and 42c, thereby providing two-out-of-three voting among the output data produced by the CPM 11a, CPM 11b, and CPM 11c in normal system operation. Each output module 70 further includes three identical watchdog controllers (WDC) 31 that are configured to detect faults within the associated microcontrollers 20 and in the associated CPM 11. Each WDC 31 is separately connected to each output circuit 42 to activate an alarm signal on the input of each output circuit 42 in the event that the WDC 31 detects a fault within the associated microcontroller 20 or in the associated CPM 11. WDC 31a, WDC 31b, and WDC 31c activate alarm signals wa, wb, and wc, respectively.
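The statement that a logical sum of three pairwise logical products yields two-out-of-three voting can be checked exhaustively. The Python sketch below models only the Boolean relationship described above; the function names are hypothetical, and the valves, microcontrollers, and alarm signals are not modeled.

```python
from itertools import product


def output_circuits(cpm_a, cpm_b, cpm_c):
    """Return outputs 59a, 59b, 59c as pairwise logical products."""
    out_a = cpm_a and cpm_c  # circuit 42a: CPM 11a AND CPM 11c
    out_b = cpm_b and cpm_a  # circuit 42b: CPM 11b AND CPM 11a
    out_c = cpm_c and cpm_b  # circuit 42c: CPM 11c AND CPM 11b
    return out_a, out_b, out_c


def system_output(cpm_a, cpm_b, cpm_c):
    """Return system output 65 as the logical sum of the three circuit outputs."""
    return any(output_circuits(cpm_a, cpm_b, cpm_c))


def matches_majority():
    """Check all eight input combinations against a plain 2-out-of-3 majority."""
    return all(
        system_output(a, b, c) == (a + b + c >= 2)
        for a, b, c in product([False, True], repeat=3)
    )
```

Calling `matches_majority()` confirms that the wired arrangement and a majority vote agree for every combination of the three CPM outputs.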
During normal operation of the prior art system 3, shown in FIG. 1, the output module 70 provides the system output 65 as a result of two-out-of-three voting among output data produced by the three CPMs 11 if no alarm signals are activated. In the event that one CPM 11 or the associated microcontroller 20 fails, the system reverts from two-out-of-three voting to two-out-of-two voting, which the output module 70 performs using the output data received from the two healthy CPMs 11. In the event that two CPMs 11 fail concurrently, the system reverts from two-out-of-two voting to one-out-of-one voting, which the output module 70 performs using the output data received from the one CPM 11 that is still healthy.
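The degradation sequence described above (two-out-of-three, then two-out-of-two, then one-out-of-one) can be sketched in Python as follows. The function name `degraded_vote` and the list-based representation of the alarm signals are hypothetical. The behavior when all three channels raise alarms is not stated in the passage above; the sketch assumes a de-energized (False) output in that case purely for illustration.

```python
def degraded_vote(outputs, alarms):
    """Vote over the outputs of channels whose alarm signal is not active.

    `outputs` and `alarms` are three-element lists of booleans, one entry per
    CPM channel (hypothetical representation of the CPM data and WDC alarms).
    """
    healthy = [out for out, alarm in zip(outputs, alarms) if not alarm]
    if len(healthy) == 3:
        return sum(healthy) >= 2          # two-out-of-three voting
    if len(healthy) == 2:
        return healthy[0] and healthy[1]  # two-out-of-two voting
    if len(healthy) == 1:
        return healthy[0]                 # one-out-of-one voting
    return False  # ASSUMPTION: de-energize when no healthy channel remains
```

For example, `degraded_vote([True, False, True], [False, False, True])` excludes the third channel and applies two-out-of-two voting to the remaining pair, so the result is False because the two healthy channels disagree.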
With respect to faults that may occur in valves 53 and 55, the voter network 54 still provides the correct output 59 in the presence of faulty valve 53 or valve 55 or both of them if those valves are stuck in the OFF state. The microcontroller 20 monitors the status of each associated valve 53 and 55 during each scan by reading feedback data produced by the associated current sensor 57. In the event that the valves 53 and 55 in series are stuck in the ON state, the microcontroller 20 asserts a signal on the line 35 that drives the associated fault recovery valve 56 to the OFF state. As a result, the outputs 59 of the faulty voter network 54 are disconnected from the system outputs 65 to avoid system failure.
The HTR system, therefore, remains operational in the presence of any two permanent system faults. It also may continue to operate properly in the presence of certain kinds of three permanent faults. The system performs two-out-of-three voting that allows the system to remain operational upon the occurrence of a single transient fault, even if this fault is not detected. The HTR system also includes comprehensive diagnostics of all system components, allowing the system to provide correct output upon the occurrence of two transient faults. When a transient fault is detected, the HTR system disables the faulty circuit or module and switches control of the output to components that are still operating properly. The HTR then automatically restores system configuration after the disappearance of the transient faults. The HTR system, however, is not sufficiently protected against common cause failure, as the HTR will fail in the event that all three central processor modules fail at the same time due to a common cause.
As presented above, current computer systems are not capable of allowing the system to remain operational upon the occurrence of multiple faults in their components. Therefore, there is a need for a multiple redundant computer system utilizing dissimilar redundancy to provide uninterrupted system operation in the presence of multiple permanent and/or transient faults, as well as in the presence of common cause faults in the system components.