The present invention relates generally to computer systems devoted to safety-critical and critical-control applications. More particularly, the present invention relates to hybrid multiple redundant systems that combine majority voting with fault diagnostic and fault recovery means to provide correct system outputs in the presence of multiple system component faults.
The evolution of the computer has opened the door to widespread automation. Increasingly, computer systems handle critical tasks such as industrial control of oil, gas, nuclear, and chemical operations, patient monitoring, aircraft flight control, and military systems, among others. Within these systems, emergency shutdown systems are used in safety-critical applications to monitor processes and move the process to a safe state when selected process variables fall outside of a safe range. As one example, in an oil refinery, the pressure and temperature of an air compressor unit expander are monitored, and shutdown actions are taken if these reach an upset condition. In this example, the emergency shutdown system is designed to protect the process separately from the basic process control system. Critical control systems, on the other hand, provide both continuous control and protection for many safety-critical applications such as gas and steam turbines, boilers, and off-shore platforms. In a gas or steam turbine, for example, the critical control system provides non-stop speed control as well as start-up and shutdown sequencing in a single integrated system. In all of the above examples, and in other related industries, improved technologies add complexity and increase production output, making reliance on emergency shutdown systems and critical control systems increasingly important. Computer systems devoted to safety-critical and critical-control applications must have extremely high degrees of safety and reliability, since faults in such computer systems can cause vast economic losses and endanger human beings. A system failure that occurs as a result of a system component fault can be either safe or dangerous. A safe failure occurs when a system has failed into a safe state or, in other words, where the system does not disrupt the operation of other systems or compromise the safety of personnel associated with the system.
A safe failure occurs, for example, when an emergency shutdown (ESD) system fails in such a way that it causes a shutdown not associated with the controlled process. A dangerous failure is a failure that prevents the system from responding to hazardous situations, allowing hazards to develop. For instance, a dangerous failure occurs when the ESD system cannot perform a required shutdown.
Redundant configurations of computing systems have been used in research and in practical designs to provide system fault tolerance, which is concerned with the continuation of correct operation of a system despite the occurrence of internal faults. The spectrum of fault tolerance techniques can be divided into three major classes: static redundancy, dynamic redundancy, and hybrid redundancy.
Static redundancy provides fault tolerance without performing fault detection and recovery. One method of fault masking is through a voting process. Triple Modular Redundant (TMR) systems are the most common form of voting-based systems. The conventional TMR system includes three identical controllers along with an output voter network that votes the outputs of the three controllers. See, e.g., Frederickson A. A., "A Hybrid multiple redundant Programmable Controllers for Safety Systems," ISA Transactions, Vol. 29, No. 2 (1990) pp. 13-17. Each controller usually includes an input module, a main processor module, and an output module. By using three identical controllers in combination with the voter network, any single computer fault is masked by the 2-of-3 voting, so a single fault does not lead to a system failure.
In many cases, however, two concurrent faults lead to a system failure. For example, if the TMR system is used as an emergency shutdown system, its outputs will be in an ON condition under normal operation and in an OFF condition for a shutdown. If, for example, two output modules of the TMR system fail at the same time in such a way that their outputs remain in an OFF condition, then the system fails safely, causing a false shutdown. On the other hand, when two output modules fail in such a way that their outputs remain in an ON condition, the result can be a dangerous system failure. This failure is termed dangerous because, despite a process problem, the process cannot be shut down.
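The masking and failure cases described above can be illustrated with a minimal sketch. The function name `vote_2oo3` and the representation of ON and OFF as Boolean values are assumptions made for illustration only and are not taken from any cited system.

```python
def vote_2oo3(a: bool, b: bool, c: bool) -> bool:
    """Return the 2-of-3 majority of three controller outputs."""
    return (a and b) or (b and c) or (a and c)


# Convention from the text: ON (True) is normal operation,
# OFF (False) is a shutdown demand.
ON, OFF = True, False

# One controller stuck OFF while the process is healthy: the fault is masked.
single_fault = vote_2oo3(OFF, ON, ON)

# Two outputs stuck OFF while the process is healthy: false shutdown (safe failure).
safe_failure = vote_2oo3(OFF, OFF, ON)

# Two outputs stuck ON while a shutdown is demanded: the healthy OFF is
# outvoted, so the process is not shut down (dangerous failure).
dangerous_failure = vote_2oo3(ON, ON, OFF)
```

In this sketch the voted output stays ON with a single fault, drops to OFF with two stuck-OFF outputs, and stays ON with two stuck-ON outputs even though the remaining healthy controller demands a shutdown.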
To compensate for the TMR system's inability to tolerate more than one controller failure, quick fault detection must be used to minimize the period of time that the system operates in a vulnerable condition. Commercial versions of TMR offer online module replacement and repair capability to address this problem. However, if one controller of the TMR system fails and has not been replaced, a fault in another controller can lead to a safe or dangerous system failure. Thus, the success of online repair depends on the user's ability to discover and diagnose the problem in a short time period. Since fault discovery and repair rates are limited by many factors, even a single controller failure may bring the system into a vulnerable mode.
As an alternative method of compensating for this vulnerability, known devices employ an output hot spare in an attempt to overcome the problem. Such a system has two triplicate I/O modules in parallel, where one module, a primary, is active, while the other module, a hot spare, is powered but inactive. Each output module usually includes three identical legs located on a single board. Under normal operation, the hot spare module outputs are OFF so they do not affect the system output. If a fault is detected on the primary module, control is automatically switched to the hot spare module, allowing the system to maintain 2-of-3 voting continuously. The faulty module can then be removed and replaced without process interruption.
The hot spare method reduces the probability of a safe failure within a TMR system. For example, when a safe failure in any leg of the primary output module is discovered, the hot spare outputs are switched to the ON state, allowing the system to keep its outputs energized. However, employing a hot spare adds to the number of components in the system, increasing the overall system cost. As a further disadvantage, the hot spare is useless when the outputs of faulty modules remain in an ON state and, thus, cannot prevent the occurrence of a dangerous system failure.
To tolerate additional concurrent faults, known devices can add replicated computers to the voting scheme. For example, a five modular redundant (5MR) system performs three-out-of-five voting in order to tolerate two faults. Unfortunately, the 5MR system requires substantial additional resources, which significantly increase the size and weight of the system, making it very expensive to implement.
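The relation between the number of replicated computers and the number of masked faults can be shown with a short sketch. The helper names `majority_vote` and `max_masked_faults` are hypothetical, and the sketch assumes Boolean outputs and an odd number of replicas.

```python
def majority_vote(outputs):
    """Return True when a strict majority of the Boolean replica outputs is True."""
    if len(outputs) % 2 == 0:
        raise ValueError("an odd number of replicas is assumed to avoid ties")
    return sum(bool(o) for o in outputs) > len(outputs) // 2


def max_masked_faults(n: int) -> int:
    """Faults an n-replica majority voter can mask: a majority must stay correct."""
    return (n - 1) // 2


# Correct output is True; False marks a faulty replica.
tmr_one_fault = majority_vote([True, True, False])
fmr_two_faults = majority_vote([True, True, True, False, False])
fmr_three_faults = majority_vote([True, True, False, False, False])
```

Under these assumptions a three-replica voter masks one fault and a five-replica voter masks two, while a third fault in the five-replica case overturns the vote.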
Turning away from static redundancy methods, dynamic redundancy methods achieve fault tolerance by detecting the existence of faults and performing system reconfiguration to prevent a system failure. Dynamic redundancy systems have built-in fault detection capability. When a fault is detected, the system is usually reconfigured by activating a spare processor or computer. The most common example is a dual computer system that includes primary and spare computers operating in parallel. The system also includes a central diagnostic module that monitors the primary computer and switches the system output over to the spare computer when the primary computer fails. The Dual Dynamic Redundancy (DDR) system has, therefore, three independent components, namely two computers and the Central Diagnostic Module (CDM), and it tolerates any single component failure. The DDR system, however, cannot operate properly in the presence of two concurrent faults. If the two computers of the DDR system fail at the same time, the entire system fails as well.
An alternative approach to dynamic redundancy is used in industrial control systems (the 2oo2D system), as described in Goble, W. M., "Control Systems Safety Evaluation and Reliability," ISA (1998) pp. 364-375. The 2oo2D system contains two programmable controllers operating in parallel. Each controller has an independent diagnostic module that opens a special output switch, de-energizing the controller outputs in the event that the controller fails. Possible dangerous failures, therefore, are converted into safe failures. In the event that one controller fails in such a way that its outputs remain de-energized, the system will still operate via the second controller, avoiding a false shutdown.
However, the fault tolerance and reliability of the 2oo2D system depend in great part on the fault coverage, which is defined as the probability that a failure will be detected and recovered from if it occurs. In contrast to a voting-based system, the 2oo2D system has no fault masking property, so in the event that an undetectable fault occurs the system can fail as well. Like the TMR system, the 2oo2D system can tolerate only one component failure.
Another dynamic technique, disclosed in U.S. Pat. No. 5,956,474, utilizes triple modular pairs (FIG. 8), each of which includes a computer element connected to an input/output (I/O) controller, to provide system fault resilience and fault tolerance. The I/O controller in each pair periodically synchronizes and monitors the operation of the computing elements by comparing their output data and checksums in software, and disables a faulty computing element if its fault is discovered. Although this approach, for some kinds of computing failures, provides normal system operation when all but one of the computing elements fail, the I/O controller is not fault tolerant. Consequently, the system may produce false outputs if two I/O controllers fail simultaneously. Therefore, this system can tolerate only a single component failure.
Recognizing the shortcomings of static and dynamic systems, hybrid redundant systems, or HTMR systems, combine the attractive features of both static and dynamic approaches. Fault masking is used to prevent the system from producing erroneous outputs, and fault detection and system reconfiguration are used in case of a fault. A conventional hybrid configuration includes a triple modular redundant system with an output voting circuit and a plurality of stand-by computers. The system further includes fault detecting and switching circuits that locate the faulty computer among the active computers and isolate the faulty computer from the triple modular redundant system. In this way, the faulty computer is replaced by one of the stand-by computers. See, e.g., Pradhan, D. K., "Fault-Tolerant Computer System Design," Prentice Hall PTR 1996, pp. 19-21. Since the conventional hybrid configuration requires a large number of additional modules, it is very expensive to implement. Another drawback of the conventional hybrid system is that it cannot tolerate additional faults within the fault detecting, switching, or voting circuits.
Another hybrid redundant system, disclosed in U.S. Pat. No. 5,084,878, utilizes multiple computer subsystems, each of which includes self-diagnostic and cross-diagnostic means at the processor level. Results of the diagnostic means, together with outputs of the subsystems, are connected to a switch matrix that selects the correct output of each subsystem and passes these outputs to a majority voter. This system provides fault tolerance with respect to detectable subsystem faults and, as a result of its diagnostic means, is able to operate normally even when only one subsystem remains healthy. Unfortunately, this system is still susceptible to a fault occurring in the switch matrix or the majority voter. In particular, the system uses a common output selecting circuit consisting of the switch matrix and the output voter. Thus, it may fail when either the switch matrix or the majority voter fails. The system reliability, therefore, is greatly dependent on the switch matrix and voter reliability. Moreover, use of the switch matrix limits the practical number of inputs that this system is capable of receiving. To be used in critical control applications, such as industrial control or emergency shutdown systems, the system should operate with many hundreds, if not thousands, of inputs and outputs. To accommodate these inputs and outputs, the complexity of the switch matrix and output voter grows rapidly. Consequently, the reliability of each of these components eventually dominates the reliability of the system.
In many critical-control and critical-computation applications, at least two faults must be tolerated to provide the required reliability. Therefore, there is a need for a hybrid multiple redundant computer system that remains operational in the presence of two concurrent faults occurring in any system components.
An object of the present invention is, therefore, to provide a hybrid multiple redundant computer system, which comprises at least three redundant processing units that combine majority voting with fault diagnostic and fault recovery means configured such that any two concurrent faults occurring in any two system components will not lead to a system failure.
It is another object of the present invention to provide a hybrid multiple redundant computer system comprising three redundant processing units that is able to operate normally in the presence of multiple component faults and remain operational if at least one processing unit is still healthy.
It is another object of the invention to provide a hybrid multiple redundant system comprising three redundant processing units that is able to fail safely in the event that three processing units concurrently fail.
It is another object of the invention to provide a hybrid multiple redundant system comprising three redundant processing units that is modified from the 2-of-3 voting configuration to the 2-of-2 configuration in the presence of one central processor module (CPM) fault and to the 1-of-1 configuration in the presence of two CPM faults.
It is another object of the invention to provide a hybrid multiple redundant system comprising three redundant processing units that is able to fail safely in the event that three central processor modules concurrently fail.
It is another object of the invention to provide a hybrid multiple redundant system comprising three processing units that remains operational in the event that two out of three input modules concurrently fail.
It is another object of the invention to provide a hybrid multiple redundant system comprising three processing units that remains operational in the event that two out of three output modules concurrently fail.
It is another object of the invention to provide a hybrid multiple redundant system comprising three processing units that is able to fail safely in the event that three output modules concurrently fail.
The present invention generally provides a hybrid multiple redundant computer system having three data processing units, operating in parallel, that combine majority voting with fault diagnostic and fault recovery means configured such that any two concurrent faults occurring in any two system components will not lead to a system failure. Briefly, and in general terms, the invention comprises first, second, and third processing units operating in parallel, each of which includes a central processor module connected to an input module and an output module. The central processor module receives input data from the associated input module and uses this data as input to a control program that provides output data for two output modules, as follows: the central processor module associated with the first processing unit transmits output data to the associated output module and to the output module associated with the second processing unit; the central processor module associated with the second processing unit transmits output data to the associated output module and to the output module associated with the third processing unit; and the central processor module associated with the third processing unit transmits output data to the associated output module and to the output module associated with the first processing unit.
The input modules of the system are connected in parallel, and each input module receives the same input data from multiple field sensors and other devices providing system inputs. The output modules of the system operate in parallel with each other, providing system outputs that correspond to the system inputs. The system of the invention also includes a watchdog controller in each processing unit for detecting the occurrence of a fault within the central processor module and transferring an alarm signal to each output module in the event that this central processor module fails.
The output module, having no single point of failure, produces its output as a logical product of output data received from the associated central processor module and from one of the other central processor modules when no alarm signal is active in any processing unit. It disables its output if an alarm signal is received from the associated watchdog controller. The output module generates its output using only the output data received from the associated central processor module if at least one of the two alarm signals produced by the neighbor watchdog controllers is activated. The output module in the first processing unit, for example, produces its output as a logical product of output data received from the first central processor module and from the central processor module associated with the third processing unit if the alarm signal in each processing unit is not activated. It uses only the output data of the first central processor module if at least one of the two alarm signals produced by the watchdog controllers associated with the second and third processing units is activated, and it disables its output if the alarm signal received from the watchdog controller associated with the first processing unit is activated.
In normal system operation, when alarm signals are not active, each output module produces its output as a logical product of output data received from two central processor modules. The system output is formed as the sum of logical products AC, AB, and BC, where A, B, and C are the output data generated by the central processor modules associated with the first, second, and third processing units, respectively. In normal operation, the system, therefore, performs a two-out-of-three vote among the output data produced by the three central processor modules. This two-out-of-three voting increases system fault tolerance by masking transient faults that may be left undetected. This allows the system to produce correct outputs despite an undetected failure caused by a central processor module fault.
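That the sum of the logical products AC, AB, and BC is equivalent to a two-out-of-three vote can be checked exhaustively. The following sketch assumes Boolean output data and models the combination of the three output module outputs as a logical OR; the function names are hypothetical.

```python
from itertools import product


def system_output(a: bool, b: bool, c: bool) -> bool:
    """Sum of logical products AC + AB + BC formed by the three output modules."""
    module_1 = a and c  # first output module: own data A and data C
    module_2 = b and a  # second output module: own data B and data A
    module_3 = c and b  # third output module: own data C and data B
    return module_1 or module_2 or module_3


def two_out_of_three(a: bool, b: bool, c: bool) -> bool:
    """Reference definition: at least two of the three inputs are True."""
    return (a + b + c) >= 2


# Compare both definitions over all eight input combinations.
mismatches = [
    (a, b, c)
    for a, b, c in product([False, True], repeat=3)
    if system_output(a, b, c) != two_out_of_three(a, b, c)
]
```

The list `mismatches` is empty, so a single erroneous value among A, B, and C never changes the system output in this model.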
In the event that a central processor module fails and its fault is discovered, the associated watchdog controller activates the alarm signal that disables the outputs of the associated output module. This alarm signal also instructs each of the other output modules to generate output corresponding only to the output data produced by its associated central processor module. In the presence of one faulty central processor module, this allows the system to reconfigure from the triple modular configuration to the two-out-of-two (2oo2D) dual redundant configuration. In the presence of two faulty central processor modules, the outputs of both associated output modules are disabled, and the system still operates with one healthy central processor module. The system, therefore, still operates properly in the presence of faults in any two central processor modules. Further, the system provides means for setting all system outputs to a predetermined safe condition in the event that every central processor module fails.
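The reconfiguration behavior described above can be sketched as follows. This is an illustrative model only: the function names are hypothetical, output data and alarm signals are assumed to be Boolean, the combined system output is assumed to be the logical OR of the three output module outputs, and the predetermined safe condition is assumed to be the de-energized (False) state.

```python
def output_module(own_data, partner_data, own_alarm, neighbor_alarms):
    """Output of one output module for a single output point."""
    if own_alarm:
        return False  # own CPM is faulty: output disabled
    if any(neighbor_alarms):
        return own_data  # a neighbor CPM is faulty: use own data only
    return own_data and partner_data  # normal operation: logical product


def system_output(data, alarms):
    """data = (A, B, C) from the three CPMs; alarms = three watchdog alarm signals."""
    partner = {0: 2, 1: 0, 2: 1}  # module 1 also receives C, module 2 A, module 3 B
    modules = []
    for i in range(3):
        neighbors = [alarms[j] for j in range(3) if j != i]
        modules.append(
            output_module(data[i], data[partner[i]], alarms[i], neighbors)
        )
    return any(modules)


no_alarm = (False, False, False)
masked = system_output((False, True, True), no_alarm)               # 2-of-3 vote
dual = system_output((True, False, False), (True, False, False))    # CPM 1 ignored
single = system_output((True, True, False), (True, True, False))    # only CPM 3 used
fail_safe = system_output((True, True, True), (True, True, True))   # all disabled
```

In this model an undetected wrong value from one central processor module is masked, a detected fault removes the faulty module's data from the result, and the output is forced to the de-energized state when all three alarm signals are active.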
Each output module of the invention also provides fault tolerance capability that allows the system to generate correct outputs in the presence of up to two faults within the output modules. The output module in each processing unit comprises a primary output circuit, a secondary output circuit, and an output voter network. In each output module, the primary and secondary output circuits receive output data from the associated central processor module and from one designated other central processor module, respectively. The output voter network is connected to the outputs of the associated primary and secondary output circuits for producing a logical product of these outputs on the output of this output module. Hence, the output voter network generates this output as a logical product of output data received from two corresponding central processor modules. The outputs of the three output voter networks are connected together to provide two-out-of-three voting among the output data of the three central processor modules.
In each output module, the primary output circuit is further connected to the associated watchdog controller for receiving an alarm signal only from this watchdog controller. In contrast, the secondary output circuit in the same output module is connected to the two other watchdog controllers and receives alarm signals from both of them. Primary and secondary output circuits perform selectable but different logical functions with output data received from the corresponding central processor modules and alarm signals.
The output voter network comprises multiple parallel operating pairs of first and second electronic valves connected in series, such that each pair is connected on one side to an external power supply and on the other side to the output of the associated output module. In normal operation, the primary output circuit in each output module controls the first electronic valves, while the secondary output circuit in the same output module controls the second electronic valves. The primary output circuit sets each first electronic valve ON or OFF in accordance with the output data of the associated central processor module if the alarm signal produced by the associated watchdog controller is not activated. In the event that the associated central processor module fails and its associated alarm signal is activated, the primary output circuit disables all outputs of the associated output voter network by setting all first electronic valves OFF. At the same time, the secondary output circuit in each other output module also receives the alarm signal produced by the watchdog controller associated with the faulty central processor module. The secondary output circuit sets each second electronic valve ON or OFF in accordance with the output data received from the corresponding central processor module if no alarm signals are activated. In the event that any alarm signal is activated, the secondary output circuit transfers control over the output voter network to the primary output circuit in the same output module. Hence, if one central processor module fails, each output module that is not associated with the faulty central processor module will produce its output using only the output data received from its associated central processor module. In the presence of one faulty central processor module, the system, therefore, is reconfigured from the triple modular configuration to the two-out-of-two (2oo2D) dual redundant configuration.
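The interaction of the primary output circuit, the secondary output circuit, and the series valve pairs can be modeled at the level of individual output points. The sketch below is a simplified model under stated assumptions: the function names are hypothetical, each valve is represented by a Boolean (True for ON), and the transfer of control to the primary output circuit is modeled as holding every second valve ON, which the text does not specify.

```python
def primary_circuit(own_data, own_alarm):
    """First-valve commands: follow own CPM data unless the own alarm is active."""
    if own_alarm:
        return [False] * len(own_data)  # disable every output of this module
    return list(own_data)


def secondary_circuit(partner_data, neighbor_alarms):
    """Second-valve commands.

    Assumption: when any neighbor alarm is active, control is handed to the
    primary circuit by holding all second valves ON.
    """
    if any(neighbor_alarms):
        return [True] * len(partner_data)
    return list(partner_data)


def voter_network(first_valves, second_valves):
    """A series pair conducts only when both of its valves are ON."""
    return [f and s for f, s in zip(first_valves, second_valves)]


own = [True, False, True]      # output data from the associated CPM
partner = [True, True, False]  # output data from the other CPM

normal = voter_network(primary_circuit(own, False),
                       secondary_circuit(partner, [False, False]))
neighbor_fault = voter_network(primary_circuit(own, False),
                               secondary_circuit(partner, [True, False]))
own_fault = voter_network(primary_circuit(own, True),
                          secondary_circuit(partner, [False, False]))
```

With no alarms the module output is the point-by-point logical product of the two data sets, with a neighbor alarm it follows the associated central processor module alone, and with the own alarm active every output is OFF.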
Each output module further includes a fault recovery valve connected on one side to the external power supply and on the other side to each pair of electronic valves at the point where these pairs are connected together. The fault recovery valve is driven by the associated watchdog controller for disabling all outputs of the associated output module in the event that the associated primary output circuit, the associated output voter network, or both fail. The three central processor modules periodically communicate with each other for on-line testing of the primary output circuits, secondary output circuits, and output voter networks at the same time. Each central processor module has means for reading the status of the associated primary output circuit and of the secondary output circuit in the same output module. In the event that either the primary output circuit or both the primary and secondary output circuits in the same output module fail, the associated central processor module commands the associated watchdog controller to set the associated fault recovery valve OFF via the alarm signal, disconnecting each output of the associated output module from the system outputs. In that case, the system will operate with two healthy output modules, again changing to the 2oo2D dual redundant configuration. In the event that two primary output circuits fail concurrently, the associated output modules are disabled via the associated fault recovery valves, but the system continues to operate with one healthy output module. In general, the system remains operational in the presence of any two faulty primary or secondary output circuits, as well as in the presence of concurrent faults in any pair of primary and secondary output circuits. In the event that each primary or each secondary output circuit fails, the system provides fail-safe outputs because all fault recovery valves will be OFF.
All electronic valves in the output voter network are tested periodically by checking the ability of each valve to pass from ON to OFF and back according to the output data of the central processor module. Each output voter network further includes a current sensor in each pair of first and second electronic valves to monitor valve status. This current sensor is connected in series with the first and second electronic valves and is connected to the primary output circuit, producing feedback data that inform the associated central processor module about the current flowing through each pair of valves. Each central processor module reads the feedback data from the associated primary output circuit and compares these feedback data with the most recent output data. If two electronic valves in different output modules concurrently fail open, the corresponding outputs in these faulty output modules will be disabled, but the system will still operate properly with one healthy output module. In the event that any two electronic valves in series fail short, the respective central processor module will command the associated watchdog controller to activate the alarm signal, setting the associated fault recovery valve OFF and disconnecting each output of the associated output module from the system outputs. In that case, the system will operate with two healthy output modules, changing to the 2oo2D dual redundant configuration. Therefore, the presence of any two faulty electronic valves will not lead to a system failure. Generally, the system is able to remain operational in the presence of any two component faults until these faults are detected.
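The comparison of feedback data with the most recent output data can be sketched as a simple classification per valve pair. The function names and the string labels below are hypothetical, and the sketch assumes that the current sensor feedback is available as one Boolean per pair (True when current flows).

```python
def diagnose_pair(commanded_on: bool, current_flowing: bool) -> str:
    """Classify one series valve pair from its command and its sensor feedback."""
    if commanded_on == current_flowing:
        return "ok"
    # Current without a command means the pair conducts when it should not.
    return "fail_short" if current_flowing else "fail_open"


def needs_fault_recovery(commanded, feedback) -> bool:
    """True when any pair conducts while commanded OFF.

    In that case the CPM would request the alarm signal so that the
    fault recovery valve is set OFF.
    """
    return any(
        diagnose_pair(c, f) == "fail_short" for c, f in zip(commanded, feedback)
    )


commanded = [True, False, True]
healthy = needs_fault_recovery(commanded, [True, False, True])
shorted = needs_fault_recovery(commanded, [True, True, True])
opened = [diagnose_pair(c, f) for c, f in zip(commanded, [False, False, True])]
```

In this model a pair that fails open only loses its own output, whereas a pair that fails short requires the fault recovery valve, because neither series valve can interrupt the current any longer.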
The present invention further provides a system, as described above, where each processing unit further includes an OR-gate, the first input of which is connected to an additional output of the associated central processor module. The second input of the OR-gate is connected to the output of the associated watchdog controller. This embodiment allows the alarm signal on the output of the OR-gate to be activated if either the associated watchdog controller or the associated central processor module produces an alarm signal. The central processor module in this embodiment can produce an alarm signal upon detecting a fault within the associated output module. Further, the central processor module may produce an alarm signal when the associated watchdog controller fails.
Higher levels of system fault tolerance substantially decrease the probability of both safe and dangerous system failures, helping to prevent lost production due to false shutdowns and providing much higher protection of personnel and equipment. Since many of the applications involve processes that are very expensive to shut down and start up, computer systems based upon the architecture of the invention will provide a great economic benefit. More importantly, emergency shutdown systems based on the invention reduce the likelihood of disabling injuries in critical industrial applications, such as in the chemical and oil-refinery industries.