The present invention relates generally to computer systems devoted to safety-critical and critical-control applications. More particularly, the present invention relates to hybrid multiple redundant systems that combine majority voting with fault diagnostic and fault recovering means to provide correct outputs of a system in the presence of multiple system component faults.
Real time data acquisition and control systems often operate in mission critical applications where the computations are critical to human safety, environmental cleanliness, or equipment protection. Examples include industrial controllers, high-speed trains, nuclear power plants, military systems, and hospitals. Computing systems devoted to such applications must provide fault tolerance since faulty computations in these systems can cause the loss of human life and/or expensive equipment. Redundant configuration of computing systems has been used in several research and design projects to provide system fault tolerance, which is the ability of a system to continue to perform its task after the occurrence of faults. A system failure that occurs as a result of a system component fault can be either safe or dangerous. A safe failure occurs when a system has failed into a safe state, or in other words, where the system does not disrupt the operation of other systems or compromise the safety of personnel associated with the system. The safe failure occurs, for example, when an emergency shutdown system (ESD) fails in such a way that it causes a shutdown not associated with the controlled process. A dangerous failure is a failure that prevents the system from responding to hazardous situations, allowing hazards to develop. For instance, a dangerous failure occurs when the ESD cannot perform a required shutdown.
Most deployed critical control systems are based on either triple modular redundant (TMR) or dual redundant (DR) architecture to achieve fault tolerance and increase safety and reliability. Each of these systems, however, typically tolerates the fault of only one system resource. If, for example, the TMR system is used as an Emergency Shutdown system, its outputs will be in an ON condition under normal operation and in an OFF state for a shutdown. If, for instance, two output modules of the TMR fail at the same time, in such a way that their outputs remain in an OFF condition, then the system fails safely, making a false shutdown. On the other hand, when two output modules fail in such a way that their outputs remain in an ON state, it can lead to a dangerous system failure. This failure is termed dangerous because, despite a process problem, the process cannot shut down.
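By way of illustration only, the two failure modes described above can be sketched in a few lines of Python. The sketch is a simplification and not part of any disclosed system: it treats each of the three output modules as a single boolean leg (True for ON, False for OFF), and the names `vote_2oo3`, `tmr_output`, and `stuck` are hypothetical.

```python
from typing import Optional, Sequence


def vote_2oo3(a: bool, b: bool, c: bool) -> bool:
    """Majority (2-of-3) vote over three boolean legs."""
    return (a and b) or (b and c) or (a and c)


def tmr_output(commanded: bool, stuck: Sequence[Optional[bool]]) -> bool:
    """Voted output of a simplified TMR output stage.

    commanded: the output every healthy leg drives (True = ON, False = OFF).
    stuck:     per-leg fault state; None means healthy, True or False means
               the leg is stuck ON or stuck OFF (an assumed fault model).
    """
    legs = [commanded if s is None else s for s in stuck]
    return vote_2oo3(legs[0], legs[1], legs[2])


# One stuck-OFF leg is tolerated: the output stays ON.
single_fault = tmr_output(True, [False, None, None])
# Two stuck-OFF legs force the output OFF: a false shutdown (safe failure).
false_trip = tmr_output(True, [False, False, None])
# Two stuck-ON legs keep the output ON although OFF is demanded:
# the shutdown cannot be performed (dangerous failure).
failed_to_trip = tmr_output(False, [True, True, None])
```

Under this model a single fault is masked by the vote, while the second concurrent fault decides whether the system fails safely or dangerously.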
To compensate for the TMR's inability to tolerate more than one controller failure, quick fault detection must be used to minimize the period of time that the system operates in a vulnerable condition. Commercial versions of TMR offer online module replacement and repair capability to address this problem. However, if one controller of the TMR fails and has not been replaced, the next controller fault can lead to a safe or dangerous system failure. Thus, the success of online repair depends on the user's ability to discover and diagnose the problem in a short time period. Since the fault discovery and repair rates are limited by many factors, even a single controller failure may bring the system into a vulnerable mode.
As an alternative way of compensating for this vulnerability, known devices employ an output hot spare. Such a system has two triplicated I/O modules in parallel, where one module, the primary, is active, while the other module, the hot spare, is powered but inactive. Each output module usually includes three identical legs located on a single board. Under normal operation, the hot spare module outputs are OFF so that they do not affect the system output. If a fault is detected on the primary module, control is automatically switched to the hot spare module, allowing the system to maintain 2-of-3 voting continuously. The faulty module can then be removed and replaced without process interruption.
The hot spare method reduces the probability of a safe failure within a TMR system. For example, when a safe failure in any leg of the primary output module is discovered, the hot spare outputs are switched to the ON state, allowing the system to keep its outputs energized. However, employing a hot spare adds to the number of components in the system, increasing the overall system cost. As a further disadvantage, the hot spare is useless when the outputs of faulty modules remain in an ON state and thus cannot prevent the occurrence of a dangerous system failure.
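The asymmetry of the hot spare approach can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not a description of any commercial product: each module is reduced to a single boolean output, the primary and the spare are assumed to be wired in parallel so that the system output is ON if either module drives it ON, and the name `hot_spare_output` and its parameters are hypothetical.

```python
from typing import Optional


def hot_spare_output(commanded: bool,
                     primary_stuck: Optional[bool],
                     fault_detected: bool) -> bool:
    """System output of a simplified primary / hot spare pair.

    commanded:      output demanded by the controllers (True = ON).
    primary_stuck:  None if the primary is healthy, otherwise the value
                    its output is stuck at (assumed fault model).
    fault_detected: True once diagnostics have flagged the primary and
                    control has been switched to the hot spare.
    """
    if primary_stuck is not None:
        primary_out = primary_stuck      # a stuck output ignores all commands
    elif fault_detected:
        primary_out = False              # healthy but deactivated primary
    else:
        primary_out = commanded
    # The spare stays OFF until switchover, then follows the command.
    spare_out = commanded if fault_detected else False
    # Parallel connection: either module can hold the output ON.
    return primary_out or spare_out


# Primary stuck OFF and detected: the spare keeps the output energized.
safe_failure_avoided = hot_spare_output(True, False, True)
# Primary stuck ON: the spare cannot force the output OFF on demand.
dangerous_failure = hot_spare_output(False, True, True)
```

In this model the spare can only add an ON contribution to the parallel output, which is why it helps against stuck-OFF faults but not against stuck-ON faults.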
In many safety-critical and critical-control applications, where two or more faults must be tolerated, TMR and DR systems are unfortunately not acceptable. The Hybrid Multiple Redundant Computer (HMRC) system (FIG. 1), disclosed in copending patent application Ser. No. 09/506,849 dated Feb. 19, 2000, which is incorporated by reference herein, remains operational in the presence of two concurrent faults until they are detected. The HMRC system 10 contains three parallel operating processing units 12, each of which comprises an input module 14, a central processor module 16, and an output module 50. The central processor module 16 is connected to the associated input module 14 and to primary and secondary output circuits 18, 20 located in the associated output module 50 and in the neighboring output module 50 respectively. Each processing unit 12 further includes a watchdog controller 30 that monitors the associated central processor module 16 and transfers an alarm signal 44 to each output module 50 in the event of a central processor module 16 failure. The primary and secondary output circuits 18, 20 in each output module 50 control an output voter network 22 and perform selectable but different logical functions among the output data of the respective central processor modules 16 and the alarm signals 44. If the alarm signals 44 are not activated, the system generates an output 180 using a two-of-three (2-of-3) vote among the output data produced by the three central processor modules 16. In the event that one or two central processor modules 16 fail, the system is reconfigured to a two-of-two (2-of-2) or to a one-of-one (1-of-1) vote configuration respectively. Each central processor module 16 in turn monitors the status of the output modules and disables the outputs of an output module 50 in the event that this module 50 fails. In general, the HMRC system remains operational in the face of as many as two component faults.
The HMRC system utilizes three alarm signals for each output module. It reconfigures the system outputs from the 2-of-3 vote to the 2-of-2 and to the 1-of-1 vote in the presence of one or two faulty output modules respectively. If the HMRC system includes more than one set of triplicated output modules, the system may use the same set of three alarm signals for all of the triplicated output modules. In this case, however, a fault occurring in any one output module will lead to an undesirable reconfiguration of outputs in every set of output modules, even though those modules are still healthy. To overcome this problem, the system should be supplied with different alarm signals for each set of triplicated output modules. The system should also have an associated means for activating only those alarm signals that are associated with the faulty output modules. However, the use of additional alarm signals requires additional hardware and additional wiring, which increases the overall system cost. This disadvantage becomes especially significant if the system includes a large number of remote output modules.
Another drawback of the HMRC system is that each central processor module (CPM) is connected to two output modules and transfers the same output data to each of them in turn. This decreases the throughput of the system, since the CPM spends twice as much time on output data transfer.
An object of the present invention is, therefore, to provide an improved hybrid redundant computer system that does not have the shortcomings of the existing redundant systems and is able to tolerate up to two faults. The system of the invention is called the Hybrid Triple Redundant Computer (HTRC) system.
In view of this object, the present invention generally provides a hybrid multiple redundant computer system including:
an input module including a first, a second, and a third input circuit operating in parallel;
a first, a second, and a third central processor module operating in parallel, each of which is connected to the associated input circuit of said input module for receiving input data from said input module and for using the input data as input to a control program to provide output data by execution of said control program;
an output module including a first, a second, and a third microcontroller for receiving said output data from the first, the second, and the third central processor module respectively;
each central processor module further connected to the associated microcontroller of said output module for transferring said output data to said output module;
the output module further including a first, a second, and a third output circuit that are connected to said microcontrollers in such a manner that the first output circuit is connected to the first and to the third microcontroller for receiving said output data from the first and from the third central processor module, the second output circuit is connected to the second and to the first microcontroller for receiving said output data from the second and from the first central processor module, and the third output circuit is connected to the third and to the second microcontroller for receiving said output data from the third and from the second central processor module;
the output module further comprising a first, a second, and a third watchdog controller, each of which is connected to the associated microcontroller for detecting the occurrence of a fault within said microcontroller as well as within the associated central processor module and for activating an alarm signal in the event that said microcontroller or said central processor module fails;
each output circuit further connected to the associated watchdog controller and to the neighbor watchdog controllers for receiving said alarm signal from any of said watchdog controllers;
means in each output circuit for providing its output as a logical product of output data received from two associated central processor modules, said output circuits connected to each other for generating the system output as a logical sum of the outputs produced by said output circuits to provide a two-out-of-three vote among the output data produced by the three central processor modules;
means in each output circuit for producing the output of said output circuit as a logical product of output data received from the associated central processor module and from the neighbor central processor module if said alarm signal in each watchdog controller is not activated, for generating said output by only using the output data received from the associated central processor module if at least one out of two alarm signals produced by the neighbor watchdog controllers is activated, and for disabling said output if the alarm signal received from the associated watchdog controller is activated, thereby allowing the system to reconfigure from a two-out-of-three voting configuration to a two-out-of-two voting configuration in the event that the associated central processor module fails, to a one-out-of-one voting configuration in the event that the associated and any neighbor central processor modules concurrently fail, and to the predetermined safe output condition in the event that each central processor module fails;
wherein said means in the first output circuit produces its output as a logical product of output data received by said output circuit from the first central processor module and from the third central processor module if said alarm signal in each watchdog controller is not activated, generates said output by only using the output data received from the first central processor module if at least one out of two alarm signals associated with the second and third watchdog controllers is activated, and disables the output of said output circuit if the alarm signal associated with the first watchdog controller is activated;
wherein said means in the second output circuit produces its output as a logical product of output data received by said output circuit from the second central processor module and from the first central processor module if said alarm signal in each watchdog controller is not activated, generates said output by only using the output data received from the second central processor module if at least one out of two alarm signals associated with the first and third watchdog controllers is activated, and disables the output of said output circuit if the alarm signal associated with the second watchdog controller is activated;
wherein said means in the third output circuit produces its output as a logical product of output data received by said output circuit from the third central processor module and from the second central processor module if said alarm signal in each watchdog controller is not activated, generates said output by only using the output data received from the third central processor module if at least one out of two alarm signals associated with the first and second watchdog controllers is activated, and disables the output of said output circuit if the alarm signal associated with the third watchdog controller is activated;
means in each microcontroller for reading the status of the associated output circuit and disabling the output of said output circuit if a fault of said output circuit is discovered;
means in each central processor module for reading the status of the associated output circuit via the associated microcontroller and disabling the output of said output circuit via the associated microcontroller if a fault of said output circuit is discovered;
means in each central processor module for reading the status of the associated input circuit and disabling the output data of said input circuit if a fault of said input circuit is discovered.
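By way of illustration only, the output logic recited above can be expressed as a short Python sketch. It models just the three output circuits and the logical sum of their outputs. It rests on assumptions that are not stated in the text: indices 0, 1, and 2 stand for the first, second, and third channels, a disabled output is represented as False (that is, the predetermined safe output condition is taken to be the de-energized state), and the names `NEIGHBOR`, `output_circuit`, and `system_output` are hypothetical.

```python
from typing import Sequence

# Neighbor central processor module whose data each output circuit combines
# with the data of its own module: first<-third, second<-first, third<-second.
NEIGHBOR = {0: 2, 1: 0, 2: 1}


def output_circuit(i: int, data: Sequence[bool], alarm: Sequence[bool]) -> bool:
    """Output of circuit i given per-channel output data and alarm signals."""
    if alarm[i]:
        # Own watchdog alarm active: the output is disabled (assumed OFF).
        return False
    if any(alarm[j] for j in range(3) if j != i):
        # A neighbor watchdog alarm is active: use only the associated data.
        return data[i]
    # No alarm active: logical product of associated and neighbor data.
    return data[i] and data[NEIGHBOR[i]]


def system_output(data: Sequence[bool], alarm: Sequence[bool]) -> bool:
    """Logical sum of the three output circuit outputs."""
    return any(output_circuit(i, data, alarm) for i in range(3))


no_alarm = [False, False, False]
# No alarms: the sum of pairwise products is a two-out-of-three vote.
majority_on = system_output([True, True, False], no_alarm)
minority_on = system_output([True, False, False], no_alarm)
# First channel failed: its data is ignored, the other two circuits remain.
one_failed = system_output([True, False, False], [True, False, False])
# First and second channels failed: only the third circuit drives the output.
two_failed = system_output([True, True, False], [True, True, False])
# All channels failed: every circuit is disabled.
all_failed = system_output([True, True, True], [True, True, True])
```

In this sketch the data of a channel whose alarm is active never reaches the system output, and with all three alarms active the output falls back to the assumed OFF state.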