Computer systems used in critical applications, such as safety systems and process control systems, are susceptible to system failures. In some circumstances, failures of these critical systems may expose entities to a potentially fatal event, as well as to significant economic loss. Such safety-critical control systems are utilized, for example, in high-integrity pressure protection pipeline systems, emergency stop systems such as those utilized on drilling platforms, nuclear control systems, oil refinery safety and control systems, boiler control systems, turbo-machinery control systems, and off-shore fire and gas protection systems. To avoid failure, such critical systems monitor various operational processes, such that if a selected value associated with a particular process exceeds a predetermined threshold that is indicative of a dangerous operational state, the system takes the necessary action to avoid a complete failure, such as by halting the process or placing the process in a “safe” state. However, in some circumstances, a critical system may perform a “safe” failure of a process, whereby the system mistakenly performs a shutdown when a shutdown is, in fact, not required. Unplanned shutdowns resulting from such “safe” failures require a subsequent restart of the critical process, which leads to undesirable losses of production and time. Conversely, if the monitoring systems fail to identify the hazardous or dangerous system parameters or conditions of a critical computing system, a dangerous system failure may occur, which may result in the loss of human life or substantial damage to the operating components or machinery controlled by the process.
In order to avoid the failure of critical computer systems that are responsible for controlling these critical processes, various standards or protocols are utilized to allow such critical computer systems to achieve high levels of fault tolerance. For example, such critical computer systems may utilize safety integrity level 4 (SIL 4) fault tolerance, as is provided by the IEC 61508 and IEC 61511 standards. In addition, such critical computer systems may utilize the Planar 4 system. Planar 4 is based on a hard-wired modular electronic circuit, which incorporates fail-safe logic that is built into each circuit. The Planar 4 system is certified in accordance with IEC 61508 to a SIL of 3/4. Current fault-tolerant systems, such as that provided by Planar 4, utilize a hard-wired computing architecture, which cannot be easily changed or adapted for use in different applications or processes where fault-tolerant control is desired.
As a further example, U.S. Pat. No. 7,877,627 describes a computing system that withstands multiple failures, while still maintaining safety. This system includes three primary processor modules that operate in parallel on a cyclical basis, and three redundant processor modules that also operate cyclically in parallel. First, second, and third primary processor modules are respectively connected to associated first, second, and third primary input modules to receive input data therefrom and to use this data as an input for an application program that is executed by each primary processor module. First, second, and third redundant processor modules are respectively connected to associated first, second, and third redundant input modules to receive input data therefrom and to use this data as an input for an application program that is executed by each redundant processor module.
The system further includes first, second, and third output modules or circuits, each of which may comprise any suitable output interface electronics that enables the output of data therefrom. Each output module houses a first and a second interface for receiving output data from the associated primary and redundant processor modules, respectively. Each primary processor module (PPM) is connected to the associated redundant processor module (RPM) and sends a command to the RPM in order to initiate the execution of one instance of the application program at the same time that the PPM begins execution of another instance of the application program. The PPM and the RPM, therefore, execute the application program synchronously. Each output module receives output data from both the associated PPM and RPM close in time during each cycle of the system operation. During normal system operation, the output data produced by the associated PPM and RPM are equal, and the output module uses the output data that is received from the PPM. In the event that the PPM fails permanently, the associated output module uses the output data produced by the RPM. Each output module compares the output data that is received from the PPM and the associated RPM to identify whether a disparity exists among the output data for each controlled point. In the event that a disparity between the output data produced by the PPM and the RPM is discovered for some controlled points in a process, one of the PPM or the associated RPM is deemed to have produced erroneous output data as the result of a transient fault in the fault-tolerant computer system. In that event, the output module disables its own output data for the controlled points where a disparity has been identified.
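The per-point selection rule described above may be illustrated by the following minimal sketch. It is offered only as an aid to understanding and is not taken from the '627 patent; the function name `select_outputs`, the `ppm_failed` flag, and the use of `None` to mark a disabled controlled point are all assumptions introduced for illustration.

```python
from typing import List, Optional, Sequence


def select_outputs(
    ppm_data: Sequence[bool],
    rpm_data: Sequence[bool],
    ppm_failed: bool = False,
) -> List[Optional[bool]]:
    """Return one value per controlled point; None marks a disabled point.

    Hypothetical model of a single output module:
    - if the PPM has failed permanently, use the RPM data;
    - otherwise use the PPM data where PPM and RPM agree;
    - where they disagree, disable the module's own output for that point.
    """
    if len(ppm_data) != len(rpm_data):
        raise ValueError("PPM and RPM must report the same controlled points")
    if ppm_failed:
        return list(rpm_data)
    return [p if p == r else None for p, r in zip(ppm_data, rpm_data)]


if __name__ == "__main__":
    # The second controlled point disagrees, so only that point is disabled.
    print(select_outputs([True, False, True], [True, True, True]))
```

In this sketch a disparity disables only the affected controlled point in the one output module, which leaves the remaining output modules free to determine the system output for that point.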
The output modules communicate with one another during each cycle of the computing system operation in order to receive the output data of the neighboring output modules. During normal system operation, each output module has its own output data, and each output module calculates a logical sum of the output data that it receives from the neighboring output modules. Each output module further includes a voting network that receives output data directly from that output module, together with the output data that the output module received from the neighboring output modules. Each voting network includes three electronic switches, such as transistors, that are connected in series. The three voting networks are connected in parallel, and each is controlled by output data that the associated output module produces based on a first output, and by output data that is the aforementioned logical sum. Such a configuration provides a system output that is the result of 2-of-3 majority voting among the output data that the output modules have received from the associated primary processor module (PPM) or from the redundant processor module (RPM).
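The following sketch illustrates, under simplifying assumptions, why such an arrangement yields 2-of-3 majority voting. It assumes that each voting network conducts only when its own output data and the logical sum (OR) of the two neighboring outputs are both asserted, and that the three networks are combined in parallel (OR). The function names are hypothetical, and the model reduces the three series switches of each network to these two control signals.

```python
from itertools import product
from typing import Iterable


def network_conducts(own: bool, neighbors: Iterable[bool]) -> bool:
    """Series path: own output AND the logical sum (OR) of the neighbor outputs."""
    return bool(own) and any(neighbors)


def system_output(a: bool, b: bool, c: bool) -> bool:
    """Parallel combination (OR) of the three voting networks."""
    return (
        network_conducts(a, (b, c))
        or network_conducts(b, (a, c))
        or network_conducts(c, (a, b))
    )


def majority_2oo3(a: bool, b: bool, c: bool) -> bool:
    """Reference definition: at least two of the three inputs are asserted."""
    return (int(a) + int(b) + int(c)) >= 2


if __name__ == "__main__":
    # Exhaustively confirm that the network model matches 2-of-3 voting.
    for a, b, c in product([False, True], repeat=3):
        assert system_output(a, b, c) == majority_2oo3(a, b, c)
    print("network model matches 2-of-3 majority for all 8 input combinations")
```

Under these assumptions, a single output module that is stuck on or stuck off cannot by itself change the system output, because at least one other output module must agree with it.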
The fault-tolerant computer system of the '627 patent may be configured to be operational in the presence of up to two faults. However, the '627 system utilizes a simple watchdog timer (WDT) as its only diagnostic system. The WDT periodically monitors an associated output module of the computer system, and disconnects the output module from participation in the computer system output when the output module fails. Unfortunately, it is difficult to configure the WDT to detect faults with a probability that is greater than about 90%. Accordingly, the WDT is unable to reliably discover all of the failures that may occur in the output module of the fault-tolerant computer system. Thus, in some circumstances, if the output controller in the output module fails due to a hard failure, and this failure is not discovered by the WDT, the system performs a “false trip”. A false trip may lead to substantial financial losses, as well as significant harm to property or to individuals. Another disadvantage of such a system is that it requires about double the number of input modules, which increases the overall cost of the system. Furthermore, this fault-tolerant computer system does not include variants that allow it to operate with input/output (I/O) modules that are located in close proximity to a controlled process, but far away from a central computing unit or processor.
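The limitation of a simple WDT may be illustrated by the following generic sketch. It is not the WDT circuit of the '627 patent; the class name, the `kick` and `check` methods, and the timeout value are hypothetical. The sketch assumes a timer that disconnects the output module only when the module stops reporting within a timeout, so a module that keeps reporting on time while producing wrong output data goes undetected.

```python
class WatchdogTimer:
    """Hypothetical timeout-based watchdog for one output module."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_kick = 0.0
        self.output_connected = True

    def kick(self, now: float) -> None:
        """Called by the monitored module to signal that it is still running."""
        self.last_kick = now

    def check(self, now: float) -> bool:
        """Disconnect the module (latched) if it has been silent for too long."""
        if now - self.last_kick > self.timeout_s:
            self.output_connected = False
        return self.output_connected


if __name__ == "__main__":
    wdt = WatchdogTimer(timeout_s=1.0)
    wdt.kick(now=0.0)
    print(wdt.check(now=0.5))  # module reported recently, stays connected
    print(wdt.check(now=2.0))  # module silent past the timeout, disconnected
```

Because this kind of timer observes only timing, and not the correctness of the output data, its diagnostic coverage is inherently limited, which is consistent with the difficulty noted above.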
A safety instrumented system (SIS) includes two identical channels having a read-back diagnostic, which enables the system to operate in the presence of any single failure. Such systems, unfortunately, are not able to tolerate the occurrence of two concurrent faults. Accordingly, the various embodiments of the system discussed herein provide a dual-channel SIS that includes a diagnostic that allows the system to remain operational after the occurrence of certain kinds of two concurrent faults.
U.S. Patent Application Publication No. 2016/0283426 describes a control system comprising a first and a second controller module, where each controller module includes management circuitry that identifies which controller module operates in a master mode or a slave mode. This control system operates with the first controller module, but switches to the second controller module when the first controller module fails. Unfortunately, such a system has no means for determining which controller module is the first or the second by default after power up. In contrast, the various embodiments of the redundant computer system disclosed herein include a primary and a secondary processor module, with each processor module including hardware and software means that define which processor module is by default the primary or the secondary processor module. In addition, such hardware and software means of the various embodiments of the redundant computer system also continuously enable each processor module to change from a primary status to a secondary status in the event that the primary processor module fails.
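One way such default role assignment and status change could operate is sketched below. This is an illustrative assumption rather than a description of any disclosed embodiment: the `slot_id` field stands in for a hypothetical hardware-defined identifier, the rule that the lowest `slot_id` is primary by default is assumed, and the function names are invented for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Role(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"


@dataclass
class ProcessorModule:
    slot_id: int  # hypothetical hardware-defined identifier
    healthy: bool = True
    role: Role = Role.SECONDARY


def assign_default_roles(modules: Sequence[ProcessorModule]) -> None:
    """After power up, the module with the lowest slot_id becomes primary (assumed rule)."""
    lowest = min(m.slot_id for m in modules)
    for m in modules:
        m.role = Role.PRIMARY if m.slot_id == lowest else Role.SECONDARY


def handle_failure(a: ProcessorModule, b: ProcessorModule) -> None:
    """Swap roles if the current primary has failed and the secondary is healthy."""
    primary, secondary = (a, b) if a.role is Role.PRIMARY else (b, a)
    if not primary.healthy and secondary.healthy:
        primary.role, secondary.role = Role.SECONDARY, Role.PRIMARY


if __name__ == "__main__":
    first, second = ProcessorModule(slot_id=0), ProcessorModule(slot_id=1)
    assign_default_roles([first, second])
    first.healthy = False
    handle_failure(first, second)
    print(first.role, second.role)
```

In this sketch the default roles are unambiguous at power up because they derive from a fixed identifier, while the failure check may be run on every cycle so that the roles can change at any time during operation.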
Therefore, there is a need for a fault-tolerant computer system that overcomes the deficiencies of the current systems, including those of the '627 patent and the '426 publication discussed above, and that provides, in some embodiments, uninterrupted system operation that is capable of attaining safety levels in accordance with one or more standards or protocols, such as SIL 4 under IEC 61508.