The evolution of computers has allowed the proliferation of programmable control systems for handling critical tasks, such as industrial control of oil, gas, nuclear, and chemical processes, patient monitoring, aircraft flight control, and military equipment, among others. Within these systems, emergency shutdown systems are used in safety-critical applications for monitoring processes and moving the process to a safe state when selected process variables fall outside of a safe range.
Computer systems devoted to industrial safety-critical and critical-control applications must have extremely high degrees of safety and reliability, since faults in computer systems can cause vast economic losses and endanger human beings. A system failure that occurs as a result of a system component fault can be either safe or dangerous. A safe failure occurs when a system has failed into a safe state, or in other words, where the system does not disrupt the operation of other systems or compromise the safety of personnel associated with the system. A safe failure occurs, for example, when an emergency shutdown system (ESD) fails in such a way that it causes a shutdown not associated with the controlled process. A dangerous failure is a failure that prevents the system from responding to hazardous situations, allowing hazards to develop. For instance, a dangerous failure occurs when the ESD cannot perform a required shutdown.
Redundant configurations of computing systems have been used in research and designs to provide system fault tolerance, which is concerned with the continuation of correct operation of a system despite the occurrence of internal faults. Computer systems for use in safety-critical and critical-control applications are usually developed using either a Triple Modular Redundant (TMR) or a Dual Redundant (2oo2D) architecture.
The TMR system is the most common form of voting-based system. The TMR system disclosed in U.S. Pat. No. 6,449,732 includes three identical control channels, each of which independently executes an application program in parallel with the other channels. Each channel houses a main processor module (MPM) that communicates with the respective legs located in the I/O modules. Each input and output module of the TMR system comprises three identical legs in a redundant configuration. The system performs majority voting on all digital inputs and outputs from the field to mask possible input/output faults. Because three identical channels are used in combination with voting mechanisms, any single fault is masked by the 2-of-3 voting and does not lead to a system failure. The TMR system is also able to remain operational in the presence of up to two faulty main processor modules, since one healthy MPM can manage the system functions.
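The fault masking provided by 2-of-3 voting can be illustrated with a minimal sketch. The function name `vote_2oo3` and the boolean representation of a digital point are assumptions made for illustration only; they are not taken from the cited patent.

```python
def vote_2oo3(a: bool, b: bool, c: bool) -> bool:
    """Return the value reported by at least two of the three legs."""
    return (a and b) or (a and c) or (b and c)


# Hypothetical example: legs A and B read the true field value (True),
# while leg C is faulty and reads False. The faulty leg is outvoted.
voted = vote_2oo3(True, True, False)
print(voted)
```

A single faulty leg cannot change the voted result, but two legs failing in the same direction can, which is why such a voter on its own guarantees only single fault tolerance.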
In many cases, however, two concurrent faults lead to a system failure. The primary difficulty with the TMR system is the voter. When the TMR is used as an emergency shutdown system, it usually deploys digital output modules whose outputs are in the ON condition for each controlled point under normal operation and in the OFF condition for a shutdown. The digital output module uses quadruplicated output voter circuitry per point that provides two-out-of-three voting among the outputs of the three legs. The quadruplicated output voter circuitry consists of two parallel paths, each of which includes two switches in series. In the event that two switches located in parallel branches of the output voter circuitry concurrently fail, the TMR will make a false shutdown, that is, a shutdown not associated with the controlled process. In the event that two switches in series concurrently fail by remaining permanently in the ON condition, a dangerous failure occurs, since the system becomes unable to make a shutdown when one is required. In such an event the system becomes inoperative and must use some external means of making the shutdown to avoid the dangerous failure. In both of these scenarios the TMR system becomes inoperative as a result of two concurrent faults. The TMR system may also fail when two legs in the output module concurrently fail.
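The two failure scenarios of the quadruplicated output voter can be reproduced with a simplified model. This sketch assumes that all four switches receive the same voted command and that a failed switch is either stuck ON or stuck OFF; how the three legs actually drive the individual switches is not described here and is not modeled. The names `Fault`, `switch_conducts`, and `quad_output` are hypothetical.

```python
from enum import Enum


class Fault(Enum):
    NONE = "none"
    STUCK_ON = "stuck_on"    # switch conducts regardless of its command
    STUCK_OFF = "stuck_off"  # switch never conducts


def switch_conducts(command_on: bool, fault: Fault) -> bool:
    """State of one switch given its command and its (assumed) fault mode."""
    if fault is Fault.STUCK_ON:
        return True
    if fault is Fault.STUCK_OFF:
        return False
    return command_on


def quad_output(command_on: bool, faults) -> bool:
    """Output of two parallel paths, each made of two switches in series.

    faults is ((s1, s2), (s3, s4)): one tuple of Fault values per path.
    The output is ON if both switches of at least one path conduct.
    """
    return any(
        all(switch_conducts(command_on, f) for f in path)
        for path in faults
    )


N, ON, OFF = Fault.NONE, Fault.STUCK_ON, Fault.STUCK_OFF

# One stuck-OFF switch in each parallel path: output drops although ON is commanded.
false_shutdown = not quad_output(True, ((OFF, N), (OFF, N)))

# Both series switches of one path stuck ON: output stays ON although OFF is commanded.
dangerous = quad_output(False, ((ON, ON), (N, N)))

print(false_shutdown, dangerous)
```

In this model any single switch fault is tolerated, while the two double-fault cases above correspond to the false shutdown and the dangerous failure, respectively.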
A less expensive way of achieving fault tolerance and increasing reliability is to deploy a Dual Redundant architecture. One example of a Dual Redundant system (DRS) is the 2oo2D system described in Goble, W. M., “Control Systems Safety Evaluation & Reliability”, ISA (1998), pp. 364–375. The DRS includes two programmable controllers operating in parallel. Each controller has a central processor module and a set of associated input/output modules. Each controller also incorporates an independent diagnostic module that opens a special output switch to de-energize the controller outputs in the event that the controller fails. The DRS remains operational in the presence of one faulty controller and makes a shutdown in the event that both controllers fail concurrently. In general, the system has no single point of failure with regard to persistent faults that can occur in system components. Like the TMR, the 2oo2D system guarantees only single fault tolerance, since it may often become inoperative in the presence of two faulty components.
The DRS controllers are relatively simple and considerably less expensive than TMR controllers. However, the fault tolerance and reliability of the 2oo2D system depend in great part on the fault coverage, which is defined as the probability that a failure will be detected and recovered from if it occurs. In contrast to a voting-based system, the 2oo2D system has no fault-masking property. Hence, in the event that the two controllers produce different outputs, the system must make a shutdown if the fault in one controller cannot be detected, because the system has no means of determining which controller is producing the erroneous output.
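The dependence of the 2oo2D system on fault coverage can be sketched as a small decision function. This is one reading of the behavior described above, not the logic of any particular product; the `Controller` class, its fields, and `drs_output` are hypothetical names, and True is assumed to mean "output energized, process running".

```python
from dataclasses import dataclass


@dataclass
class Controller:
    output_on: bool       # True = keep the output energized, False = demand shutdown
    fault_detected: bool  # True if the diagnostic module has detected a fault


def drs_output(a: Controller, b: Controller) -> bool:
    """Return the system output; False means the output is de-energized (shutdown)."""
    # A controller with a detected fault is de-energized by its diagnostic switch.
    healthy = [c for c in (a, b) if not c.fault_detected]
    if not healthy:
        return False                  # both controllers failed: shutdown
    if len(healthy) == 1:
        return healthy[0].output_on   # the remaining controller carries the system
    # No fault detected in either controller: a disagreement cannot be
    # attributed to one of them, so the output stays ON only if both agree on ON.
    return a.output_on and b.output_on


# An undetected fault makes controller B demand OFF while A demands ON.
print(drs_output(Controller(True, False), Controller(False, False)))
```

When the diagnostics do catch the fault, the same disagreement is resolved in favor of the healthy controller, which is why the coverage figure dominates the spurious-trip behavior of such a system.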
An enhanced TMR system, such as the Hybrid Triple Redundant (HTR) computer systems disclosed in U.S. Pat. No. 6,510,018 and U.S. Pat. No. 6,670,038, combines two-out-of-three voting with diagnostic and fault recovery means configured such that the system remains operational in the presence of multiple faulty components. In general, the system is guaranteed to operate properly in the presence of up to two hard faults and may continue to operate properly with three or more persistent faults in any system components. Like the conventional TMR, the HTR system includes three identical control channels, each of which independently executes an application program in parallel with the other channels.
The HTR differs from the TMR in that the HTR system employs an innovative design of the output module and of the output voter, which remain operational in the presence of any two faulty components and may operate properly in the presence of more than two faults.
The HTR system provides fault recovery means that disable the outputs of a faulty leg of the output module and pass control of the system outputs to the neighboring legs, providing a 3-2-1-0 mode of operation. With three channels running, a two-out-of-three (2-of-3) vote for a shutdown is used. In the event that one channel fails, the voting becomes two-out-of-two (2-of-2). The failure of a second channel causes the HTR to revert to a one-out-of-one (1-of-1) mode. The failure of a third channel causes the HTR to go to a fail-safe state, i.e. to make a shutdown. These means also ensure that the output voter of each controlled point remains operational in the presence of any two faulty switches, and allow the output voter to operate properly in the presence of certain combinations of three or more faults. For example, the output module is able to revert from a 2-of-3 vote to a 2-of-2 vote if any two switches in series concurrently fail by remaining permanently in the ON condition.
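The 3-2-1-0 degradation sequence can be summarized in a short sketch. It assumes that each channel's health is already known from diagnostics and that a vote of True means the channel demands a shutdown; the function name `shutdown_demanded` and both list parameters are hypothetical and serve only to illustrate the voting rule.

```python
def shutdown_demanded(votes, healthy) -> bool:
    """Decide on a shutdown under the 3-2-1-0 scheme.

    votes[i]   : True if channel i votes for a shutdown
    healthy[i] : True if channel i has not been declared faulty
    """
    active = [v for v, h in zip(votes, healthy) if h]
    n = len(active)
    if n == 0:
        return True                 # no healthy channel left: fail-safe shutdown
    required = 2 if n >= 2 else 1   # 2-of-3, 2-of-2, then 1-of-1
    return sum(active) >= required


# Three healthy channels, two vote for shutdown: 2-of-3 is satisfied.
print(shutdown_demanded([True, True, False], [True, True, True]))

# Channel 3 has failed; the remaining two disagree, so 2-of-2 is not satisfied.
print(shutdown_demanded([True, False, True], [True, True, False]))
```

The vote of a channel that has been declared faulty is ignored, so a failed channel can neither force nor block a shutdown.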
The leading TMR systems are also able to operate in the 3-2-1-0 mode, but they cannot extend this capability completely to their input and output modules, since any output module may fail in the event that certain pairs of its components concurrently fail. In the event that two output switches in the voter circuitry concurrently fail in such a way that a system output is permanently ON, the TMR system must initiate a shutdown to avoid a dangerous system failure.
The HTR architecture provides a major improvement in system fault tolerance, safety, and reliability in comparison with control systems based on the TMR or Dual Redundant (DRS) architecture, which guarantee only single fault tolerance. It allows the user to deploy the HTR system in highly critical applications where two or more faults must be tolerated and where the TMR and DRS systems therefore cannot be accepted. The HTR controller can be implemented at a cost comparable to that of the TMR controller. Unfortunately, both HTR and TMR controllers are considerably more expensive than the Dual Redundant controller. This is especially true of industrial control applications in which the user must write a fairly complex application program that requires a large memory capacity in the central processor module. For this type of application the TMR controller must provide at least 16 Mb of memory for the user-written program and up to 8 Mb for an operating system that controls off-line/start-up operation, I/O data polling, I/O module communications, on-line continuous diagnostics, and external communications. To implement the functions listed above, the central processor module (CPM) usually incorporates basic components such as a powerful main processor, an I/O processor, a communication processor, and a large amount of DRAM and Flash memory. As a result, the CPM becomes very expensive, and three central processor modules contribute significantly to the total system cost.
Another difficulty with the TMR and HTR is the synchronization needed to ensure that each central processor operates in synchrony with the other two central processors, as a member of a triad. Each of the processors communicates with its neighbors for synchronization at least once per application execution cycle, and each processor reads the input, output, and diagnostic status of its neighbors. The processors then vote the input data and use the voted data as input to the application program. The synchronization and voting procedures are time consuming and hence have a significant impact on system speed. Synchronization problems increase dramatically when the system operates with a large number of inputs and outputs. In this case, the means that provide synchronization and voting can become quite complicated and expensive in order to avoid possible synchronization errors and to handle high system throughput.
According to statistical data from industry studies (Ref. 1), about 95% of control system failures are caused by malfunctions in I/O subsystems and field devices such as sensors and final control elements. Only 5% of control system failures occur as a result of failures in central processor modules. Because I/O modules and field devices are the most vulnerable components of control systems, the HTR configuration of I/O modules is a good solution, since it provides the highest level of fault tolerance in comparison with existing I/O configurations. On the other hand, central processor modules are less subject to failure than I/O modules, but they contribute significantly to the total system cost. Therefore, the use of a dual redundant configuration of central processor modules would make the system significantly less expensive than it is with triple redundant processor modules.
The Dual/Triple Redundant (DTR) system comprises dual redundant central processor modules and a triple redundant configuration of I/O modules. The DTR system is provided in either a parallel redundant or a hot standby version. The DTR system has no single point of failure with respect to the CPM, and it remains operational in the presence of any two or even more failures in any of the I/O modules. Since the DTR system is significantly less expensive than the HTR for the same application, the user can obtain an economic benefit without a significant sacrifice of system reliability.
A DTR system that is intended to operate with a large number of I/O modules comprises a plurality of I/O subsystems that work simultaneously to collect input data and perform two-out-of-three voting, relieving the central processor modules of these procedures. Synchronization problems, as well as the time required to collect the input data and perform the two-out-of-three voting, are considerably reduced, since each I/O subsystem operates with a relatively small number of I/O points. In addition, input data collection and two-out-of-three voting overlap the execution of the application program, which significantly increases the system throughput.