(1) Field of the Invention
The present invention generally pertains to a method for designing high integrity logic circuits. It is particularly directed toward safety-related control systems, including nuclear plant reactor protection systems, where integrity and reliability are of the highest importance. The present invention is especially directed toward implementing the methods in a logic device such as a PAL, CPLD, FPGA, ASIC, or gate array, or in a combination of multiple logic devices. Such logic devices are commonly installed on printed circuit boards.
(2) Description of Related Art
Others have attempted to improve the reliability of mission critical logic components in a computerized system. For example, U.S. Pat. No. 7,290,169 describes a core-level processor lock stepping system where two microprocessors are operated in parallel, and they each provide an external output signal which is compared. The microprocessors are meant to operate in lockstep, that is, to operate in a tightly coordinated manner so that their outputs will match in a reliable manner. In actual practice, this method has a number of problems for safety critical systems. It is difficult to keep the microprocessors completely in lockstep. There can be hidden failures in the system which are not uncovered until a system is actually used.
U.S. Pat. No. 7,237,144 provides similar operational thinking and difficulties but provides off chip lockstep checking to combat “soft errors.” It has the same difficulties as just described.
U.S. Pat. No. 6,233,702 describes a complicated multiple processor system providing fault tolerant data processing by employing hardware (e.g. fail functional, employing redundancy) and using software techniques (fail fast e.g. employing software recovery with high data integrity hardware). The error checking specifically avoids the utilization of redundancy to compare key data points between parallel processors, and instead only compares points that operate at slower rates such as at I/O points or in the main memory. This design is overly complicated and has a problem with unannounced errors which will be discussed shortly. It is a software based system with problems that will also be discussed shortly.
U.S. Pat. No. 7,134,104 describes a method of improving fault tolerance in an FPGA by creating at least three parallel copies of logical functions, and then using a voting scheme to determine if any particular copy is faulty. While this method generally improves fault tolerance, it is not a satisfactory scheme for a safety critical environment where it cannot be certain that the majority vote is always the non-faulty result.
U.S. Pat. No. 5,144,230 describes a self-test circuit that uses a method called cycle stealing. The output signal from a ‘circuit under test’ is tested by selectively applying a test input signal when the output signal is not required to perform its normal function. Though this is one possible method of checking a processor, the testing does not provide any protection against failures affecting dependent systems. When parallel redundancy is used, a voter scheme is used to determine the non-faulty result. These methods are unacceptable for a safety critical environment where a highly reliable system is desired.
US application 2007/0022348 describes parallel lock step cores which are similar to U.S. Pat. No. 7,290,169 already described except that intermediary values from the cores are also compared along with outputs. However, this system has all of the problems in maintaining two cores in lockstep. For example, when there is an error, caches have to be loaded into the system memory to ensure the lockstep is maintained going forward. The caches have to be maintained and verified on an ongoing basis when there are system or programming changes. The system is also software based.
There is a need in the art to provide a highly reliable system that is not a software based system. For example, in a safety critical system, such as a nuclear plant protection system, it is undesirable to be dependent upon executable software due to the nature of potential errors. Software has inherent operational problems that are difficult to resolve. Even relatively simple systems require a significant amount of program code. In particular, a software-microprocessor system is subject to common mode failure where parallel redundant systems may fail simultaneously due to a fault condition.
In spite of redundancy that may be included within software-microprocessor systems, a fault may occasionally affect enough redundant functions that it is not possible to correctly pick a non-faulty result, and the system will experience a common-mode failure. The common-mode failure may result from a single fault or several faults. It is known that microprocessor based systems are vulnerable to common-mode failures where redundant copies of software fail under the same fault. The common-mode fault, in particular, makes software-microprocessor systems undesirable in a plant protection system.
For the purposes of the present invention, the following definitions apply. A failure is the termination of the ability to perform a required function. See also mission failure. A failure that is not detected until the next test is called an unannounced failure. A failure that is detected, by any number of methods, at the instant of occurrence is called an announced failure. A mission is the singular objective, task, or purpose of an item or system. A mission failure is the inability to complete a stated mission within stated limits. Critical functions are the functions needed in a logic circuit in order for it to perform its mission.
In a safety related control system, a high integrity system will have two critical features:
1) It will perform its mission when called upon. The mission will typically be to actuate field devices when a predefined set of input conditions is present. To have a high assurance of performing its mission when called upon, no unannounced failures may exist in the system. Unannounced failures can cause the system to malfunction at the moment its mission is called upon. This means all failures must be detected and announced.
2) Unintended actuations of the control system due to logic circuit failures must be avoided. These actuations cause the field devices to perform their safety functions, which are often costly. To do this, all failures must be isolated and contained before they reach the field device.
A common method for increasing reliability and availability in logic circuits used in critical applications is to use triple modular redundancy (TMR), or a higher degree of redundancy. This is commonly done in nuclear, space and military applications. TMR logic circuits with a majority voting scheme allow for fault tolerance. If a majority of the redundant logic circuits are without failures, the system will perform its function. Unfortunately, if the majority is in error and the minority is correct, the voting scheme selects the erroneous result and the system uses that error in its function.
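The voting behavior described above can be illustrated with a minimal sketch. It is not taken from any of the cited patents; the function names (`majority_vote`, `trip_logic`, `faulty_trip_logic`) and the two-input trip condition are hypothetical and chosen only for illustration. The sketch assumes three redundant channels whose outputs are compared by a simple majority voter.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value reported by more than half of the redundant channels."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among channel outputs")
    return value


def trip_logic(pressure_high, level_low):
    """Hypothetical protection function: trip if either condition is present."""
    return pressure_high or level_low


def faulty_trip_logic(pressure_high, level_low):
    """Same function with its output stuck at False (it never trips)."""
    return False


inputs = (True, False)  # a trip is demanded

# One faulty channel: the two healthy channels outvote it.
one_fault = [trip_logic(*inputs), trip_logic(*inputs), faulty_trip_logic(*inputs)]
result_one_fault = majority_vote(one_fault)

# Two faulty channels: the majority is now wrong, and the voter follows it.
two_faults = [trip_logic(*inputs), faulty_trip_logic(*inputs), faulty_trip_logic(*inputs)]
result_two_faults = majority_vote(two_faults)

print(result_one_fault, result_two_faults)
```

With a single faulty channel the voter still reports the trip. Once a second channel has failed in the same way, the voter reports the faulty majority and the trip is missed.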
If failures are allowed to accumulate in a TMR system it could have catastrophic effects. In particular, if it is applied to a safety critical application, the system could fail in its function to shut a system down or take appropriate corrective action to eliminate a problem before it becomes critical.
Failures in TMR logic circuits can be detected by comparing the outputs of the redundant logic circuits. However, this comparison cannot detect unannounced failures, i.e., failures in logic circuits which do not result in an output change. Unannounced failures in the system are not found until the particular logic function is exercised, that is, until the particular logic pathways are utilized.
Unannounced failures are particularly a problem in nuclear safety systems, which are normally in a “waiting” position where no inputs or outputs are changing state. The safety systems may remain in this state for extended periods of time, allowing unannounced failures to accumulate. Unannounced failures may sit undetected for weeks, months, or even years.
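The limitation of output comparison can be sketched in a few lines. This is an illustration under stated assumptions, not a model of any particular system: the channel functions, the two-sensor AND condition, and the stuck-at fault are all hypothetical. One of three channels carries a latent fault on an input path, and the comparison only reveals it once the inputs exercise that path.

```python
def healthy_channel(sensor_a, sensor_b):
    """Hypothetical trip condition: both sensors must be active."""
    return sensor_a and sensor_b


def latent_fault_channel(sensor_a, sensor_b):
    """Same logic, but the sensor_b path is stuck at False."""
    return sensor_a and False


def outputs_agree(inputs):
    """Compare the outputs of three redundant channels for one input vector."""
    a = healthy_channel(*inputs)
    b = healthy_channel(*inputs)
    c = latent_fault_channel(*inputs)
    return a == b == c


idle = (False, False)   # "waiting" state: no trip condition present
demand = (True, True)   # both sensors active: a trip is demanded

agree_while_idle = outputs_agree(idle)    # fault is present but invisible
agree_on_demand = outputs_agree(demand)   # fault shows up only now

print(agree_while_idle, agree_on_demand)
```

While the inputs stay in the idle state, all three outputs match and the comparison reports nothing. The disagreement appears only when the demand arrives, which is the moment the system is needed.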
Adding TMR to a system inherently adds complexity, which reduces overall reliability. Maintenance is increased by the additional logic and programming. Adding further redundant modules (4 or more) will improve protection against unannounced failures by decreasing the probability that they build up and affect the voting logic, but at the expense of a proportional decrease in reliability and an increase in complexity.