The present invention relates to a self-checking circuit and its method of operation. More particularly, it concerns a self-checking circuit useful for a highly reliable system configuration.
Also, the present invention relates to a method of management of a redundant resource, and more particularly concerns an effective use of the redundant resource in a fault tolerant computer system.
Control systems for airplanes, trains, automobiles, and similar means of transportation have become increasingly integrated, since advanced control performance is needed to increase their energy (fuel) efficiency, operability, comfort, and operating speed. To run such transportation systems safely, their control systems are strongly required to be highly reliable and fail-safe, so that no dangerous output is caused by the occurrence of a fault.
To assure the reliability and fail-safe performance of the control system, it is important for the control system to have a capability of detecting the occurrence of a fault, that is, a self-checking capability. To accomplish such a self-checking capability, a so-called redundant code is generally used that has a Hamming distance of 2 or more between code words, such as the M-out-of-N code and two-rail logic (1-out-of-2 code), which can be regarded as a kind of M-out-of-N code. The redundant code can perfectly detect a fault as long as it is a single fault. However, it cannot always detect multiple faults. If a self-checking circuit is implemented in an LSI, a fault may spread over the whole chip. This would be a phenomenon equivalent to the occurrence of multiple faults. Assuming that errors are random, Eq. 1 below gives the probability $\eta$ that wrong output signals caused by a fault coincide with code points in a specific output code space O:
$$\eta = N_o / N_u \tag{1}$$
where $N_o$ is the number of code points in the output code space O and $N_u$ is the number of points in the whole output space. The problem, therefore, is how to make $N_u$ large relative to $N_o$ so as to increase the detection rate.
(1) The computer module broadcasts its fault occurrence information (fault detection results) and process results to the other computer modules at a proper timing (check points) while processing the task.
(2) The computer modules calculate their respective evaluation functions Fij, where i is a processor number and j is a task number. The evaluation function Fij can be regarded as a margin for the responsibility to be taken on by the computer module for the task. It is based on equality or inequality of the fault occurrence information (fault detection results) and process results broadcast from the other computer modules.
(3) Each of the computer modules selects the task j that minimizes the evaluation function Fij as the process to execute, before switching from the task in process to the process to be executed.
where Lthij is a threshold value of the reliability level of task j in the computer module i, Lrj is the reliability level of task j, i is the computer module number, and j is the task number.
where Pej is a probability of wrong calculation results of task j.
There are the following two methods to accomplish a self-checking circuit having such redundant codes as described above.
(1) A method of forming the whole circuit of redundant codes.
(2) A method of replicating function blocks and using a self-checking comparison circuit formed of redundant codes to compare signals output from the function blocks.
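To make Eq. 1 concrete, the following minimal sketch evaluates the ratio of code points to all output points for the two code families named above. It assumes, purely for illustration, that the output is an n-bit word so that the whole output space has 2**n points; the function names are hypothetical and not part of the original text.

```python
from math import comb


def eta_m_out_of_n(m: int, n: int) -> float:
    """Eq. 1 for a single m-out-of-n code word.

    Assumption of this sketch: the output is an n-bit word, so the whole
    output space has Nu = 2**n points, of which No = C(n, m) are code points.
    """
    no = comb(n, m)
    nu = 2 ** n
    return no / nu


def eta_two_rail(pairs: int) -> float:
    """Eq. 1 for `pairs` independent two-rail (1-out-of-2) signal pairs.

    Each pair has 2 valid states out of 4, so No = 2**pairs and Nu = 4**pairs.
    """
    return (2 ** pairs) / (4 ** pairs)


if __name__ == "__main__":
    for m, n in [(1, 2), (2, 4), (3, 6)]:
        print(f"{m}-out-of-{n}: eta = {eta_m_out_of_n(m, n)}")
    for k in (1, 4, 8):
        print(f"{k} two-rail pairs: eta = {eta_two_rail(k)}")
```

Under these assumptions the ratio shrinks as more redundant signal pairs are checked together, which is the sense in which the whole output space should be made large relative to the code space.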
Method (1) above has the problems that the circuit must be newly designed to make self-checking possible and that its operation speed is difficult to optimize.
On the other hand, method (2) has the advantage that a conventional processor, memory, and other devices can be used for the function blocks, since only the comparison circuit needs to be newly designed in redundant logic. This can decrease the development cost to a great extent. It also makes a high operation speed easy to achieve, since advanced semiconductor techniques can be used. The self-checking coverage of method (2), however, greatly depends on that of the comparator.
Accordingly, to provide a self-checking comparator, it was proposed to use redundant codes, such as the M-out-of-N code and two-rail logic (1-out-of-2 code), for the logic itself used in the comparison circuit. See, for example, Yoshihiro Toma, "Theory of Fault Tolerant System," Association of Electronics, Information and Communications, 1990. To realize a self-checking comparator, the RCCO (Reduction Circuit for Checker Output) circuit shown in FIG. 2.5 on page 31 of the publication was connected to a tree structure as shown in FIG. 2.6 on page 32 thereof.
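The idea of a comparator built from two-rail logic and connected as a tree can be sketched as follows. This is a minimal illustration that assumes the textbook two-rail checker cell as a stand-in; the RCCO circuit of the cited publication may differ in detail. All function names are hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[int, int]  # a two-rail signal; (0, 1) and (1, 0) are code words


def two_rail_cell(x: Pair, y: Pair) -> Pair:
    """Reduce two two-rail pairs to one.

    The output is a code word only if both inputs are code words.
    """
    x0, x1 = x
    y0, y1 = y
    z0 = (x0 & y0) | (x1 & y1)
    z1 = (x0 & y1) | (x1 & y0)
    return z0, z1


def reduce_tree(pairs: List[Pair]) -> Pair:
    """Connect the cells as a tree until a single pair remains."""
    if not pairs:
        raise ValueError("at least one pair is required")
    level = list(pairs)
    while len(level) > 1:
        nxt = [two_rail_cell(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd pair passes through to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]


def compare_words(a: int, b: int, width: int) -> Pair:
    """Compare two words by pairing each bit of a with the inverted bit of b.

    Equal bits give a complementary pair (a code word); unequal bits give
    (0, 0) or (1, 1), which propagates to the output as a non-code word.
    """
    pairs = [((a >> i) & 1, 1 - ((b >> i) & 1)) for i in range(width)]
    return reduce_tree(pairs)


def is_code_word(p: Pair) -> bool:
    return p[0] != p[1]


if __name__ == "__main__":
    print(is_code_word(compare_words(0b1011, 0b1011, 4)))  # inputs agree
    print(is_code_word(compare_words(0b1011, 0b1001, 4)))  # inputs differ
```

In this sketch a non-code output signals either a mismatch between the compared words or an invalid pair inside the tree, so the comparison result itself is carried in redundant form.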
The probability of a fault occurring in the circuits to be compared is low, so it is rare that the signals to be compared do not coincide. This means that the path activated upon detection of an inequality is rarely exercised. If a stuck-at type fault occurs in that path, so that its output signal always represents `equality,` the fault may remain latent. The comparison circuit, therefore, not only uses the redundant code described above, but also uses frequency logic, an alternating checking method, or similar dynamic logic that alternates signal levels to produce a signal indicating that the circuit is normal (hereinafter referred to as a signature signal), in place of the binary level logic of 0 and 1. As an example, a permuter for injecting a simulated fault for testing can be placed in the RCCO, as shown in FIGS. 2.15 and 5.16 on page 42 of the abovementioned "Theory of Fault Tolerant System." In this way, an alternating output signal is obtained if the operation is normal. The alternating output signal is not obtained, on the other hand, if a fault occurs through a change in the threshold value of a semiconductor device or a change in a dc characteristic of the device, such as a failure stuck at 0 or 1. The method also injects a simulated fault periodically, so that operation of the error detection feature is continually confirmed. These advantages can greatly increase the self-checking performance of a circuit.
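The behavior of such a signature signal can be illustrated with a highly simplified behavioral model. The sketch below assumes that a simulated fault is injected on every other cycle and that the injection simply forces an `unequal` result; this is a stand-in for the permuter of the cited publication, not a description of it, and all names are hypothetical.

```python
from typing import List, Optional, Sequence


def checker_output(a: int, b: int, inject: bool,
                   stuck_at: Optional[int] = None) -> int:
    """Return 1 for 'equal' and 0 for 'unequal'.

    `inject` models the simulated test fault; `stuck_at` models a real
    stuck-at-0 or stuck-at-1 failure of the checker output.
    """
    if stuck_at is not None:
        return stuck_at
    mismatch = (a != b) or inject
    return 0 if mismatch else 1


def signature(a_words: Sequence[int], b_words: Sequence[int],
              stuck_at: Optional[int] = None) -> List[int]:
    """Inject the simulated fault on odd cycles and record the output."""
    return [checker_output(a, b, inject=(t % 2 == 1), stuck_at=stuck_at)
            for t, (a, b) in enumerate(zip(a_words, b_words))]


def is_alternating(sig: Sequence[int]) -> bool:
    """The signature counts as 'normal' only if every level change occurs."""
    return len(sig) >= 2 and all(x != y for x, y in zip(sig, sig[1:]))


if __name__ == "__main__":
    healthy = signature([3, 3, 3, 3], [3, 3, 3, 3])
    stuck = signature([3, 3, 3, 3], [3, 3, 3, 3], stuck_at=1)
    mismatch = signature([3, 3, 3, 3], [3, 3, 2, 2])
    for name, sig in [("healthy", healthy), ("stuck-at-1", stuck),
                      ("mismatch", mismatch)]:
        print(name, sig, is_alternating(sig))
```

In this model only the healthy case alternates; a stuck output or a genuine mismatch breaks the alternation, so a fault in the rarely used `unequal` path does not stay latent.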
The above-described prior art has the disadvantage that it is susceptible to the adverse effects of crosstalk or a short circuit between wiring nets in the semiconductor device. If a fault of the semiconductor device causes crosstalk between the wiring nets, or if migration of a wiring material or poor insulation between layers causes a short circuit between them, a wiring net that should not itself carry a signature signal may have the signature signal of another wiring net induced into it (hereinafter referred to as a counterfeit signature). In general, a fail-safe circuit uses a signature signal to indicate that the circuit is normal. A circuit receiving a counterfeit signature due to crosstalk or a short circuit may therefore be recognized as normal even though it is faulty, and there is the fear that the fail-safe performance of the circuit may be lost.
To prevent such crosstalk and short circuits, the prior art imposes special design restrictions on the wiring spacing. However, this technique requires the transistors and wiring lines on the semiconductor substrate to obey restrictions quite different from those of general semiconductor devices. It therefore cannot take advantage of existing automatic design tools, and most of the design work must be performed manually.
Further, in recent years computers have come to play central roles in finance and similar key social infrastructure, in transportation control, and in applications involving human life such as the control of spaceships and airplanes. System breakdown or wrong system operation due to a fault of the computers can have fatal effects on society. Given this trend, high reliability of the computers is increasingly needed.
To make the computers reliable, redundancy is generally employed by providing extra computers, or extra units forming the computer, in advance.
On the other hand, the redundant hardware provided to make the computer highly reliable greatly increases the cost, dimensions, weight, and power consumption. To enhance the return on investment, or cost performance, of the fault tolerant computer system, it is necessary to use the redundant hardware resource effectively with respect to both reliability and processing performance.
One method of redundant resource management that makes use of the redundant hardware resource is proposed by Jean-Charles Fabre, et al., "Saturation: reduced idleness for improved fault-tolerance," Proc. FTCS-18 (The 18th Int'l Symp. on Fault-tolerant Computing), pp. 200-205, 1988.
The proposal by Jean-Charles Fabre, et al., mentioned above, specifies in advance, for each of a plurality of tasks, an MNC (minimum number of copies), that is, the number of redundant copies to be executed simultaneously. If the number of idle nodes (redundant computer modules) is larger than the MNC at the time of arrival of a task execution request, the idle nodes start execution of the task. If the number of idle nodes is smaller than the MNC, the system waits until tasks currently in execution end, so that the required number of idle nodes becomes available.
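The admission rule described above can be sketched as a small scheduler. This is a minimal illustration and not the implementation of the cited proposal: it assumes first-come-first-served waiting, starts a task when the number of idle nodes is at least the MNC (the text does not state the case of equality), and assigns exactly MNC nodes per task. The class and method names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Set


@dataclass
class Task:
    name: str
    mnc: int  # minimum number of copies to execute simultaneously


class MncScheduler:
    """Node assignment is decided only when a task starts, as in the text."""

    def __init__(self, n_nodes: int) -> None:
        self.idle: Set[int] = set(range(n_nodes))
        self.waiting: Deque[Task] = deque()
        self.running: Dict[str, Set[int]] = {}

    def submit(self, task: Task) -> None:
        """A task execution request arrives."""
        self.waiting.append(task)
        self._dispatch()

    def finish(self, name: str) -> None:
        """A task ends; its nodes become idle and waiting tasks are retried."""
        self.idle |= self.running.pop(name)
        self._dispatch()

    def _dispatch(self) -> None:
        # Assumption: strict FIFO order; the head task blocks later ones.
        while self.waiting and len(self.idle) >= self.waiting[0].mnc:
            task = self.waiting.popleft()
            nodes = {self.idle.pop() for _ in range(task.mnc)}
            self.running[task.name] = nodes


if __name__ == "__main__":
    s = MncScheduler(3)
    s.submit(Task("A", 2))  # starts on two of the three nodes
    s.submit(Task("B", 2))  # only one idle node remains, so B waits
    print(sorted(s.running), [t.name for t in s.waiting])
    s.finish("A")           # nodes are released and B can start
    print(sorted(s.running), [t.name for t in s.waiting])
```

Note that nothing in this sketch reacts to a node failing while a task is running; node counts are examined only at dispatch time.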
The proposal by Jean-Charles Fabre, et al., mentioned above is a useful method of redundant resource management for an OLTP (online transaction processor), in which task start requests are made frequently.
However, the prior art does not sufficiently consider the occurrence of a fault, or the further occurrence of multiple faults, from the viewpoint of providing a highly reliable real time control computer. This is because the proposed method is based on the assumption that the task execution time is sufficiently shorter than the MTBF (mean time between failures), which reflects the operational characteristic of the OLTP that a transaction ends in a short time. However, a real time control computer often has tasks that execute for a long period of time. The computer of an airplane, spaceship, etc., for example, must not only run normally for the mission time, but must also provide support even when the mission is halted. For this reason, the task execution time cannot be ignored in comparison with the MTBF, and the occurrence of a fault and the further occurrence of multiple faults must be taken into account.
The above-described prior method manages the number of assigned computer modules only at the start of task execution. Therefore, no computer modules are newly added even if a computer module executing the task fails because of a fault during execution. This means that if a fault occurs during execution of the task, execution continues with a decreased degree of redundancy, that is, a decreased number of computer modules redundantly executing the task, and the reliability of the task is reduced. For example, if one of two computer modules redundantly executing a task fails and a second fault then occurs, execution of the task is halted.
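The effect of task execution time relative to the MTBF can be illustrated numerically. The sketch below assumes, for illustration only, that module failures are independent with exponentially distributed times to failure and that no module is repaired or replaced during the task; neither assumption is stated in the original text, and the function names are hypothetical.

```python
import math


def p_fail_during_task(task_time: float, mtbf: float) -> float:
    """Probability that one module fails within task_time.

    Assumes an exponential time to failure with mean `mtbf`.
    """
    return 1.0 - math.exp(-task_time / mtbf)


def p_all_copies_fail(task_time: float, mtbf: float, copies: int) -> float:
    """Probability that every redundant copy fails before the task ends.

    Assumes independent failures and no replacement of failed modules.
    """
    return p_fail_during_task(task_time, mtbf) ** copies


if __name__ == "__main__":
    mtbf = 10_000.0  # hours, an arbitrary illustrative value
    for task_time in (1.0, 1_000.0):  # short transaction vs. long mission
        single = p_fail_during_task(task_time, mtbf)
        double = p_all_copies_fail(task_time, mtbf, copies=2)
        print(f"T = {task_time} h: one module {single:.3e}, "
              f"both of two modules {double:.3e}")
```

Under these assumptions the failure probability grows roughly in proportion to the ratio of task time to MTBF while that ratio is small, which is why a long-running control task cannot rely on the short-transaction assumption.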