The present invention relates to a self-checking circuit and its method of operation. More particularly, it concerns a self-checking circuit useful for a highly reliable system configuration.
Also, the present invention relates to a method of managing a redundant resource, and more particularly concerns effective use of the redundant resource in a fault tolerant computer system.
Control systems for airplanes, trains, automobiles, and similar means of transportation have become increasingly integrated, as advanced control performance is needed to improve energy (fuel) efficiency, operability, comfort, and operating speed. To run such transportation systems safely, their control systems are strongly required to be highly reliable and fail-safe, so that the occurrence of a fault causes no dangerous output.
To assure the reliability and fail-safe performance of a control system, it is important for the control system to be capable of detecting the occurrence of a fault, that is, to have a self-checking capability. To accomplish such a self-checking capability, a so-called redundant code is generally used that has a Hamming distance of at least 2 between code words, such as the M-out-of-N code and two-rail logic (1-out-of-2 code), which can be regarded as a kind of M-out-of-N code. The redundant code can always detect a fault as long as it is a single fault. However, it cannot always detect multiple faults. If a self-checking circuit is implemented in an LSI, a fault may spread over the whole chip. This is equivalent to the occurrence of multiple faults. Assuming the errors are random, Eq. 1 below gives the probability η that a fault produces a wrong output signal coinciding with a code point in a specific output code space O.
η = No/Nu    (1)
where No is the number of code points in the output code space O and Nu is the total number of code points. Therefore, the problem is how to increase the ratio of Nu to No in order to increase the detection rate.
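As a rough numerical illustration of Eq. 1, the following sketch evaluates No/Nu for M-out-of-N codes. It assumes, as an illustration only, that No is the number of valid code words C(N, M) and that Nu is the number of all N-bit patterns, 2^N; the function names are hypothetical.

```python
from math import comb


def undetected_probability(n_code_points: int, n_total_points: int) -> float:
    """Eq. 1: probability that a random error lands on a valid code point."""
    if not 0 < n_code_points <= n_total_points:
        raise ValueError("need 0 < No <= Nu")
    return n_code_points / n_total_points


def m_out_of_n_probability(m: int, n: int) -> float:
    # Assumption: No = C(n, m) valid words, Nu = 2**n possible bit patterns.
    return undetected_probability(comb(n, m), 2 ** n)


if __name__ == "__main__":
    print(m_out_of_n_probability(1, 2))  # two-rail logic: 2/4
    print(m_out_of_n_probability(2, 4))  # 2-out-of-4 code: 6/16
```

Under these assumptions, widening the code (larger Nu relative to No) lowers the chance that a random multiple fault is mistaken for a valid output.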
There are the following two methods to accomplish a self-checking circuit having such redundant codes as described above.
(1) A method of forming the whole circuit of redundant codes.
(2) A method of replicating function blocks and using a self-checking comparison circuit formed of redundant codes to compare signals output from the function blocks.
Method (1) above has the problems that the circuit must be newly designed to make self-checking possible and that it is difficult to optimize its operating speed.
On the other hand, method (2) has the advantage that conventional processors, memories, and other devices can be used for the function blocks, since only the comparison circuit needs to be newly designed in redundant logic. This can greatly decrease the development cost. It also makes a high operating speed easy to achieve, since advanced semiconductor techniques can be used. The self-checking coverage of method (2) depends greatly on that of the comparator.
Accordingly, to provide a self-checking comparator, it was proposed to use redundant codes, such as the M-out-of-N code and two-rail logic (1-out-of-2 code), for the logic itself used in the comparison circuit. See, for example, Yoshihiro Toma, “Theory of Fault Tolerant System,” Association of Electronics, Information and Communications, 1990. To realize a self-checking comparator, the RCCO (Reduction Circuit for Checker Output) circuits shown in FIG. 2.5 on page 31 of the publication were connected in a tree structure as shown in FIG. 2.6 on page 32 thereof.
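The idea of a comparator built from two-rail logic and reduced through a tree can be sketched as follows. The cell used here is the textbook two-rail checker cell, which is an assumption and is not necessarily identical to the RCCO of the cited publication; all names are hypothetical.

```python
from typing import List, Tuple

Rail = Tuple[int, int]  # two-rail pair; (0, 1) and (1, 0) are the valid code words


def checker_cell(a: Rail, b: Rail) -> Rail:
    """Textbook two-rail checker cell: the output is valid only if both inputs are."""
    a0, a1 = a
    b0, b1 = b
    return ((a0 & b0) | (a1 & b1), (a0 & b1) | (a1 & b0))


def compare_words(x: List[int], y: List[int]) -> Rail:
    """Compare two bit vectors by reducing the pairs (x_i, NOT y_i) through a tree."""
    if not x or len(x) != len(y):
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = [(xi, 1 - yi) for xi, yi in zip(x, y)]  # valid pair iff x_i == y_i
    while len(pairs) > 1:
        reduced = [checker_cell(pairs[k], pairs[k + 1])
                   for k in range(0, len(pairs) - 1, 2)]
        if len(pairs) % 2:          # odd element passes to the next tree level
            reduced.append(pairs[-1])
        pairs = reduced
    return pairs[0]


def is_valid(r: Rail) -> bool:
    return r[0] != r[1]


if __name__ == "__main__":
    print(is_valid(compare_words([1, 0, 1, 1], [1, 0, 1, 1])))  # equal words
    print(is_valid(compare_words([1, 0, 1, 1], [1, 0, 0, 1])))  # one bit differs
```

In this sketch the final pair stays a valid code word only while every compared bit agrees, so a single disagreement anywhere in the tree shows up as an invalid (0, 0) or (1, 1) output.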
The probability of a fault occurring in the circuits being compared is low, so it is rare that the signals to be compared disagree. This means that the path activated upon detection of an inequality is rarely activated. If a fault such as a stuck-at fault fixes the output of that path so that it always represents ‘equality,’ it is feared that the fault will remain latent. The comparison circuit, therefore, not only uses the redundant code described above, but also uses frequency logic, an alternating checking method, or similar dynamic logic that alternates signal levels as the signal indicating that the circuit is normal (hereinafter referred to as a signature signal), in place of the binary level logic of 0 and 1. As an example, one can use a method of positioning a permuter for injecting a simulated test fault into the RCCO, shown in FIGS. 2. and 5.16 on page 42 of the above-mentioned “Theory of Fault Tolerant System.” In this way, an alternating output signal is obtained if the operation is normal. On the other hand, the alternating output signal is not obtained if a fault arises from a change in the threshold value of a semiconductor device or from a change in a dc characteristic of the device, such as a failure stuck at 0 or 1. The method also injects a simulated fault periodically, so that operation of the error detection feature is continuously confirmed. These advantages can greatly increase the self-checking performance of a circuit.
The above-described prior art has the disadvantage that it is susceptible to crosstalk and short circuits between wiring nets in the semiconductor device. If a fault of the semiconductor device causes crosstalk between the wiring nets, or if migration of a wiring material or poor insulation of an insulation layer causes a short circuit, a wiring net that should not carry a signature signal may have the signature signal of another wiring net induced into it (hereinafter referred to as a counterfeit signature). In general, a fail-safe circuit uses a signature signal to indicate that the circuit is normal. Because of a counterfeit signature caused by crosstalk or a short circuit, the circuit may be recognized as normal even though it is faulty, and the fail-safe performance of the circuit may be lost.
To prevent such crosstalk and short circuits, the prior art imposes special design restrictions on the wiring spacing. However, because this technique requires the transistors and wiring lines on the semiconductor substrate to obey restrictions quite different from those of general semiconductors, it cannot take advantage of existing designs or automatic design tools. Most of the design work must be performed manually.
Further, in recent years computers have come to play central roles in finance, transportation control, and similar key social industries, and in areas that involve human life, such as the control of spaceships and airplanes. A system breakdown or incorrect system operation due to a computer fault can spread and have fatal effects on society. Given this trend, high reliability of computers is increasingly needed.
To make computers reliable, redundancy is generally employed by providing in advance extra computers and extra units forming the computer.
On the other hand, the redundant hardware provided to make the computer highly reliable greatly increases the cost, dimensions, weight, and power consumption. To enhance the return on investment, or the cost performance, of a fault tolerant computer system, it is necessary to use the redundant hardware resource effectively with respect to both reliability and processing performance.
There is a method of redundant resource management that makes use of a redundant hardware resource. It was proposed by Jean-Charles Fabre, et al., “Saturation: reduced idleness for improved fault-tolerance,” Proc. FTCS-18 (The 18th Int'l Symp. on Fault-tolerant Computing), pp. 200-205, 1988.
In the proposal by Jean-Charles Fabre, et al., mentioned above, an MNC (minimum number of copies), that is, the number of redundant copies to be executed simultaneously, is provided in advance for each of a plurality of tasks. If the number of idle nodes (redundant computer modules) is larger than the MNC when a task execution request arrives, the idle nodes start executing the task. If the number of idle nodes is smaller than the MNC, the system waits until tasks currently in execution end, so that the required number of idle nodes becomes available.
The proposal by Jean-Charles Fabre, et al., mentioned above is a useful method of redundant resource management for OLTP (online transaction processing), in which task start requests are made frequently.
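The admission rule of this prior method, as summarized above, can be sketched in a few lines. The function name is hypothetical, and the treatment of the case where the number of idle nodes exactly equals the MNC is an assumption, since the summary above does not state it.

```python
def admit(idle_nodes: int, mnc: int) -> int:
    """Return how many nodes start the task now, or 0 if the request must wait.

    Assumption: having exactly `mnc` idle nodes is treated as sufficient.
    """
    if idle_nodes >= mnc:
        return idle_nodes   # all idle nodes execute the task redundantly
    return 0                # wait until running tasks end and free more nodes


if __name__ == "__main__":
    print(admit(idle_nodes=3, mnc=2))
    print(admit(idle_nodes=1, mnc=2))
```

Note that the decision is taken only once, when the request arrives; nothing in this rule reacts to a node failing after the task has started.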
However, the prior art does not sufficiently consider the occurrence of a fault, or the further occurrence of multiple faults, from the viewpoint of making a real time control computer highly reliable. This is because the proposed method assumes that the task execution time is sufficiently shorter than the MTBF (mean time between failures), an assumption based on the operational characteristic of OLTP that a transaction ends in a short time. However, a real time control computer often has tasks that execute for a long period of time. The computer of an airplane, spaceship, etc., for example, must not only run normally for the duration of the mission, but must also provide support even when the mission is halted. For this reason, the task execution time cannot be ignored in comparison with the MTBF, and the occurrence of a fault and the further occurrence of multiple faults must be taken into account.
The above-described prior method manages the number of assigned computer modules only at the start of task execution. Therefore, no computer module is newly added even if a computer module executing the task fails because of a fault during execution. This means that if a fault occurs during execution of a task, execution continues with a decreased degree of redundancy, that is, with fewer computer modules redundantly executing the task, and the reliability of the task is reduced. For example, if one of two computer modules redundantly executing a task fails and a second fault then occurs, execution of the task is halted.
A first feature of the present invention is a logic circuit having an error detection function, which has a plurality of function blocks that are at least duplexed and that feed out a plurality of signals, compares the output signals of the function blocks, and detects an error on the basis of the results of the comparison. The logic circuit comprises synthesizing means for superimposing inherent waveforms, assigned in advance to the respective output signals of the function blocks, onto the output signals of one of the function blocks, and comparison means for comparing the signal output of the synthesizing means with the signal output of the other function block to detect an error.
For a semiconductor device, as an example, an inherent signal waveform is assigned to each of the wiring nets corresponding to the above-mentioned output signals as a signature. The signature should be regarded as authentic only if the signal waveform coincides with the one inherent to the wiring net.
To distinguish an authentic signature from a counterfeit signature, it is desirable that the signatures inherent to the wiring nets do not correlate with one another. Orthogonal functions are well known not to correlate with one another. Functions fi(x) and fj(x) are orthogonal to each other when

$$\int_{-\infty}^{\infty} f_i(x)\, f_j(x)\, dx = 0 \qquad \text{(eq. 2)}$$
Wavelet analysis, which can analyze a signal waveform in the time-frequency domain, has recently attracted attention as an alternative to conventional Fourier analysis. The original wavelet is also an orthogonal function. Trigonometric functions and wavelets are analog functions. To use them in a digital circuit, they must be made binary.
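One familiar family of binary orthogonal waveforms is the set of Walsh (Hadamard) sequences. The text above does not name them, so their use here is purely an illustrative assumption. The sketch builds the sequences with the Sylvester construction and checks the discrete counterpart of the orthogonality condition, with the integral replaced by a sum over one period.

```python
from typing import List


def hadamard_rows(order: int) -> List[List[int]]:
    """Sylvester construction: 2**order mutually orthogonal +1/-1 sequences."""
    rows = [[1]]
    for _ in range(order):
        # H_2n = [[H, H], [H, -H]]
        rows = [r + r for r in rows] + [r + [-v for v in r] for r in rows]
    return rows


def inner_product(a: List[int], b: List[int]) -> int:
    """Discrete analogue of the orthogonality integral over one period."""
    return sum(x * y for x, y in zip(a, b))


def to_logic_levels(seq: List[int]) -> List[int]:
    """Map +1/-1 to logic levels 1/0 for use in a digital circuit."""
    return [(v + 1) // 2 for v in seq]


if __name__ == "__main__":
    rows = hadamard_rows(3)
    for i, r in enumerate(rows):
        print(i, to_logic_levels(r), [inner_product(r, s) for s in rows])
```

The first sequence is constant, so a design relying on alternating signatures would presumably skip it and assign only the remaining sequences to wiring nets.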
With the first feature of the present invention, for a semiconductor device, as an example, an inherent signal waveform is assigned to each of the wiring nets as a signature. The signature is regarded as authentic only if the signal waveform coincides with the one inherent to the wiring net. If a fault of the semiconductor device causes crosstalk between the wiring nets, or if migration of a wiring material or poor insulation of an insulation layer causes a short circuit, a wiring net may have the signature signal of another wiring net induced into it as a counterfeit signature. Should this happen, the counterfeit signature can be distinguished from the authentic signature, since it does not coincide with the signal waveform inherent to the wiring net. This means that the present invention does not need the special wiring restrictions for preventing crosstalk and short circuits that are indispensable to the prior art for full fault detection. In addition, the present invention assures fail-safe performance.
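The distinction between an authentic and a counterfeit signature can be illustrated with a minimal sketch. The net names, the two waveforms, and the correlation-based test are all hypothetical choices made for illustration; the waveforms were picked so that they are orthogonal when mapped to +1/-1.

```python
from typing import Dict, List

# Hypothetical signatures, one period each, assigned per wiring net.
SIGNATURES: Dict[str, List[int]] = {
    "net_a": [1, 0, 1, 0, 1, 0, 1, 0],
    "net_b": [1, 1, 0, 0, 1, 1, 0, 0],
}


def correlation(observed: List[int], assigned: List[int]) -> float:
    """Normalized correlation of two logic-level waveforms mapped to +1/-1."""
    o = [2 * b - 1 for b in observed]
    a = [2 * b - 1 for b in assigned]
    return sum(x * y for x, y in zip(o, a)) / len(a)


def is_authentic(net: str, observed: List[int]) -> bool:
    """Accept the signature only if it coincides with the one assigned to `net`."""
    return correlation(observed, SIGNATURES[net]) == 1.0


if __name__ == "__main__":
    print(is_authentic("net_a", SIGNATURES["net_a"]))  # own waveform
    print(is_authentic("net_a", SIGNATURES["net_b"]))  # induced by crosstalk
    print(is_authentic("net_a", [1] * 8))              # stuck at 1
```

A waveform leaking in from another net alternates, so a checker that only looks for alternation would accept it; comparing against the net's own assigned waveform rejects it.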
The effectiveness of the conventional technology described above rests on the presumption that a fault occurring in one of the at-least dualized function blocks is independent of the other function block. In other words, it is presumed that the same fault never occurs in both function blocks at the same time. If the same fault does occur in both dualized function blocks at the same time, the faulty outputs of the two blocks match, and the fault can no longer be detected by comparing them. This becomes a serious problem when the dualized function blocks are arranged on the same semiconductor chip. Such problems may be solved by the following methods according to the invention.
The following means, called diversity, may be taken to guarantee the independence of faults occurring in the at-least dualized function blocks.
(1) Design Diversity
Design diversity is an effective means of eliminating the influence of faults caused by the design. N-Version Programming for software is especially well known. N-Version Programming is a method of concurrently executing N versions of a program that are developed to the same specifications. For hardware, design diversity can likewise be realized by developing circuits with the same specifications in N ways. With this method, however, the man-hours and expense of design and development are N times those of an ordinary method. Thus, this approach is neither effective nor desirable.
To reduce the man-hours and expense of hardware design, therefore, the following method is adopted according to this invention.
The main approach to modern hardware design is to use an HDL (Hardware Description Language) to create a file (logical description) that describes the functions and specifications of the target logic circuits, and then to use a logic synthesis tool to create from the HDL another file (logical net list) that describes the connections of those logic circuits. The logical net list is then converted by an automatic wiring tool into a file (physical net list) that describes the wiring and the layout of transistors on the actual semiconductor chip, from which the necessary masks are created and the semiconductor elements are manufactured.
In this process, the design constraints, such as delay time and occupied area, as well as the algorithms used, can be varied for logic synthesis and automatic wiring in order to diversify the resulting logical net lists and physical net lists.
The dualized function blocks can thus be realized on the semiconductor chip from the logical description of the function blocks by selecting two physical net lists from among the plural diversified physical net lists.
To select two physical net lists from among many, it is only necessary to define a correlation function that indicates how closely two physical net lists resemble each other and to select the combination of physical net lists that minimizes the correlation function. The fault characteristics of the semiconductor must be reflected in this correlation function. In general, wire intersections are regarded as a weak point of semiconductors. At a wire intersection, two wires are separated only by a thin oxide film, so short circuits between wires and faults such as crosstalk are apt to occur. Furthermore, since one wire crosses over the other at an intersection, the wire that changes level there is often broken by stress. In other words, the state of the intersections between wires affects the fault characteristics of semiconductors. A correlation function that reflects the fault characteristics of the semiconductor can thus be defined as follows.
[Formula 1]

$$\Phi_{k_1 k_2} = \sum_{i=1}^{m} \sum_{j=1}^{n} \phi_{ijk_1}\, \phi_{ijk_2} \qquad \text{(eq. 1)}$$

Here, φijk indicates whether wiring nets i and j intersect in physical net list k, and is defined as follows.

[Formula 2]

$$\phi_{ijk} = \begin{cases} 0 & \text{wiring nets } i \text{ and } j \text{ do not intersect} \\ 1 & \text{wiring nets } i \text{ and } j \text{ intersect} \end{cases}$$
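The selection of two physical net lists by minimizing the correlation function of Formula 1 can be sketched as follows. Each net list is represented, as an assumption for illustration, by a 0/1 matrix whose entry (i, j) is the intersection indicator of Formula 2; the example matrices and function names are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

Matrix = List[List[int]]  # entry [i][j] is 1 if wiring nets i and j intersect


def correlation(phi_a: Matrix, phi_b: Matrix) -> int:
    """Formula 1: number of (i, j) entries that intersect in both net lists."""
    return sum(a * b
               for row_a, row_b in zip(phi_a, phi_b)
               for a, b in zip(row_a, row_b))


def select_pair(netlists: List[Matrix]) -> Tuple[int, int]:
    """Return the indices of the two net lists with the smallest correlation."""
    return min(combinations(range(len(netlists)), 2),
               key=lambda p: correlation(netlists[p[0]], netlists[p[1]]))


if __name__ == "__main__":
    layout_0 = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]  # nets 0-1 and 0-2 cross
    layout_1 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # nets 0-1 and 1-2 cross
    layout_2 = [[0, 0, 0], [0, 0, 1], [0, 1, 0]]  # nets 1-2 cross
    print(select_pair([layout_0, layout_1, layout_2]))
```

In this toy data, the first and third layouts share no intersecting net pair, so they would be the pair chosen for the dualized function blocks.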
(2) Time Diversity
Even when the at-least dualized function blocks are designed in the same way, a fault caused by electrical noise, etc. can be prevented from affecting both function blocks identically by shifting the timing of their operations relative to each other. To produce such time diversity, the clock or input signal that establishes the operation timing is supplied to one of the dualized function blocks through a delay circuit. When the output signals of the function blocks are compared, the output signal of the other function block is passed through a delay circuit so that it can be compared in the comparison circuit with the output of the former function block.
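The effect of time diversity can be illustrated with a small cycle-level simulation. Everything in it is a hypothetical model: the function computed by the blocks, the noise pulse that flips the lowest output bit of both blocks in the same cycle, and the names used.

```python
from typing import Dict, List


def block(x: int) -> int:
    """Hypothetical function computed identically by both function blocks."""
    return (3 * x + 1) & 0xFF


def mismatch_cycles(inputs: List[int], delay: int, noise_at: int) -> List[int]:
    """Return the cycles at which the comparator sees a disagreement.

    Block B receives its input `delay` cycles later than block A. A noise
    pulse at cycle `noise_at` flips the lowest output bit of BOTH blocks.
    The comparator checks A's output, delayed by `delay`, against B's output.
    """
    n = len(inputs)
    out_a: Dict[int, int] = {}
    out_b: Dict[int, int] = {}
    for t in range(n + delay):
        glitch = 1 if t == noise_at else 0
        if t < n:
            out_a[t] = block(inputs[t]) ^ glitch          # A handles input t
        if t >= delay:
            out_b[t] = block(inputs[t - delay]) ^ glitch  # B handles input t - delay
    return [t for t in range(delay, n + delay) if out_a[t - delay] != out_b[t]]


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]
    print(mismatch_cycles(data, delay=0, noise_at=2))  # common-mode fault goes unseen
    print(mismatch_cycles(data, delay=1, noise_at=2))  # the same fault is exposed
```

Without a delay, the pulse corrupts the same result in both blocks and the outputs still agree; with a delay, it corrupts results for different inputs, so the comparison fails.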
(3) Space Diversity
By physically separating the at-least dualized function blocks from each other, it becomes possible to prevent a temporary fault caused in one function block by electrical noise, cosmic rays, radiation, etc., or a fault caused by damage to the semiconductor chip, from affecting the other function block. When a function block is dualized on a chip for self-checking within the chip, the dualized function blocks should be arranged in the same orientation and in the same pattern. The corresponding sections of the dualized function blocks are then all separated by the same distance. As a result, corresponding sections are prevented from coming excessively close to each other, which would reduce the effectiveness of the space diversity, and the effectiveness of the space diversity is maximized.
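Why the same orientation keeps corresponding sections equally far apart can be seen in a one-dimensional toy model. The model, its dimensions, and the mirrored placement used for contrast are assumptions made only for illustration.

```python
from typing import List


def corresponding_distances(section_offsets: List[int], block_width: int,
                            gap: int, mirrored: bool) -> List[int]:
    """Distance from each section of block A to its counterpart in block B.

    Block A occupies [0, block_width); block B starts at block_width + gap.
    With mirrored=False, B is a translated copy of A (same orientation).
    With mirrored=True, B is flipped, so the sections nearest the boundary
    in A face their own counterparts in B.
    """
    origin_b = block_width + gap
    distances = []
    for x in section_offsets:
        x_in_b = (block_width - x) if mirrored else x
        distances.append(origin_b + x_in_b - x)
    return distances


if __name__ == "__main__":
    offsets = [0, 5, 10]
    print(corresponding_distances(offsets, block_width=10, gap=2, mirrored=False))
    print(corresponding_distances(offsets, block_width=10, gap=2, mirrored=True))
```

In the translated placement, every pair of corresponding sections is separated by the block width plus the gap, whereas in the mirrored placement the pair nearest the boundary is separated only by the gap.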
According to this invention, design diversity, time diversity, and space diversity guarantee the independence of faults in the at-least dualized function blocks whose outputs are compared. This eliminates the correlated occurrence of the same type of fault in both dualized function blocks at the same time, so that faults can be detected by comparing the outputs of the function blocks.
A second feature of the present invention is a distributed fault tolerant system in which a plurality of computer modules are assigned to execute a plurality of tasks, comprising selection and execution means that, if a fault occurs in any of the computer modules, selects at least one computer module from among those assigned tasks other than the task of the faulty computer module, assigns to the selected computer module the task that the faulty computer module was executing, and makes the selected computer module execute that task.
Each of the computer modules of the present invention operates as follows:
(1) The computer module broadcasts its fault occurrence information (fault detection results) and process results to the other computer modules at proper timings (check points) during processing of the task.
(2) The computer modules calculate their respective evaluation functions Fij, where i is a computer module number and j is a task number. The evaluation function Fij can be regarded as a margin for the responsibility to be taken on by the computer module for the task. It is based on the equality or inequality of the fault occurrence information (fault detection results) and process results broadcast from the other computer modules.
(3) Each of the computer modules selects the task j that minimizes the evaluation function Fij as the task to execute, and then switches from the task in process to the selected task.
The evaluation function Fij represents a margin of reliability of the task. Therefore, it should be determined so that Fij can be reduced as the importance of the task is increased, Fij can be reduced as the responsibility of the computer module for the task is increased, and Fij can be increased as the reliability of the task is increased.
An example of the evaluation function Fij meeting the conditions mentioned above is
Fij = Lrj − Lthij
where Lthij is a threshold value of the reliability level of task j in the computer module i, Lrj is the reliability level of task j, i is the computer module number, and j is the task number.
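A minimal sketch of how a computer module might evaluate this margin and pick its next task is given below. The reliability levels and thresholds are invented numbers, and the function names are hypothetical; how Lrj is actually derived from the broadcast results is not modeled.

```python
from typing import List


def margin(lr_j: float, lth_ij: float) -> float:
    """Fij = Lrj - Lthij: reliability margin of task j seen from module i."""
    return lr_j - lth_ij


def choose_task(i: int, lr: List[float], lth: List[List[float]]) -> int:
    """Module i selects the task j whose margin Fij is smallest."""
    return min(range(len(lr)), key=lambda j: margin(lr[j], lth[i][j]))


if __name__ == "__main__":
    # Hypothetical current reliability level of each task.
    lr = [0.999, 0.95, 0.99]
    # Hypothetical thresholds Lthij; row = computer module, column = task.
    lth = [
        [0.99, 0.98, 0.90],
        [0.90, 0.90, 0.995],
    ]
    for module in range(len(lth)):
        print(module, choose_task(module, lr, lth))
```

With these numbers, module 0 moves to task 1, whose reliability has dropped below its threshold, while module 1 picks task 2 because its own threshold for that task is higher.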
Another example of the evaluation function Fij meeting the conditions mentioned above is
Fij = log{(1 − Lthij)/Pej}
where Pej is a probability of wrong calculation results of task j.
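The logarithmic form can be tried out numerically as follows. The model for Pej used here, in which a task result is wrong only if all of its redundant modules fail independently with the same probability, is an assumption introduced for illustration and is not stated in the text.

```python
import math


def log_margin(lth_ij: float, pe_j: float) -> float:
    """Fij = log{(1 - Lthij) / Pej}."""
    return math.log((1.0 - lth_ij) / pe_j)


def pe_all_copies_wrong(p_module: float, n_modules: int) -> float:
    # Assumption: the result is wrong only if every redundant copy is wrong,
    # and the copies fail independently with probability p_module.
    return p_module ** n_modules


if __name__ == "__main__":
    lth = 0.999   # hypothetical threshold for an important task
    p = 0.01      # hypothetical per-module probability of a wrong result
    for n in (3, 2, 1):
        print(n, log_margin(lth, pe_all_copies_wrong(p, n)))
```

Under this model the margin is positive while the task's error probability is below 1 − Lthij and turns negative once the loss of modules pushes it above that bound, which is the situation in which other modules would take the task on.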
It should be noted that Lthij, the threshold value of the reliability level of task j, differs depending on the importance of the task. It is set higher for a task of higher importance or one that requires higher reliability.
Further, Lthij must differ depending on the computer module. It must be higher the greater the responsibility of the computer module for the task.
With the second feature of the present invention, the computer modules are assigned to the tasks so that the evaluation functions Fij are always kept in balance. This prevents the Fij of any specific task from becoming much higher or lower than the others. That is, if a fault during operation leaves a specific task with a low reliability level (hereinafter referred to as an endangered task), a computer module executing another task that has a margin of reliability is made to execute the endangered task. This prevents the reliability level of that specific task alone from being lowered. For this reason, the second feature can cope with any fault occurring during execution of the tasks, so that the responsibility given to the system is fulfilled while reliability is maintained.
Also, since Lthij is set higher for a task of higher importance, its Fij balances with those of the other tasks at a higher Lrj. For this reason, a larger number of computer modules is assigned to a task of high importance, keeping its reliability level Lrj higher.
Further, since each of the computer modules can autonomously decide which task to execute, no central arrangement for assigning task execution is necessary, so there is no single point of failure. This means that a single fault will not affect the whole system, thereby making it possible to increase system reliability.