The present invention relates to a multiprocessing device in which data processing units are multiplexed to provide improved reliability.
In general, high-reliability technology follows two conceptual approaches, that is, the "fault-avoidance" approach and the "fault-tolerance" approach. The fault-avoidance concept is that faults which would cause errors are made to occur as rarely as possible. The fault-tolerance concept is that an erroneous output is not produced even when a fault occurs within the system or, even if such an erroneous output is produced, only a slight or negligible influence is exerted on the externally controlled device.
There are generally two methods for coping with a fault occurrence according to the fault-tolerance concept. One method is to completely mask the internal fault so that the system functions correctly as viewed externally, although a fault actually exists within the system. The other method is to increase the ratio of up-time (during which the system functions correctly) to down-time (during which the system does not function correctly). The former method is called "static masking", and the latter method can be considered a method for improving the availability of the system.
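The up-time to down-time ratio mentioned above can be illustrated with a short calculation. The following is a minimal sketch only; the function names and the sample figures are hypothetical and do not come from the described system. The second function expresses the same measurements as a fraction of total time, which is a common alternative way of stating availability and is included here as an assumption for comparison.

```python
def uptime_ratio(up_hours: float, down_hours: float) -> float:
    """Ratio of up-time to down-time (hypothetical helper for illustration)."""
    if down_hours <= 0:
        raise ValueError("down_hours must be positive")
    return up_hours / down_hours


def availability(up_hours: float, down_hours: float) -> float:
    """Up-time as a fraction of total time (assumed alternative measure)."""
    total = up_hours + down_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return up_hours / total


# Illustrative figures only: 990 hours up and 10 hours down.
ratio = uptime_ratio(990.0, 10.0)
fraction = availability(990.0, 10.0)
```

Shortening the down-time while the up-time is unchanged raises both values, which is the sense in which the second method improves availability.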
Hitherto, there has been known a system based on the fault-tolerance concept in which the functional modules of the system are made redundant, the majority consensus of the outputs from the redundant functional modules is determined, and the result is provided to the next functional module. In this case, even when the output of one functional module is erroneous, the erroneous output is masked, with the result that a correct input is applied to the functional module of the next stage. Namely, although faults exist within the system, this system can completely mask them (i.e., apply static masking thereto) so that the system functions correctly when viewed externally, and can prolong the time (up-time) during which the system functions correctly, thereby improving its fault-tolerance ability.
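The masking effect of majority consensus between redundant functional modules can be sketched as follows. This is an illustrative software model under stated assumptions, not the circuit of any particular system: the names `majority_vote` and `run_stage` are hypothetical, and each module is modeled as a plain function.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value agreed on by a strict majority, or None if there is none."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority requires more than half of the modules to agree.
    return value if count > len(outputs) // 2 else None


def run_stage(modules: Sequence[Callable[[int], int]], value: int) -> Optional[int]:
    """Feed the same input to every redundant module and vote on the outputs."""
    return majority_vote([module(value) for module in modules])


# Hypothetical modules: two healthy copies and one faulty copy.
def healthy(x: int) -> int:
    return x + 1


def faulty(x: int) -> int:
    return -1  # stands in for an arbitrary erroneous output


voted = run_stage([healthy, healthy, faulty], 4)
```

With three modules, a single erroneous output is outvoted and the next stage receives the correct value. If two modules fail in different ways, no majority exists and the sketch returns `None` rather than guessing.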
There has been known in the art a fault-tolerant multi-processor system, as shown in U.S. Pat. No. 4,015,246, wherein three resources (e.g., processors or memories) are grouped together as one unit (which is called a "triad", i.e., three sets). Data transfer between the resources of the triad is carried out by using the result of the majority decision of the triad, so as to reduce the failure rate of the system. In this system, a triad of processors and a triad of memories are connected by a plurality of (three or more) buses. The individual processors of the triad take the majority consensus of the inputs from the plurality of buses into the respective processors, and the individual memories of the triad take the majority consensus of the inputs from the plurality of buses into the respective memories. In this instance, determining the majority consensus requires that the triad of processors or the triad of memories operate in synchronism at the clock level. When a single clock generator common to all the resources is used, there is the possibility that the entire system goes down due to a fault in this clock generator. Accordingly, each resource determines the majority consensus of the clock signals on the plurality of buses, thereby obtaining an internal clock signal. Further, since the plurality of buses are connected commonly to the triad of processors and the triad of memories, these buses are isolated by duplexed bus guardians and bus isolation gates so that they are not polluted by the output of a failed resource.
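The majority consensus that each resource takes over the inputs from three buses can be modeled bit by bit. The sketch below assumes exactly three buses carrying integer words and assumes clock lines sampled as 0 or 1; the function names are hypothetical and are not taken from the cited patent.

```python
from typing import List, Sequence


def bitwise_majority(a: int, b: int, c: int) -> int:
    """Vote each bit position independently across three bus words."""
    # A result bit is 1 exactly when at least two of the three inputs have a 1 there.
    return (a & b) | (b & c) | (a & c)


def voted_clock(bus_a: Sequence[int], bus_b: Sequence[int], bus_c: Sequence[int]) -> List[int]:
    """Derive an internal clock from three sampled clock lines (assumed 0/1 samples)."""
    return [bitwise_majority(x, y, z) for x, y, z in zip(bus_a, bus_b, bus_c)]


# One bus delivers a corrupted word; the other two agree.
word = bitwise_majority(0b1011, 0b1011, 0b0001)

# The third clock line is stuck at 0, yet the voted clock still toggles.
clock = voted_clock([0, 1, 0, 1], [0, 1, 0, 1], [0, 0, 0, 0])
```

Because the vote compares the three inputs at the same instant, the model only makes sense when the sources are aligned in time, which corresponds to the clock-level synchronism required of the triad.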
The conventional fault-tolerant multi-processing device stated above is configured so that all resources are interconnected by a plurality of (three or more) buses, and so that each resource is provided with a logic circuit for determining the majority consensus of the inputs from all the buses, and is further provided with a bus guardian and a bus isolation circuit for preventing the buses from being polluted. For this reason, the drawback of such a conventional device is that the amount of hardware needed for the buses and the bus input/output control units increases as the degree of multiplexing of the resources or the number of buses increases.
Further, in view of the processing ability of such a device, a large number of buses imposes the limitation that a wide bit width cannot be expected for each bus, with the result that the processing ability of the processor itself is lowered. In addition, since the input/output control unit of each resource is provided with the logic circuit for determining the majority consensus, the bus guardian and the bus isolation circuit, there is the shortcoming that the memory cycle time required for the processor to access the memory is prolonged, i.e., the processing ability of the entire system is lowered.
In multi-processing devices operating in synchronism with a clock, synchronization is required at the time the system is started, or at the time a processor temporarily separated from the system due to the occurrence of a fault is recombined with the system. The essential condition for this is that the contents of the memories to be subjected to synchronization are the same, and that the information (FFG, REG, flags, etc.) within the processors to be subjected to synchronization is the same. Under this condition, the respective resources are synchronized with each other. It has been known in the art that copying between memories using buses commonly connected to the respective resources is relatively easy. Further, for copying the information within a particular processor (master processor) into another processor (slave processor), a method is already known in which a direct communication is conducted from the master processor to the slave processor using buses commonly connected to the respective processors, thereby transferring the information. However, for synchronization at the clock level, it is practically difficult to execute such a direct communication while guaranteeing synchronization between the processors, because the processing performed in each processor differs from that in the others.
Ordinarily, two cases require such synchronization at the clock level. The first case is a synchronous starting of the system from the condition where the system is down. The second case arises when, while the system is functioning normally on the basis of the majority consensus determination, a processor that has failed because its data was corrupted or intermittently became abnormal due to noise or the like is put back into synchronization with the other normal processors after it has recovered. In the former case of the synchronous starting of the system, a countermeasure can be taken to concurrently transfer data to the multiplexed processors or memories, or to start them together, by means of an external service processor using a common bus. On the other hand, in the latter case of the recovery and synchronization of failed processors and failed memories, when the copying of data from normal processors/memories to a failed processor/memory and the associated timing adjustment are carried out while continuing the processing required of the system, on the basis of a simple direct communication or a start-and-response scheme utilizing lead wires interconnected between the processors, the hardware construction or the starting procedures become complicated.