1. Field of the Invention
The present invention generally relates to a multi-processor data processing system in which a plurality of processors operate cooperatively, and more particularly to fault detection in a multicast system in which a master processor and a plurality of slave processors are mutually synchronized and an identical message must be transferred from the single master processor to a number of slave processors.
2. Description of the Background Art
There have been a number of proposals and examples of a computer system in which a plurality of processors operate cooperatively, as an answer to the demand for a higher processing speed and a larger processing capacity, which has been increasing constantly along with the development of computer technology.
In such a multi-processor data processing system, the efficiency of data transfer among the processors is a critical factor in determining the overall system performance, because a plurality of network linked processors must operate while mutually exchanging necessary data. Also, for correctly executing a plurality of processes to be carried out in a plurality of processors, it is necessary to have a data transfer method in which a transfer error can be prevented. In particular, in a multi-processor configuration called a master-slave mechanism, in which the processes are executed by transferring messages from a single master processor to a number of slave processors, it is necessary to carry out a multicast operation in which the same data is transferred to a plurality of slave processors at high speed and with high reliability.
There have been a number of proposals for realizing such a multicast system depending on the different characteristics of different systems, which can be classified roughly into the following two types.
(1) Sequential transfer method
(2) Shared memory method
In the sequential transfer method, used in a loosely linked multi-processor system such as that shown in FIG. 1 to be described in detail below, in which a plurality of processors 1-0 to 1-n are linked through a network 2, a multicast operation is achieved as a series of one-to-one transfers. Namely, in a multicast operation, the process of transferring data from a master processor 1-0 to one of slave processors 1-1 to 1-n and then receiving a reception completion message from that one of the slave processors 1-1 to 1-n at the master processor 1-0 is repeated for each one of the slave processors 1-1 to 1-n, as many times as the number of slave processors 1-1 to 1-n.
In this method, the process is cumbersome: for n slave processors 1-1 to 1-n, one multicast operation requires 2n message transfers, so that the multicast operation considerably increases the traffic on the network 2 and significantly degrades the overall performance of the system.
Moreover, in this method, when one of the slave processors fails to receive the data for some reason, the data transfers to the other slave processors ready to receive the data are also stopped at that point, so that the transfer efficiency is not very high.
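The message count and the stall behavior of the sequential transfer method can be illustrated with a minimal sketch. This is not an implementation taken from any described system: the function name `sequential_multicast` and the modeling of each slave processor as a callable returning `True` on successful reception are assumptions made purely for illustration.

```python
from typing import Callable, Dict, List


def sequential_multicast(data: str, slaves: List[Callable[[str], bool]]) -> Dict[str, int]:
    """Hypothetical model: the master serves one slave at a time.

    Each slave is modeled as a callable that returns True when it
    receives the data (and would then send a reception completion
    message), or False when it fails to receive.
    """
    messages = 0   # total message transfers on the network
    delivered = 0  # slaves that received the data
    for receive in slaves:
        messages += 1          # data message: master -> slave
        if not receive(data):
            break              # no completion message arrives; later slaves are not served
        messages += 1          # reception completion message: slave -> master
        delivered += 1
    return {"messages": messages, "delivered": delivered}


def healthy(_data: str) -> bool:
    """Assumed slave that always receives successfully."""
    return True


def failing(_data: str) -> bool:
    """Assumed slave that fails to receive."""
    return False
```

Under these assumptions, four healthy slaves cost 8 message transfers (2n with n = 4), while a failure at the second of four slaves leaves the third and fourth unserved even though they were ready to receive.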
In the shared memory method, used in a densely linked multi-processor system such as that shown in FIG. 2 to be described in detail below, in which a plurality of processors 1-0 to 1-n are linked by sharing a shared memory 5, a multicast operation is achieved in such a manner that the master processor 1-0 writes the transfer data into the shared memory 5 and then each of the slave processors 1-1 to 1-n looks up the data in the shared memory 5.
In this method, the number of messages to be transferred is smaller than in the sequential transfer method described above, but there has been a problem related to the reliability of the data transfer. Namely, it becomes impossible to guarantee correct data transfer in a case in which the data in the shared memory 5 are overwritten by new data before it is confirmed that the data have been received by all of the slave processors 1-1 to 1-n.
Moreover, even when there is a slave processor which does not look up the data in the shared memory 5 because of some malfunction, the master processor 1-0 cannot detect the existence of such a malfunctioning slave processor, so that there has been a tendency for a fault recovery operation to be delayed.
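The overwrite hazard of the shared memory method can be shown with a similarly small sketch. The single-slot `SharedMemory` class, the `simulate` function, and the notion of a "slow" slave that skips the first polling round are all hypothetical constructs introduced here for illustration only.

```python
from typing import Dict, List, Optional, Set


class SharedMemory:
    """Hypothetical single-slot shared memory with no reception confirmation."""

    def __init__(self) -> None:
        self.slot: Optional[str] = None

    def write(self, data: str) -> None:
        # The master overwrites the slot without knowing who has read it.
        self.slot = data

    def read(self) -> Optional[str]:
        return self.slot


def simulate(messages: List[str], slaves: List[int], slow: Set[int]) -> Dict[int, List[Optional[str]]]:
    """Master writes each message in turn; slaves poll after each write.

    Slaves listed in `slow` skip polling in the first round, standing in
    for a slave that has not yet looked up the data when the master
    writes again.
    """
    mem = SharedMemory()
    seen: Dict[int, List[Optional[str]]] = {s: [] for s in slaves}
    for round_no, data in enumerate(messages):
        mem.write(data)
        for s in slaves:
            if s in slow and round_no == 0:
                continue  # this slave misses the first round entirely
            seen[s].append(mem.read())
    return seen
```

With two messages and slave 3 marked slow, slaves 1 and 2 see both messages while slave 3 sees only the second. Nothing in the master's side of this sketch reveals that the first message was lost, which mirrors the detection problem described above.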
In addition, in a multicast system having a multi-processor configuration, because a plurality of processors must operate cooperatively, it is critically important for the stability of the system performance to detect a malfunction of an individual processor as soon as possible. In the case of the above described master-slave type multicast system, because the fault recovery procedure differs between a malfunction in the master processor 1-0 and a malfunction in the slave processors 1-1 to 1-n, it is particularly preferable to have a mechanism for identifying the malfunctioning processor in addition to a fast malfunction detection mechanism.
However, no practically effective method for such malfunction detection has been available in a conventional multicast system having a multi-processor configuration.
Moreover, in a conventional multicast system, even if it is possible to detect the malfunction within each individual processor itself, there has been a possibility for the faulty operations due to the malfunction of that processor to affect the operations of the other processors.
In particular, in a case of the master-slave type multicast system described above, when the master processor 1-0 happens to malfunction, the operations of the slave processors 1-1 to 1-n which operate by receiving necessary data from the master processor 1-0 are likely to be affected by this malfunction of the master processor 1-0.
Thus, conventionally, it has been impossible to detect and identify the malfunction in the processor which is frequently communicating with the other processors in order to apply a necessary fault recovery procedure.