1. Field of the Invention
The present invention relates to a multiprocessor system having a redundant shared memory configuration, which includes a plurality of processors and also a plurality of shared system memories that can be used in common by the plurality of processors and which allows the shared system memories to have redundancy in writing data in these memories.
More specifically, the present invention relates to a multiprocessor system which ensures that the contents of each pair of shared system memories are equivalent to each other, for example, in the case where the same data is written in each pair of shared system memories having a dual shared memory configuration.
2. Description of the Related Art
In recent years, it has become necessary to process a relatively large amount of data at high speed and with high reliability, especially in the field of data communication systems using a computer system. To satisfy this requirement, a multiprocessor system has been developed, which is constituted by a plurality of processors each including a central processing unit (usually abbreviated to CPU). By effectively utilizing a plurality of central processing units, such a multiprocessor system has a much higher data processing ability than a single processor.
Further, in the above-mentioned multiprocessor system having a plurality of processors, even if a certain processor has failed during operation, any other processor can continue to process the data in place of the failing one. Namely, the above-mentioned multiprocessor system has a redundant configuration in regard to the processors, which can provide a fault tolerant computer system.
Further, to make such a fault tolerant computer system more complete and to ensure the data integrity of the whole multiprocessor system, it appears indispensable that the shared system memories provided for supporting high-speed data processing also have a redundant shared memory configuration, e.g., a dual memory configuration.
More specifically, in regard to these dual shared system memories, the data stored in one memory of each pair must be equivalent to the data stored in the other, so as to ensure conformity of the respective data, especially when the same data is to be written in each pair of dual shared system memories.
However, in general, conformity of the respective data may fail to be ensured mainly in the following three cases. Here, to simplify the explanation, it is assumed that a multiprocessor system having a number of processor modules includes only one pair of dual shared system memory modules.
(1) The first is the case in which a write operation in one module of the dual shared system memory modules is finished in normal termination, while a write operation in the other module of the dual shared system memory modules is finished in abnormal termination, when a write access to these dual memory modules is carried out by a given processor module. Namely, data is not completely written yet in the other module of the dual shared system memory modules.
However, in this case, the above-mentioned processor module by which the write access was carried out still continues to operate. By means of an abnormal termination message, the processor module can recognize the specified address to which the write access has failed, and can therefore assuredly rewrite the data corresponding to the specified address by executing a data recovery process. Consequently, it can finally be ensured that the data written in one of the dual shared system memory modules is equivalent to the data reconstructed by the data recovery process in the other one, and the problem concerning the above-mentioned first case does not become so serious practically.
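The address-specific recovery described for the first case can be illustrated by the following minimal sketch. It is not taken from any actual system: the class MemoryModule, the flag fail_next_write, the function dual_write_with_retry and the retry limit are all hypothetical names assumed for illustration only, and a boolean return value stands in for the normal or abnormal termination message.

```python
class MemoryModule:
    """Hypothetical model of one shared system memory module."""

    def __init__(self):
        self.cells = {}
        # Assumed hook that forces exactly one abnormal termination.
        self.fail_next_write = False

    def write(self, address, value):
        """Return True on normal termination, False on abnormal termination."""
        if self.fail_next_write:
            self.fail_next_write = False
            return False
        self.cells[address] = value
        return True


def dual_write_with_retry(modules, address, value, max_retries=3):
    """Write the same value to every module; rewrite only a failed address.

    Returns the number of recovery rewrites that were needed.
    """
    rewrites = 0
    for module in modules:
        ok = module.write(address, value)
        retries = 0
        while not ok and retries < max_retries:
            # The abnormal termination identifies `address`, so the
            # still-running processor rewrites that one address only.
            ok = module.write(address, value)
            retries += 1
            rewrites += 1
        if not ok:
            raise RuntimeError(f"write to address {address:#x} could not be recovered")
    return rewrites


module_a, module_b = MemoryModule(), MemoryModule()
module_b.fail_next_write = True  # second module terminates abnormally once
rewrites = dual_write_with_retry((module_a, module_b), 0x10, 42)
```

In this sketch the recovery touches a single address, which is why the first case involves no full copy of the memory content.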
(2) The second is the case in which at least one of the dual shared system memory modules determines that it is impossible to continue to perform a normal operation due to a contradiction which has occurred within the shared system memory module per se. In this case, since the shared system memory module can no longer assuredly preserve the data that was once stored therein, the memory module stops operating from that time and assumes a state of "HALT" (hereinafter, a state of "HALT" will be simply referred to as HALT).
Here, the contradiction in the shared system memory module per se means a logical contradiction which generally occurs when hardware of the shared system memory module is brought out of control. More concretely, examples of that type of contradiction include an abnormality of a sequencer in a system bus controller, which is a connecting unit to a system bus and which will be described hereinafter, and an abnormality of another sequencer in a memory controller in the shared system memory module.
In this case, the data that was stored in the shared system memory module assuming HALT is not reliable at all. Accordingly, to assuredly carry out a data recovery process for this type of shared system memory module assuming HALT, it is necessary to copy or duplicate all the content of the other shared system memory module, which is in a normal state, to the shared system memory module assuming HALT. Such a copy process or duplication process is usually executed after the shared system memory module assuming HALT has been brought into a state in which a normal operation thereof can be performed.
For example, in the case where the shared system memory module assumes HALT due to a recoverable trouble that has temporarily occurred through an error of a software type, the normal state can be restored by resetting the memory module and thereby canceling the state of HALT. On the contrary, in the case where the shared system memory module assumes HALT due to a serious trouble that has occurred permanently through an error of a hardware type and is usually difficult to remedy, the normal state can be restored only by replacing the memory module assuming HALT with a new memory module.
Generally, in carrying out the above-mentioned copy process of all the content of the normal shared system memory module, the larger the storage capacity of the shared system memory module becomes, the longer it takes to complete the copy process. Therefore, a system bus of a multiprocessor system is likely to be occupied by the copy access of the processor module executing such a copy process. Further, in the case where a write access is carried out by some other processor module with respect to the shared system memory module in which such a copy process is being executed, the copy access command from the copying processor module is likely to contend with the write access command from the other processor module. As a result of such a contention, when the copy process is completed, a disadvantage may occur in that the data stored in the shared system memory module by the copy process is not always entirely equivalent to that of the other, normal shared system memory module.
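The following sketch illustrates one possible interleaving by which such a contention could leave the two modules unequal. The text above does not specify the exact mechanism, so the ordering assumed here (a write access by another processor module landing between the read and the write of a single copy step) is an assumption made only for illustration, as are the function name and its parameters. Each memory module is modeled as a plain dictionary.

```python
def copy_with_interleaved_write(source, target, interleave_at=None, new_value=None):
    """Copy every word from `source` to `target`, one address at a time.

    If `interleave_at` is given, another processor's dual write is assumed
    to land between the read and the write of the copy step for that
    address. Returns True if both modules are equal when the copy ends.
    """
    for address in sorted(source):
        staged = source[address]      # copy access: read from the normal module
        if address == interleave_at:
            # Assumed contention: the other processor updates both modules now.
            source[address] = new_value
            target[address] = new_value
        target[address] = staged      # copy access: write the staged (possibly stale) word
    return source == target


normal = {0: "a", 1: "b", 2: "c"}

# Undisturbed copy: the modules end up equivalent.
clean_result = copy_with_interleaved_write(dict(normal), {})

# Copy disturbed at address 1: the target keeps the stale word "b".
source, target = dict(normal), {}
contended_result = copy_with_interleaved_write(source, target, interleave_at=1, new_value="B")
```

Under this assumed ordering, the normal module holds the new word while the copied module holds the old one, even though the copy process itself completed without any reported error.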
However, in almost every instance of the above-mentioned second case, one of the dual shared system memory modules stops operating and assumes HALT due to a trouble that has occurred through some error of a hardware type. In such a case, practically, it becomes necessary to replace the shared system memory module in the state of HALT with a new system memory module, and then to copy all the data of the other one of the dual shared memory modules to the new system memory module after the replacement. Namely, to deal with the shared system memory module in the state of HALT, troublesome work, such as the replacement of the abnormal memory module, is unavoidable.
Fortunately, it should be noted that the probability that a shared system memory module per se assumes HALT due to some error of a hardware type is extremely low, and that the trouble concerning the above-mentioned second case does not become so serious practically.
(3) The third is the case in which a certain processor module among a plurality of processor modules assumes HALT during a write operation to the dual shared system memory modules; namely, the case in which a write operation is completed in one of the dual shared system memory modules, while a write operation is not yet completed in the other one thereof, at the time when this processor module assumes HALT.
Heretofore, when a certain processor module assumes HALT in a multiprocessor system having dual shared memory modules, it has not been confirmed whether or not the processor module in the state of HALT was carrying out a write access to the dual shared memory modules. Therefore, it cannot be known whether or not the respective data written in the dual shared memory modules are equivalent to each other. Consequently, even in the case where a processor module that was not carrying out a write access to the dual shared memory modules assumes HALT, the multiprocessor system has been forced to conclude that conformity of the respective data in the dual shared system memory modules is uncertain and cannot be ensured.
If conformity of the respective data cannot be ensured as mentioned above, it must be assumed that the respective data in the dual shared system memory modules are not equivalent to each other. In this case, to carry out a data recovery process, in a manner similar to the above-mentioned second case in which one of the dual shared system memory modules assumes HALT, all the data of one of the dual shared memory modules (the normal system memory module) is copied to the other one of the system memory modules.
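The third case and the conventional recovery for it can be illustrated by the following minimal sketch. The names ProcessorHalt, dual_write_may_halt and full_copy, the memory size of 1024 words and the modeling of each memory module as a dictionary are all assumptions made for illustration; they do not describe an actual implementation.

```python
class ProcessorHalt(Exception):
    """Hypothetical signal that the writing processor module assumed HALT."""


def dual_write_may_halt(mem_a, mem_b, address, value, halt_between=False):
    """Write to both modules in turn; optionally halt after the first write."""
    mem_a[address] = value            # write completes in the first module
    if halt_between:
        raise ProcessorHalt(address)  # processor halts before the second write
    mem_b[address] = value


def full_copy(normal, suspect):
    """Conventional recovery: copy every word, since the stale address is unknown.

    Returns the number of words copied.
    """
    copied = 0
    for address, value in normal.items():
        suspect[address] = value
        copied += 1
    return copied


mem_a = {address: 0 for address in range(1024)}
mem_b = dict(mem_a)

try:
    dual_write_may_halt(mem_a, mem_b, 7, 99, halt_between=True)
except ProcessorHalt:
    pass  # the remaining system only learns that the processor halted

# Only one word actually differs between the two modules ...
mismatched = [a for a in mem_a if mem_a[a] != mem_b[a]]

# ... but without knowing which one, the whole content is copied.
words_copied = full_copy(mem_a, mem_b)
```

In this sketch a single word differs, yet all 1024 words are copied, which reflects the disproportion between the actual inconsistency and the cost of the conventional recovery.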
In carrying out a copy process of all the data of one of the dual shared memory modules, the problems described in the second case exist. More specifically, a first problem is that it takes a relatively long time to complete the copy process; a second problem is that a system bus of the multiprocessor system is occupied by the copy access of a certain processor module; and a third problem is that a contention of the copy access command with a write access command occurs between two processor modules, in the case where a write access is carried out by some other processor module. Further, a state of HALT in the processor module is brought about not only by some error of a hardware type, but also by some error of a software type. Actually, in almost every case, the processor module assumes HALT due to an error of a software type, unlike the case of a state of HALT in the shared system memory module per se.
Therefore, the probability that the processor module assumes HALT is much higher than the probability that the shared system memory module assumes HALT. Consequently, the problems concerning the above-mentioned third case become very serious.