The present invention generally relates to methods of recovering exclusive control instructions and multi-processor systems using the same, and more particularly to a method of recovering an exclusive control instruction related to dual shared memories which are shared by a plurality of processors, and to a multi-processor system using such a method.
A multi-processor system executes separate processes in a plurality of processors so as to improve the performance of the system as a whole. In such a multi-processor system, exclusive control must be carried out among the processors for the purpose of avoiding contention for a resource.
As a method of realizing the exclusive control, there is a system which provides an area for exclusive control in a shared memory which is accessible from all of the processors, and gives the right to use the resource exclusively depending on the content within this area. According to this system, all of the processors carry out the same procedure, that is, the content is first read, a check is made as to whether or not the conditions match, and the write is made if the conditions match. Access from another processor is prohibited during this series of operations. In addition, when the shared memory has a dual construction, the shared memory modules from which all of the processors make the read operation must have the same construction, and the rewrite operation must be made with respect to both the shared memory modules forming the dual construction.
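The read, check and write procedure described above may be illustrated by the following minimal sketch. It is an illustration only and is not taken from the system described here: the class name ExclusiveControlArea, the method read_check_write, the encoding in which 0 means the resource is free, and the owner tags 1 and 2 are all assumptions, and a software lock merely stands in for the prohibition of accesses from other processors.

```python
import threading


class ExclusiveControlArea:
    """Hypothetical model of one exclusive control area in a shared memory."""

    FREE = 0  # assumed encoding: 0 means that no processor holds the resource

    def __init__(self):
        self._value = self.FREE
        # Stands in for prohibiting accesses from other processors
        # during the series of read, check and write operations.
        self._access_lock = threading.Lock()

    def read_check_write(self, expected, new_value):
        """Write new_value only if the content read equals expected."""
        with self._access_lock:
            if self._value != expected:
                return False  # conditions do not match: exclusive control failed
            self._value = new_value
            return True

    def read(self):
        with self._access_lock:
            return self._value


area = ExclusiveControlArea()
# Two hypothetical processors, tagged 1 and 2, try to acquire the same resource.
first = area.read_check_write(ExclusiveControlArea.FREE, 1)
second = area.read_check_write(ExclusiveControlArea.FREE, 2)
```

In this sketch only the first attempt succeeds, because the second attempt reads content which no longer matches the expected value.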
The present invention relates to the method of recovering the exclusive control instruction in a computer system which is provided with the shared memory modules having the dual construction, when an abnormal termination of the exclusive control instruction occurs due to intermittent (occasional) failure or the like of the system.
FIG. 1 is a diagram for explaining the operation of executing the exclusive control instruction in a conventional computer system which includes shared memories having the dual construction. FIG. 1 shows a case where data are written into the shared memories.
In FIG. 1, a shared memory module 171 includes a shared memory unit 171a and a bus connection unit 171b, and a shared memory module 172 includes a shared memory unit 172a and a bus connection unit 172b. These shared memory modules 171 and 172 form the dual construction. The shared memory units 171a and 172a are coupled to a system bus 173 via the respective bus connection units 171b and 172b. For example, the shared memory module 171 has a unit identification (ID) which is ID=0000000, and the shared memory module 172 has a unit ID which is ID=0000001.
On the other hand, a processor module 174 includes a central processing unit (CPU) 174a, a main memory 174b and a bus connection unit 174c which connects the system bus 173 to the CPU 174a or the like. For example, the processor module 174 has a unit ID which is ID=1100000. In practice, a plurality of such processor modules 174 are connected to the system bus 173.
In FIG. 1, when the CPU 174a of the processor module 174 executes an exclusive control instruction, the read operation with respect to the shared memory module is first recognized by the bus connection unit 174c, and the bus connection unit 174c makes access to the shared memory module 171 in the master system (hereinafter simply referred to as the master shared memory module 171) having the ID=0000000 to make the read operation. The data in an exclusive control region of the master shared memory module 171 are read out as indicated by "(1)R" in FIG. 1, and the processor module 174 judges whether or not the read data match an expected value. If the data do not match, the exclusive control is regarded as having failed, and this instruction is terminated so that the process advances to the next instruction.
On the other hand, when the data match, the operation of rewriting the contents of the shared memory modules 171 and 172 is started. When the bus connection unit 174c recognizes this rewriting operation, an access is first made to the master shared memory module 171 having the ID=0000000 to make a write operation as indicated by "(2)MW" in FIG. 1. After receiving a notification indicating a normal termination, an access is made to the shared memory module 172 of a slave system (hereinafter simply referred to as the slave shared memory module 172) having the ID=0000001 to make a write operation as indicated by "(3)SW" in FIG. 1. When a notification indicating a normal termination of the write operation to the slave shared memory module 172 is received, the bus connection unit 174c notifies the termination to the CPU 174a, and the exclusive control instruction is terminated.
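The sequence "(1)R", "(2)MW" and "(3)SW" of FIG. 1 may be sketched as follows. This is a simplified model under stated assumptions, not the actual operation of the bus connection unit 174c: each shared memory module is represented by a plain dictionary, the function name execute_exclusive_control and the address 0x10 are hypothetical, and every access is assumed to terminate normally.

```python
MASTER_ID = 0b0000000  # unit ID of the master shared memory module 171
SLAVE_ID = 0b0000001   # unit ID of the slave shared memory module 172


def execute_exclusive_control(modules, address, expected, new_value):
    """Model one exclusive control instruction; return (acquired, trace)."""
    trace = []

    # (1)R: read the exclusive control region of the master module only.
    data = modules[MASTER_ID].get(address, 0)
    trace.append("(1)R")
    if data != expected:
        # No match: the instruction terminates without rewriting anything.
        return False, trace

    # (2)MW: write to the master module first.
    modules[MASTER_ID][address] = new_value
    trace.append("(2)MW")

    # (3)SW: after the master write terminates normally, write to the slave.
    modules[SLAVE_ID][address] = new_value
    trace.append("(3)SW")
    return True, trace


# Hypothetical address 0x10; an unwritten cell is assumed to read as 0.
modules = {MASTER_ID: {}, SLAVE_ID: {}}
acquired, trace = execute_exclusive_control(modules, 0x10, expected=0, new_value=1)
again, again_trace = execute_exclusive_control(modules, 0x10, expected=0, new_value=1)
```

In this model the second call stops after "(1)R", because the master shared memory module already holds the rewritten data.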
The bus connection unit 174c makes an abnormality notification to the CPU 174a in synchronism with the exclusive control instruction when an abnormality is detected during the read access to either the master or slave shared memory module 171 or 172, the write access to the master shared memory module 171, or the write access to the slave shared memory module 172. In response to this abnormality notification, the CPU 174a checks the contents of the abnormality notification, and the process is continued by retrying (or re-executing) the exclusive control instruction if a recovery is possible. But if the recovery is impossible, the CPU 174a judges by itself that the operation is impossible and comes to a halt, so that a macro recovery can be made using other processor modules.
For example, the recovery becomes impossible when the abnormal termination of the exclusive control instruction occurs even though the write operation has been made to the master shared memory module 171, or when the rewriting to the master shared memory module 171 cannot be guaranteed. In such cases, even if the exclusive control instruction is retried, the data in the master shared memory module 171 have already been rewritten, and the data read from the master shared memory module 171 will not match the expected value.
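The unrecoverable case described above may be reproduced with the following sketch, in which the write to the slave shared memory module fails once after the write to the master shared memory module has already terminated normally. The names BusFault, Module and exclusive_control, and the fail_writes counter used to inject the intermittent failure, are assumptions made purely for illustration.

```python
class BusFault(Exception):
    """Hypothetical stand-in for the abnormality notification to the CPU."""


class Module:
    """Hypothetical shared memory module which can fail a number of writes."""

    def __init__(self, fail_writes=0):
        self.cells = {}
        self.fail_writes = fail_writes  # number of upcoming writes that fail

    def read(self, address):
        return self.cells.get(address, 0)  # an unwritten cell reads as 0

    def write(self, address, value):
        if self.fail_writes > 0:
            self.fail_writes -= 1
            raise BusFault("write did not terminate normally")
        self.cells[address] = value


def exclusive_control(master, slave, address, expected, new_value):
    """Read the master, compare, then write the master and the slave."""
    if master.read(address) != expected:
        return False
    master.write(address, new_value)
    slave.write(address, new_value)
    return True


master = Module()
slave = Module(fail_writes=1)  # one intermittent failure on the slave write

try:
    exclusive_control(master, slave, 0x10, expected=0, new_value=1)
    first_attempt = "normal"
except BusFault:
    first_attempt = "abnormal"

# The slave would now accept the write, but the retry never reaches it,
# because the master already holds the rewritten data.
retry_result = exclusive_control(master, slave, 0x10, expected=0, new_value=1)
```

In this sketch the retry fails even though the intermittent failure has disappeared, and the two shared memory modules are left with different contents.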
When the recovery is impossible, the CPU of the processor module concerned halts itself and expects other processor modules to make the recovery. When the other processor modules detect the halt of this CPU, the other processor modules release the exclusive control region acquired by this CPU. In other words, the other processor modules write back the contents of the exclusive control region of the master shared memory module, and the resource becomes usable again by carrying out this process. In addition, the process of the halted CPU is retried from the start by the CPUs of the other processor modules based on inherited information stored in the shared memory module.
Therefore, when the abnormal termination of the exclusive control access cannot be recovered, the conventional system has problems in that the processor module must be halted and the recovery must be made by the other processor modules, so that the load of the recovery process is large.