A computer system that requires high reliability includes a currently-active system computer, which executes a process (application), and a standby system computer, which can take over the processing if a malfunction occurs in the currently-active system. The procedure executed from the detection of a malfunction in the currently-active system until the standby system takes over the processing is provided by a cluster program. When the application uses data on a disk, the disk is shared between the currently-active system and the standby system. For the standby system to take over the processing when a malfunction occurs in the currently-active system, it is necessary to select a standby system from among the cluster computers and, among the resources used by the application and the operating system (OS), to take over those resources that cannot be used by more than one system at the same time (shared resources), such as a shared disk and an IP address. To achieve higher reliability, it is also necessary to ensure that the currently-active system and the standby system do not use a shared resource at the same time when a malfunction interrupts the path over which the standby system monitors the currently-active system for malfunctions (network split).
Many cluster programs select the standby system that takes over a process by having it exclusively take over a shared disk in the cluster. As examples, reference is made to patent document 1 and non-patent document 1, listed below.
Patent document 1 describes a technique that uses a mechanism by which a standby system can stop the currently-active system: the standby system resets the currently-active system so that the shared resource owned by the currently-active system is released, and the standby system then takes ownership of the released shared resource, thereby controlling the shared resource exclusively.
Non-patent document 1 describes a technique in which, when a malfunction occurs in a currently-active system and a failover is performed, a cluster program uses the RESERVE and RESET commands, among the available SCSI commands, to exclusively control the access right to a shared disk. Here, RESERVE is a command for reserving the access right to a disk: a disk reserved by one computer denies both access and RESERVE requests from any other computer. RESET is a command for releasing the access right to a disk, so that the reservation on a reserved disk is released.
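The RESERVE and RESET semantics described above can be illustrated with a minimal sketch. The `SharedDisk` class and its method names are hypothetical and only model the behavior stated in the text; they are not a SCSI API and do not reflect how any particular cluster program issues the commands.

```python
class SharedDisk:
    """Toy model of a disk with one exclusive reservation (hypothetical, not a SCSI API)."""

    def __init__(self):
        self.owner = None  # name of the computer holding the reservation, if any

    def reserve(self, computer):
        """Grant the reservation if the disk is free or already held by `computer`."""
        if self.owner is None or self.owner == computer:
            self.owner = computer
            return True
        return False  # another computer holds the reservation, so RESERVE is denied

    def reset(self):
        """Release the current reservation, whoever holds it."""
        self.owner = None

    def access(self, computer):
        """Access is allowed only if the disk is unreserved or reserved by `computer`."""
        return self.owner is None or self.owner == computer


disk = SharedDisk()
results = []
results.append(disk.reserve("active"))   # the currently-active system reserves the disk
results.append(disk.access("standby"))   # access from the standby system is denied
results.append(disk.reserve("standby"))  # RESERVE from the standby system is denied
disk.reset()                             # the standby system forces the release
results.append(disk.reserve("standby"))  # the standby system can now reserve the disk
print(results)
```

In this model, the standby system can obtain the disk only after a RESET has released the reservation held by the currently-active system.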
[Patent document 1] U.S. Pat. No. 6,138,248
[Non-patent document 1] Microsoft, Support Technical Information, 309186 (online, http://support.microsoft.com/kb/309186/en-us)
In patent document 1, when the standby system in a cluster computer system cannot monitor the currently-active system, it stops the currently-active system to obtain exclusive control of the shared resource. When a network split occurs in a cluster of two computers, each of which serves as the other's standby system, each system resets the other, so that all of the systems may be reset. The processing is then suspended at the time of the network split, so high availability cannot be achieved.
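The mutual-reset problem can be sketched as follows. The function name and the `reachable` mapping are hypothetical, and the model assumes that every node decides to reset its unreachable peers from the same pre-split view, before any reset takes effect.

```python
def network_split_mutual_reset(nodes, reachable):
    """Return the set of nodes still running after each node resets every peer it cannot monitor.

    `reachable[node]` is the set of peers that `node` can still monitor (assumed input).
    All reset decisions are collected first, modeling resets issued at the same time.
    """
    to_reset = set()
    for node in nodes:
        for peer in nodes:
            if peer != node and peer not in reachable[node]:
                to_reset.add(peer)  # `node` cannot monitor `peer`, so it resets it
    return set(nodes) - to_reset


# Two computers that are each other's standby, fully separated by a network split.
remaining = network_split_mutual_reset(
    ["computer_a", "computer_b"],
    {"computer_a": set(), "computer_b": set()},
)
print(remaining)
```

Under these assumptions no computer remains running, which corresponds to the suspension of processing described above.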
Consider next a configuration in which the standby system resets the currently-active system but the currently-active system does not reset the standby system. Take a cluster consisting of a currently-active system and two standby systems capable of taking over from it (standby systems 1 and 2). If a network split separates the two computers consisting of the currently-active system and standby system 1 from standby system 2, standby system 2 resets the currently-active system and performs a failover. When the currently-active system is reset by standby system 2, standby system 1 also detects a malfunction of the currently-active system and performs a failover. As a result, standby systems 1 and 2 become currently-active systems at the same time, causing double access to the shared resource.
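The three-computer scenario above can be traced with a small sketch. The function, the node names, and the representation of the split as a list of partitions are all hypothetical; the sketch only encodes the two steps stated in the text.

```python
def failover_after_split(active, standbys, partitions):
    """Return the standbys that promote themselves when only standbys reset the active system.

    `partitions` is a list of sets of node names that can still reach each other
    after the network split (assumed input).
    """
    def same_side(a, b):
        return any(a in p and b in p for p in partitions)

    # Step 1: a standby cut off from the active system resets it and fails over.
    promoted = [s for s in standbys if not same_side(s, active)]
    active_was_reset = bool(promoted)

    # Step 2: standbys that could still monitor the active system now see it stop
    # and fail over as well.
    if active_was_reset:
        promoted += [s for s in standbys if same_side(s, active)]
    return promoted


promoted = failover_after_split(
    "active",
    ["standby1", "standby2"],
    [{"active", "standby1"}, {"standby2"}],
)
print(promoted)
```

In this sketch both standby systems appear in the result, meaning that two systems would use the shared resource at the same time.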
In non-patent document 1, a standby system in a cluster computer system that cannot monitor the currently-active system performs two processes: a process of forcibly releasing the currently-active system's control right over a shared disk by using the SCSI RESET command, and a process in which an arbitrary standby system obtains the control right over the released shared disk by issuing the SCSI RESERVE command. The system that takes over the shared disk, that is, the system that takes over the processing, is determined by these two processes. If a RESERVE issued by one standby system is invalidated by a RESET issued by another, excessive failover occurs: processing that was once taken over by a certain standby system through the RESERVE command is taken over again by another standby system. To prevent this, enough time must be allowed between the RESET process and the RESERVE process to ensure that all of the standby systems have finished issuing the RESET command. Consequently, irrespective of whether a network split has actually occurred, failover is delayed by a fixed time.
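The effect of the waiting time between RESET and RESERVE can be illustrated with a simple replay of commands in time order. The function name, the timing values, and the assumption that each standby system waits the same `delay` after its own RESET are hypothetical choices for this sketch, not details taken from non-patent document 1.

```python
def run_takeover(reset_times, delay):
    """Replay RESET/RESERVE commands in time order and return the sequence of disk owners.

    `reset_times` maps each standby system to the time at which it issues RESET;
    each standby then issues RESERVE `delay` time units later (assumed model).
    """
    events = []
    for node, t in reset_times.items():
        events.append((t, 0, "RESET", node))
        events.append((t + delay, 1, "RESERVE", node))

    owner = None
    owners = []
    # At equal times, RESET (priority 0) is processed before RESERVE (priority 1).
    for _, _, command, node in sorted(events):
        if command == "RESET":
            owner = None            # any RESET releases the current reservation
        elif owner is None:
            owner = node            # RESERVE succeeds only on an unreserved disk
            owners.append(node)
    return owners


reset_times = {"standby1": 0.0, "standby2": 2.0}
short_wait = run_takeover(reset_times, delay=1.0)  # RESERVE is issued before all RESETs finish
long_wait = run_takeover(reset_times, delay=3.0)   # RESERVE is issued after all RESETs finish
print(short_wait, long_wait)
```

With the short wait, the disk changes owner twice, which corresponds to the excessive failover described above. With the long wait, only one standby system obtains the disk, at the cost of a fixed delay before any takeover can begin.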
With this method, failover can be performed when a network split occurs. However, a further process is necessary for taking over shared resources other than the shared disk, for example, an IP address, after the shared disk has been taken over. This increases the time required to complete the failover.