A computer system requiring high reliability is configured to include an active system computer that executes processing (an application) and a backup system computer that takes over the processing if a failure occurs in the active system. A cluster program provides the procedure for detecting a failure in the active system and instructing the backup system to take over the processing. In addition, if the application uses data on a disk, the disk is shared between the active system and the backup system. For the backup system to take over the processing when a failure occurs in the active system, it is necessary to determine which of the computers constituting the cluster serves as the backup system, and to take over those resources used by the application and the operating system (OS) that cannot be used by more than one system at the same time (shared resources), for example, a shared disk and an IP address. Moreover, to achieve higher reliability, it is also necessary to ensure that the active system and the backup system do not use the shared resources at the same time even when the path used by the backup system to monitor the active system for failures is interrupted (a network split).
Cluster programs in the cluster configuration often use a method in which a backup system used to take over processing is determined by exclusively taking over a shared disk. This method is proposed by, e.g., Japanese Patent Laid-open No. 10-207855 (patent document 1).
Japanese Patent Laid-open No. 10-207855 discloses a technology in which, using a mechanism that allows a backup system to stop an active system, a cluster program of the backup system resets the active system to release the shared resources held by the active system, and the backup system then acquires the released shared resources, thereby achieving exclusive control of the shared resources.
According to patent document 1, in a computer system having the cluster configuration, if the backup system cannot monitor the active system, the backup system achieves exclusive control of the shared resources by stopping the active system. However, in a cluster constituted of two systems, each of which is a backup system for the other, if a network split occurs, both systems try to reset each other. Therefore, there is a possibility that all the systems will be reset. Accordingly, if a network split occurs, processing is interrupted, and consequently high availability cannot be achieved. This is the problem of conflicting reset (mutual reset).
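The mutual reset described above can be illustrated with a minimal Python sketch. It is a toy model, not the mechanism of patent document 1: the names `Node` and `monitoring_round` and the system names are hypothetical, and the model assumes that every system decides on its resets before any reset takes effect, which is what makes the two resets collide.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    running: bool = True


def monitoring_round(nodes, reachable):
    """Run one monitoring round and return the (issuer, target) resets.

    `reachable` is a set of frozensets {a, b} naming pairs of systems that
    can still monitor each other. Assumption of this sketch: all systems
    decide simultaneously, so decisions are collected first and applied
    afterwards.
    """
    decisions = [
        (src.name, dst.name)
        for src in nodes
        for dst in nodes
        if src is not dst
        and src.running
        and dst.running
        and frozenset((src.name, dst.name)) not in reachable
    ]
    by_name = {n.name: n for n in nodes}
    for _, target in decisions:
        by_name[target].running = False  # the reset lands on the target
    return decisions


# Two systems, each a backup for the other; the split removes every path.
nodes = [Node("system_a"), Node("system_b")]
decisions = monitoring_round(nodes, reachable=set())
survivors = [n.name for n in nodes if n.running]
```

In this model both systems issue a reset in the same round, so `survivors` ends up empty: no system remains to continue the processing.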
In addition, although the backup system resets the active system, the active system never resets the backup system. Accordingly, consider a cluster constituted of an active system and two backup systems (a backup system 1 and a backup system 2) that are used to take over the processing of the active system. If a network split separates the cluster into one group consisting of the active system and the backup system 1 and another group consisting of the backup system 2 alone, the backup system 2 resets the active system to perform system switching. On the other hand, because the active system has been reset by the backup system 2, the backup system 1 also detects a failure of the active system, and consequently performs system switching. As a result, both the backup system 1 and the backup system 2 are switched to an active system at the same time, which causes duplicated accesses to the shared resources. In another case, a system that has already been reset once and is performing recovery processing is reset again, which restarts the recovery processing and delays the recovery of the failed system. This is another problem of conflicting reset (repeated reset).
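The three-system scenario above can be traced with a short Python sketch. It is only an illustration under simplifying assumptions: the function `promoted_backups` and the system names are hypothetical, a backup that cannot monitor the active system is assumed to reset it and take over, and a backup that can monitor the active system is assumed to take over as soon as it sees the active system stop.

```python
def promoted_backups(active, backups, partitions):
    """Return the backups that switch to active after a network split.

    `partitions` is a list of sets of system names; systems in different
    sets cannot monitor each other (assumption of this sketch).
    """
    side = {name: i for i, group in enumerate(partitions) for name in group}
    promoted = []
    active_running = True

    # Step 1: a backup cut off from the active system resets it and takes over.
    for backup in backups:
        if side[backup] != side[active]:
            active_running = False
            promoted.append(backup)

    # Step 2: a backup on the active system's side now observes the stop and,
    # unable to see the other partition, also takes over.
    if not active_running:
        for backup in backups:
            if side[backup] == side[active]:
                promoted.append(backup)

    return promoted


split = [{"active", "backup1"}, {"backup2"}]
result = promoted_backups("active", ["backup1", "backup2"], split)
```

Here `result` contains both backups, which corresponds to the duplicated access to the shared resources described above. With a single partition containing all three systems, the function returns an empty list.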
These problems of the conflicting reset and the repeated reset can be solved by controlling the order in which reset commands are issued, so that the cluster programs that issue reset commands do not issue them to each other at the same time. However, if this solution is used and a failure occurs in the system whose reset-issuance priority is the highest, a delay of a fixed period of time is incurred before the system having the second highest reset-issuance priority completes the reset. Thus, there is a problem of a delay in the system switching.
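The delay introduced by ordered reset issuance can be made concrete with a small Python sketch. The source does not specify how the order is enforced; the sketch assumes, purely for illustration, that each system waits a fixed interval multiplied by its position in the priority order before issuing a reset. The function `first_reset_issuer` and the parameter `wait_per_rank` are hypothetical names.

```python
def first_reset_issuer(priority_order, failed, wait_per_rank):
    """Return (issuer, delay) for the first surviving system in the order.

    Assumption of this sketch: the system at position k in `priority_order`
    waits k * wait_per_rank seconds before it may issue a reset.
    """
    for rank, name in enumerate(priority_order):
        if name != failed:
            return name, rank * wait_per_rank
    raise ValueError("no surviving system can issue a reset")


order = ["system_a", "system_b"]  # system_a has the highest priority

# The lower-priority system fails: system_a resets it without waiting.
fast = first_reset_issuer(order, failed="system_b", wait_per_rank=5.0)

# The highest-priority system fails: system_b must first wait out its turn.
slow = first_reset_issuer(order, failed="system_a", wait_per_rank=5.0)
```

Under these assumptions `fast` carries no delay, while `slow` carries one full waiting interval. That interval is the fixed delay in system switching that the ordering approach incurs whenever the highest-priority system is the one that fails.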