1. Field of the Invention
The present invention relates to a cluster computing system and a failover method for the same. The cluster computing system is a duplex clustering system that has an active system and a standby system, and that can operate both systems without inconsistencies, and recover from the failure, even in the event of a failure of the networks that connect the sites of the two systems.
2. Related Background Art
A cluster computing system is a system that operates a plurality of servers as one system in order to enhance the availability and reliability of the system. A function in which an alternate server takes over data and processing in the event of a failure in such a cluster computing system is called a “failover,” which is an important technology in improving the reliability of the system.
Generally, a cluster computing system has either one storage apparatus system shared by a plurality of host computers, or a plurality of storage apparatuses each separately accessible by one of a plurality of host computers. In a configuration called a duplex system, one of a plurality of host computers operates as an active host computer and performs operation processing, such as data read/write to and from the storage apparatuses, while the other host computers are in a standby state as standby host computers. The active host computer and the standby host computers monitor the status of each other, such that when a failure occurs on one of the host computers, the other host computer detects the failure and takes over the operation processing. One cluster computing system of this kind that has been disclosed performs private communications (heartbeat communications) by selectively using links between sites.
The configuration and operations of a typical prior art cluster computing system are described below with reference to FIGS. 17-20.
FIG. 17 is the general system configuration of a cluster computing system in which a storage apparatus system is shared by host computers.
FIG. 18 is a diagram illustrating a situation in which a failure occurs in a host computer A in the system shown in FIG. 17.
FIG. 19 is the general system configuration of a cluster computing system in which a separate storage apparatus system is provided at each site.
FIG. 20 is a diagram illustrating a situation in which a failure occurs in a storage apparatus system A of a site A in the system shown in FIG. 19.
In a cluster computing system in which a storage apparatus system is shared by host computers, a host computer A10, which is an active host computer, and a host computer B11, which is a standby host computer, are connected to a storage apparatus system 50 by interface cables A40 and B41, respectively, to perform I/O requests.
The host computer A10 that performs operation processing and the host computer B11 that is in a standby state have heartbeat communications with each other via an IP network 30, which connects them, in order to monitor the status of each other.
Under normal operating conditions, the host computer A10 accesses a logical volume of disk storage (hereinafter called a “logical disk volume”) held by the storage apparatus system 50.
If a failure occurs on the host computer A10 as shown in FIG. 18, the standby host computer B11 detects the failure through the IP network 30 and begins operations; the host computer B11 takes over the processing of the host computer A10 and accesses the storage apparatus system 50.
A cluster computing system having such a configuration can maintain operation processing even if a failure occurs on one of the host computers.
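The heartbeat-based takeover described above can be illustrated with a minimal sketch. The class name `StandbyHost`, the 3-second timeout, and the timestamps are assumptions made for illustration only; the source does not specify how the host computer B11 decides that heartbeats have stopped.

```python
from dataclasses import dataclass


@dataclass
class StandbyHost:
    """Hypothetical model of a standby host such as host computer B11."""
    timeout: float                 # assumed: seconds of silence tolerated before takeover
    last_heartbeat: float = 0.0    # time at which the last heartbeat arrived
    active: bool = False           # True once this host has taken over processing

    def on_heartbeat(self, now: float) -> None:
        # Record the arrival time of a heartbeat from the active host.
        self.last_heartbeat = now

    def check(self, now: float) -> bool:
        # Take over operation processing once the peer has been silent too long.
        if not self.active and now - self.last_heartbeat > self.timeout:
            self.active = True
        return self.active


host_b = StandbyHost(timeout=3.0)
host_b.on_heartbeat(now=10.0)
still_standby = host_b.check(now=12.0)   # 2.0 s of silence: remains standby
took_over = host_b.check(now=14.5)       # 4.5 s of silence: takes over
```

In this sketch the standby host treats silence on the IP network as a failure of the active host, which is adequate only as long as the network itself is assumed to be reliable.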
However, if a failure occurs on the storage apparatus system itself that stores the data necessary for operation processing, the operation processing cannot be continued in such a cluster computing system.
For this reason, a configuration shown in FIG. 19, in which an active host computer and a standby host computer have separate storage apparatuses, may be considered.
In a cluster computing system with the configuration shown in FIG. 19, a site A100, which is active, and a site B101, which is standby, have a storage apparatus system A51 and a storage apparatus system B52, respectively. Remote copying takes place at all times between the two storage apparatus systems A51 and B52. Remote copying is a technology in which a plurality of storage apparatus systems installed at physically remote locations copy data (dual writing) without the intervention of any host computers.
For performing remote copying, the storage apparatus systems are connected to each other by a dedicated line or a public telephone line (or a Fibre Channel (FC) network 90 in FIG. 19), such that a copy source logical disk volume of the storage apparatus system A51 of the active site A100 is copied to a copy destination logical disk volume of the storage apparatus system B52 of the standby site B101.
In this way, the storage apparatus system B52 operates as a backup system for the storage apparatus system A51, thereby maintaining the consistency of data.
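The dual-writing behavior of remote copying can be sketched as follows. The `StorageSystem` class, its `pair_with` and `write` methods, and the use of synchronous copying are assumptions made for illustration; the source states only that copying takes place between the storage apparatus systems without host intervention.

```python
class StorageSystem:
    """Hypothetical model of a storage apparatus system with one logical disk volume."""

    def __init__(self, name):
        self.name = name
        self.volume = {}          # block address -> data
        self.copy_target = None   # remote copy destination, if any

    def pair_with(self, target):
        # Establish this system as the remote copy source for `target`.
        self.copy_target = target

    def write(self, block, data):
        # Write locally, then copy to the destination without involving any host.
        self.volume[block] = data
        if self.copy_target is not None:
            self.copy_target.volume[block] = data


storage_a = StorageSystem("A51")
storage_b = StorageSystem("B52")
storage_a.pair_with(storage_b)

# The active host writes only to the copy source.
storage_a.write(0, b"order-1")
storage_a.write(1, b"order-2")
```

After these writes the copy destination holds the same blocks as the copy source, which is what allows the storage apparatus system B52 to serve as a backup.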
When a failure occurs in the storage apparatus system A51 as shown in FIG. 20, the host computer A10 detects the failure and reports via an IP network 30, which connects the sites, to the host computer B11 that a failure has occurred in the storage apparatus system A51.
In the meantime, the storage apparatus system B52 of the site B101 also detects via the FC network 90 that a failure has occurred in the storage apparatus system A51, which is the remote copy source.
Upon receiving the report from the host computer A10 and checking the status of the storage apparatus system B52 via an interface cable B41, the host computer B11 recognizes that a failure has occurred in the storage apparatus system A51 of the site A100 and performs an operation to take over the operation processing.
Suppose that, after the operation processing has been taken over by the site B101 from the site A100, the storage apparatus system A51 of the site A100 recovers to a state where it can execute operation processing. In that case, the storage apparatus system B52, which is connected to the host computer B11 that took over the operation processing, is set as the remote copy source, while the recovered storage apparatus system A51 is reset as the remote copy destination. By performing remote copying in this manner, data can be recovered to the storage apparatus system A51 without suspending the operation of the entire system.
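The reversal of the remote copy direction after recovery can be sketched as below. The `StorageSystem` class and the `reverse_copy_direction` helper are hypothetical, and the full-volume resynchronization is an assumption; the source does not describe how the recovered system is brought up to date.

```python
class StorageSystem:
    """Hypothetical model of a storage apparatus system with one logical disk volume."""

    def __init__(self, name):
        self.name = name
        self.volume = {}          # block address -> data
        self.copy_target = None   # remote copy destination, if any

    def write(self, block, data):
        self.volume[block] = data
        if self.copy_target is not None:
            self.copy_target.volume[block] = data


def reverse_copy_direction(new_source, recovered):
    # The recovered system stops acting as a copy source.
    recovered.copy_target = None
    # Assumed: a full copy brings the recovered system up to date.
    recovered.volume = dict(new_source.volume)
    # From now on, writes at the new source are copied to the recovered system.
    new_source.copy_target = recovered


storage_a = StorageSystem("A51")
storage_b = StorageSystem("B52")
storage_a.copy_target = storage_b
storage_a.write(0, "x")           # normal operation: A51 -> B52

storage_b.write(1, "y")           # A51 has failed; B52 serves the host alone

reverse_copy_direction(storage_b, storage_a)   # A51 has recovered
storage_b.write(2, "z")           # operation continues: B52 -> A51
```

Operation processing at the site B101 is never suspended in this sketch; only the direction of copying changes.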
In the prior art configuration in which different sites have their own storage apparatus system as shown in FIG. 19, operation processing can be continued even when a failure occurs in the active storage apparatus system itself, unlike the system shown in FIG. 17 where a plurality of host computers shares a storage apparatus system.
In a cluster computing system, host computers must constantly monitor each other in order to be able to take over operation processing. In the system shown in FIG. 19, the heartbeat communications on the IP network that connects the host computers and the remote copying on the FC network that connects the storage apparatus systems are utilized to check the status of the counterpart sites and recognize any occurrence of failure. The system, however, may not recover smoothly from failures, depending on the mode of failure occurrence.
Referring to FIGS. 21 and 22, a description is made as to situations in which the system does not recover smoothly from failures, depending on the mode of failure occurrence.
FIG. 21 is a diagram illustrating a situation in which a failure occurs on the IP network 30 and the FC network 90 in the configuration shown in FIG. 19.
FIG. 22 is a diagram illustrating a situation in which a failure occurs at the site A in the configuration shown in FIG. 19.
One mode of failure occurrence is a situation in which, as shown in FIG. 21, both the IP network 30 that connects the host computer A10 of the site A100 with the host computer B11 of the site B101, and the FC network 90 that connects the storage apparatus system A51 of the site A100 with the storage apparatus system B52 of the site B101 become disconnected (hereinafter called a “total disconnection of networks between sites” or simply a “total inter-site network disconnection state”).
In such a situation, since there is no means of communication between the sites, the sites cannot monitor the status of each other.
Another mode of failure occurrence is a situation in which, as shown in FIG. 22, both the host computer A10 and the storage apparatus system A51 of the site A100 fail simultaneously, for example, which causes the entire system within the site A100 to fail (hereinafter called a “site failure”).
From the perspective of the host computer B11 of the site B101, since it cannot obtain information from the site A100 in either situation, the host computer B11 cannot determine whether the problem is a total disconnection of networks between sites shown in FIG. 21 or a site failure shown in FIG. 22.
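The ambiguity can be made concrete with a small sketch. The `World` record and the `observe_from_site_b` function are hypothetical names introduced for illustration; they model only what the site B101 can see through the IP network 30 and the FC network 90.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class World:
    """Hypothetical true state of the links and of the equipment at site A."""
    ip_link_up: bool
    fc_link_up: bool
    host_a_up: bool
    storage_a_up: bool


def observe_from_site_b(w):
    # A heartbeat arrives only if both the IP network and host A are working.
    heartbeat_seen = w.ip_link_up and w.host_a_up
    # The remote copy peer is visible only if both the FC network and storage A work.
    remote_copy_peer_seen = w.fc_link_up and w.storage_a_up
    return (heartbeat_seen, remote_copy_peer_seen)


# FIG. 21: both networks are down, but site A is healthy.
total_disconnection = World(ip_link_up=False, fc_link_up=False,
                            host_a_up=True, storage_a_up=True)
# FIG. 22: both networks are up, but site A has failed entirely.
site_failure = World(ip_link_up=True, fc_link_up=True,
                     host_a_up=False, storage_a_up=False)

same_view = observe_from_site_b(total_disconnection) == observe_from_site_b(site_failure)
```

Both states produce the same observation at the site B101, so no decision rule based on these two channels alone can tell them apart.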
Generally, in conventional cluster computing systems, when a total disconnection of networks between sites occurs and the sites become incapable of monitoring each other, a site that is still running cannot obtain any information about the counterpart site. Logically, this situation leaves the following three possible states:
(1) a state in which operation processing is executed at both sites regardless of the status of counterpart sites (i.e., a split brain state);
(2) a state in which operation processing is halted at both sites regardless of the status of counterpart sites; and
(3) a state in which operation processing is continued only at the site that had been executing operation processing until then.
For example, when the split brain state (1) results from a total disconnection of networks between sites shown in FIG. 21, the host computers at both sites update data in the logical disk volumes in their respective storage apparatus systems, which causes data in the remote copy source and data in the remote copy destination to be inconsistent.
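The three states and the inconsistency caused by the split brain state can be illustrated with a minimal sketch. The `Policy` enumeration, the single-block volumes, and the helper names are assumptions made for illustration only.

```python
from enum import Enum


class Policy(Enum):
    BOTH_RUN = 1      # state (1): split brain
    BOTH_HALT = 2     # state (2): processing halted at both sites
    ACTIVE_ONLY = 3   # state (3): only the previously active site continues


def sites_processing(policy):
    # Which sites execute operation processing under each state.
    return {
        Policy.BOTH_RUN: {"A", "B"},
        Policy.BOTH_HALT: set(),
        Policy.ACTIVE_ONLY: {"A"},
    }[policy]


def simulate(policy):
    # Both volumes start as identical copies; the remote copy link is down,
    # so an update at one site never reaches the other.
    volumes = {"A": {0: "v0"}, "B": {0: "v0"}}
    for site in sorted(sites_processing(policy)):
        volumes[site][0] = f"updated-by-{site}"
    return volumes


def has_conflict(volumes, baseline="v0"):
    # A conflict exists when both copies were changed independently.
    a, b = volumes["A"][0], volumes["B"][0]
    return a != baseline and b != baseline and a != b


conflicts = {p.name: has_conflict(simulate(p)) for p in Policy}
```

Only the split brain state yields conflicting updates in this sketch. Under state (3) the copy at the site B is merely stale and can later be brought up to date by remote copying, whereas under state (2) no data changes but no operation processing is performed either.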
In practice, if during the total disconnection of networks between sites in FIG. 21 the host computer A10 of the site A100 determines that a site failure has occurred at the site B101 and continues to operate, while the host computer B11 of the site B101 also determines that a site failure has occurred at the site A100 and begins to operate, the split brain state (1) results.
Consequently, although the site B101 must remain in a standby state while the site A100 remains active during the total disconnection of networks between sites in FIG. 21, the site B101 cannot logically decide how to act, since it cannot differentiate the total disconnection of networks between sites from the site failure in FIG. 22. Conversely, in the situation in FIG. 22, unless the site B101 begins operation upon confirming the failure, state (2) results.
Consequently, the prior art in general entails the problem of not being able to ensure the reliability of the system as a whole unless the standby site B101 can differentiate the total disconnection of networks between sites from the site failure.
Also, when a cluster computing system is provided with a plurality of selectable routes or links between sites, such a cluster computing system may be more sound and more reliable than ordinary cluster computing systems without such selectable links. However, even such a system cannot cope with hazards, such as a large-scale fire at a site itself or along a communication route between sites, in which the communication condition becomes extremely poor.