“Clustering” is a known technique of connecting multiple computers (or host servers) and enabling the connected computers to act like a single machine. Clustering is used for parallel processing, for load balancing, and for fault tolerance. Corporations often cluster servers together in order to distribute computing-intensive tasks and risks. If one server in the cluster computing system fails, then an operating system can move its processes to a non-failing server in the cluster computing system, and this allows end users to continue working while the failing server is revived.
Cluster computing systems are becoming popular for preventing operation interruptions of applications. Some cluster computing systems have two groups of hosts (e.g., servers), wherein one host group works as the production system, while the other host group works as the standby system. One host group is typically geographically separated (e.g., by several hundred miles) from the other host group. Each host group has its own associated storage system (e.g., a disk system). These two storage systems typically implement remote mirroring technology, which is discussed below. Therefore, the storage system connected to the standby host group contains the same data as the storage system connected to the production host group.
The network connecting the two host server groups is typically a Wide Area Network (WAN), such as the Internet. The two host server groups can communicate over the network for purposes such as error checking. WANs are typically not reliable, since they are often subject to link failures, and data transferred across the Internet can be delayed or lost. Therefore, a standby host group may erroneously interpret a network problem (e.g., link failure or data transmission delay) as a failure state of the production host group, and may inappropriately take over the processes of the production host group even if there is no failure in the production host group.
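The ambiguity described above can be illustrated with a minimal sketch. It assumes a heartbeat-and-timeout style of failure detection, which the text does not specify; the names `Heartbeat`, `standby_declares_failure`, and the timing values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    sent_at: float    # time the production host group sent the heartbeat
    wan_delay: float  # hypothetical WAN transit delay, in seconds


def standby_declares_failure(heartbeats, now, timeout):
    """Return True if no heartbeat arrived within `timeout` seconds of `now`.

    The standby host group only sees arrival times, so it cannot tell
    a delayed heartbeat from one that was never sent.
    """
    arrivals = [hb.sent_at + hb.wan_delay for hb in heartbeats
                if hb.sent_at + hb.wan_delay <= now]
    last_arrival = max(arrivals, default=float("-inf"))
    return now - last_arrival > timeout


# Case 1: production is healthy, but the WAN delays the two most recent
# heartbeats by 10 seconds.
healthy_but_delayed = [Heartbeat(0.0, 0.1), Heartbeat(1.0, 0.1),
                       Heartbeat(2.0, 10.0), Heartbeat(3.0, 10.0)]

# Case 2: production really failed after t = 1.0 and sent nothing further.
really_failed = [Heartbeat(0.0, 0.1), Heartbeat(1.0, 0.1)]

false_alarm = standby_declares_failure(healthy_but_delayed, now=5.0, timeout=3.0)
true_alarm = standby_declares_failure(really_failed, now=5.0, timeout=3.0)
```

In this sketch both cases yield the same decision at the standby host group, because the information that reaches it over the WAN is identical in the two cases.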
The host group in the production system may access a storage volume commonly known as a primary volume (PVOL) in the associated storage system of the production system host group. Similarly, the host group in the standby system may access a storage volume commonly known as a secondary volume (SVOL) in the associated storage system of the standby system host group. The primary volume (PVOL) is mirrored by the secondary volume (SVOL). A storage system may have both PVOLs and SVOLs.
Storage-based remote mirroring technology creates and stores mirrored volumes of data between multiple storage volumes maintained over a given distance. Two disk systems are directly connected by remote links such as an Enterprise Systems Connection (ESCON) architecture, Fibre Channel, telecommunication lines, or a combination of these remote links. The data in the local disk system is transmitted via the remote links to, and copied in, the remote disk system. These remote links are typically highly reliable in comparison to a usual network such as the Internet. However, if a remote link fails, the failure may disadvantageously result in the loss of data.
U.S. Pat. Nos. 5,459,857 and 5,544,347 both disclose remote mirroring technology. These patent references disclose two disk systems connected by remote links, with the two disk systems separated by a distance. Mirrored data is stored in disks in the local disk system and in the remote disk system. When pair creation is indicated, the local disk system copies the data on a local disk to the remote disk system. When a host server updates data on the disk, the local disk system transfers the updated data to the remote disk system through the remote link. Thus, host operation is not required to maintain a mirrored data image of one disk system in another disk system.
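The behavior described above can be sketched roughly as follows. This is a simplified illustration and not the implementation disclosed in the cited patents: it assumes that pair creation triggers an initial copy to the remote disk system and that a disk can be modeled as a mapping from block number to data. The class and method names (`DiskSystem`, `LocalDiskSystem`, `create_pair`, `host_write`) are hypothetical.

```python
class DiskSystem:
    """A disk system modeled as a mapping from block number to data."""

    def __init__(self, blocks=None):
        self.blocks = dict(blocks or {})


class LocalDiskSystem(DiskSystem):
    """Local disk system that mirrors its data to a paired remote system."""

    def __init__(self, blocks=None):
        super().__init__(blocks)
        self.remote = None  # set when a mirrored pair is created

    def create_pair(self, remote):
        # Assumed behavior: on pair creation, existing data is copied
        # to the remote disk system over the remote link.
        self.remote = remote
        for block, data in self.blocks.items():
            self._send(block, data)

    def host_write(self, block, data):
        # The host only writes locally; the disk system itself forwards
        # the update, so no host operation maintains the mirror.
        self.blocks[block] = data
        if self.remote is not None:
            self._send(block, data)

    def _send(self, block, data):
        # Stand-in for a transfer over the remote link.
        self.remote.blocks[block] = data


local = LocalDiskSystem({0: b"boot", 1: b"data"})
remote = DiskSystem()
local.create_pair(remote)       # initial copy
local.host_write(1, b"new")     # update is forwarded by the disk system
local.host_write(2, b"log")
```

After these calls the remote disk system holds the same blocks as the local disk system, without the host ever addressing the remote system directly.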
U.S. Pat. No. 5,933,653 discloses another type of data transferring method between a local disk system and a remote disk system. In synchronous mode, the local disk system transfers data to the remote disk system before completing a write request from a host. In semi-synchronous mode, the local disk system completes a write request from the host and then transfers the write data to the remote disk system. Subsequent write requests from the host are not processed until the local disk system completes the transfer of the previous data to the remote disk system. In adaptive copy mode, pending data to be transferred to the remote disk system is stored in a memory and transferred to the remote disk system when the local disk system and/or remote links are available for the copy task.
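The difference between the three modes lies in the ordering of write completion and remote transfer, which the following sketch tries to make visible. It is an illustrative model only, not the method of the cited patent: transfers are simulated in-process, the background transfer in semi-synchronous mode is modeled as occurring just before the next write is processed, and the names `LocalDisk`, `RemoteDisk`, `drain`, and `trace` are hypothetical.

```python
from collections import deque

MODES = ("synchronous", "semi-synchronous", "adaptive-copy")


class RemoteDisk:
    def __init__(self):
        self.blocks = {}


class LocalDisk:
    def __init__(self, remote, mode):
        if mode not in MODES:
            raise ValueError(f"unknown mode: {mode}")
        self.remote = remote
        self.mode = mode
        self.blocks = {}
        self.pending = deque()  # writes not yet transferred to the remote
        self.log = []           # ordered trace of events

    def _transfer_one(self):
        block, data = self.pending.popleft()
        self.remote.blocks[block] = data
        self.log.append(f"transfer {block}")

    def write(self, block, data):
        if self.mode == "semi-synchronous" and self.pending:
            # The previous write must reach the remote disk system
            # before this write request is processed.
            self._transfer_one()
        self.blocks[block] = data
        self.pending.append((block, data))
        if self.mode == "synchronous":
            # Transfer to the remote before completing the write request.
            self._transfer_one()
        self.log.append(f"complete {block}")

    def drain(self):
        """Transfer pending data once the local system and links are free."""
        while self.pending:
            self._transfer_one()


def trace(mode):
    """Issue two host writes, then let pending transfers finish."""
    disk = LocalDisk(RemoteDisk(), mode)
    disk.write("a", 1)
    disk.write("b", 2)
    disk.drain()
    return disk.log
```

Calling `trace` with each mode yields a different interleaving of "complete" and "transfer" events for the same two writes: transfer precedes completion in synchronous mode, each completion is followed by its transfer before the next write in semi-synchronous mode, and both completions precede any transfer in adaptive copy mode.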
There is a need for a system and method that will overcome the above-mentioned deficiencies of conventional methods and systems. There is also a need for a system and method that will increase the reliability of cluster computing systems and improve failure detection in these computing systems. There is also a need for a system and method that will accurately detect failure in the production host group of a cluster system, so that the standby host group is prevented from taking over the processes of the production host group when the production host group has not failed.