The present invention relates to a system and method for providing a quorum service for high availability clusters to prevent false or unnecessary failovers.
Multi-processing systems are commonly configured in clusters of related systems to ensure high availability. These high availability clusters typically require the configuration of one or more heartbeats or communication paths between or among systems. A "heartbeat" is meant to include any type of brief, periodic communication signal which systems send to each other to ensure that all systems in the cluster are functional. (Typically, a main system sends a message and the other systems repeat the message back to the main system to check system operability in the cluster.) The failure of all heartbeat mechanisms to a given system indicates that the system is dead (not functioning properly), and at least one of the remaining systems in the cluster needs to initiate a failover of any applications which were running on the affected system.
However, failed cables, failed routers, etc. can give the appearance that a system is not functioning properly when the system is actually still alive and functioning. Since the systems on one side of the network (i.e., one side of the failed cables) cannot communicate with the systems on the other side of the network, each side may conclude that the other has failed, and false failovers occur. This leads to the possibility of the same application being run on two or more different systems in the network, which can lead to data corruption. This possibility is especially high in cluster configurations in which there are no redundant heartbeat mechanisms, as well as in wide-area failover or disaster recovery configurations.
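The false failover condition described above can be illustrated with a minimal sketch. The class name `HeartbeatMonitor`, the `timeout_s` parameter, and the injectable `clock` are hypothetical and are introduced here only for illustration; they are not taken from any particular cluster product. The sketch shows that, when heartbeats alone are used, two healthy systems separated by a failed link each come to suspect the other.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat received from each peer (illustrative only)."""

    def __init__(self, peers, timeout_s=3.0, clock=time.monotonic):
        # timeout_s and clock are assumptions made for this sketch.
        self._timeout_s = timeout_s
        self._clock = clock
        now = clock()
        self._last_seen = {peer: now for peer in peers}

    def record_heartbeat(self, peer):
        """Note that a heartbeat from `peer` has just arrived."""
        self._last_seen[peer] = self._clock()

    def suspected_dead(self):
        """Return the peers whose heartbeats have been silent past the timeout."""
        now = self._clock()
        return sorted(
            peer
            for peer, seen in self._last_seen.items()
            if now - seen > self._timeout_s
        )


if __name__ == "__main__":
    # A fake clock makes the example deterministic.
    fake_time = [0.0]
    clock = lambda: fake_time[0]

    system_a = HeartbeatMonitor(["B"], timeout_s=3.0, clock=clock)
    system_b = HeartbeatMonitor(["A"], timeout_s=3.0, clock=clock)

    # Heartbeats flow normally at t = 2.
    fake_time[0] = 2.0
    system_a.record_heartbeat("B")
    system_b.record_heartbeat("A")

    # The cable between A and B fails; both systems keep running,
    # but no further heartbeats arrive on either side.
    fake_time[0] = 6.0
    print(system_a.suspected_dead())
    print(system_b.suspected_dead())
```

In this sketch both systems remain alive, yet each reports the other as suspect, so a failover decision based on heartbeats alone could start the same application on both sides.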
There is a need for a system and method which enables each system of a cluster to register with a quorum service which can assist in determining whether a failover is required.
In accordance with the teachings of the present invention, a system and method are provided for a quorum service with which each system of a cluster registers prior to a potential failover to ensure proper functionality of the cluster. In particular, a method is provided for preventing false or unnecessary failovers in a high availability cluster due to network failures, wherein said high availability cluster includes a plurality of systems, the method comprising the steps of: providing a quorum service with which each of said systems can independently communicate; sending a registration signal from each system, indicating that the system is operational, when the failure of any system in the cluster is suspected; initiating shutdown procedures at a particular system if the particular system is unable to send a registration signal to said quorum service; requesting registration status by one or more of the systems other than the particular system that is unable to send a registration signal to said quorum service; and proceeding with failover activities by at least one of the systems other than the particular system that is unable to send a registration signal to said quorum service.
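By way of illustration only, the steps recited above may be sketched as follows. The names `QuorumService`, `ClusterSystem`, `on_suspected_failure`, and `decide_failover` are hypothetical, and the in-memory service stands in for whatever independently reachable service an embodiment would use. The sketch assumes, as the stated purpose of the method implies, that a surviving system proceeds with failover only when the suspected system has not registered; how long a system waits for registrations to arrive is not modeled.

```python
class QuorumService:
    """In-memory stand-in for the quorum service (illustrative only)."""

    def __init__(self):
        self._registered = set()

    def register(self, system_id):
        """Record a registration signal from an operational system."""
        self._registered.add(system_id)

    def is_registered(self, system_id):
        """Report the registration status of a system."""
        return system_id in self._registered


class ClusterSystem:
    """One system of the cluster; can_reach_quorum simulates the network."""

    def __init__(self, system_id, quorum, can_reach_quorum=True):
        self.system_id = system_id
        self.quorum = quorum
        self.can_reach_quorum = can_reach_quorum
        self.state = "running"

    def on_suspected_failure(self):
        """Register with the quorum service, or shut down if unable to."""
        if not self.can_reach_quorum:
            # Unable to send a registration signal: initiate shutdown.
            self.state = "shutting_down"
            return False
        self.quorum.register(self.system_id)
        return True

    def decide_failover(self, suspect_id):
        """Request the suspect's registration status and decide on failover."""
        if self.state != "running" or not self.can_reach_quorum:
            return False
        # Assumption of this sketch: a registered suspect is still alive,
        # so taking over its applications would be a false failover.
        return not self.quorum.is_registered(suspect_id)


if __name__ == "__main__":
    # Case 1: system B is cut off from the quorum service.
    quorum = QuorumService()
    a = ClusterSystem("A", quorum)
    b = ClusterSystem("B", quorum, can_reach_quorum=False)
    a.on_suspected_failure()
    b.on_suspected_failure()
    print(b.state, a.decide_failover("B"))

    # Case 2: only the heartbeat path failed; both systems register.
    quorum = QuorumService()
    c = ClusterSystem("C", quorum)
    d = ClusterSystem("D", quorum)
    c.on_suspected_failure()
    d.on_suspected_failure()
    print(c.decide_failover("D"), d.decide_failover("C"))
```

In the first case the isolated system shuts itself down and the surviving system proceeds with failover; in the second case both systems find each other registered and neither initiates a failover.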