The present invention relates to the field of computer network servers. More particularly, the present invention relates to a failover algorithm that identifies an active server cluster controller from available primary and secondary cluster controllers.
A cluster of servers may depend on one or more backend servers, also referred to as cluster controllers, for a crucial backend service, such as routing messages between servers in the cluster. A failure of the backend service may cripple the server cluster. Accordingly, backend server redundancy is desirable such that if a primary backend server or controller fails, a secondary backend server or controller will automatically take over active control of the server cluster.
What is needed is a reliable technique for ensuring that when a primary or secondary backend server fails while acting as the server cluster's active controller, substantially all of the servers in the cluster will synchronously and automatically recognize that the other backend server has taken on the role of active controller of the server cluster.
The present invention includes a failover algorithm implemented in software. The failover algorithm does not rely on any failover-specific hardware. The failover algorithm allows servers in a cluster to determine whether a primary or secondary controller is active without requiring communication between the primary and secondary controllers.
One aspect of the invention includes a server cluster in which several servers are coupled to two servers, which are designated as a primary controller and a secondary controller. While the server cluster is operational, either the primary controller or the secondary controller will be actively controlling the cluster. The controller actively controlling the cluster is referred to as the active controller. Software running on the servers of the cluster, on the primary controller, and on the secondary controller, cooperates to ensure that each server will properly identify which controller is active at any particular time, including, but not limited to, upon starting up or bootstrapping the server cluster, upon adding one or more servers to a cluster that is already operating, and upon failure of an active controller, one or more servers, or a link between an active controller and a server.
According to one aspect of the invention, the failover algorithm includes the following steps performed by each server of a group of servers in the cluster for identifying which controller is active: making the server's own assessment of which controller is the active controller; and identifying either the primary controller or the secondary controller as a consensus active controller based upon at least a quorum of the other servers' own assessments of which controller is the active controller. Each of the servers in the group may also: notify the primary controller and/or the secondary controller of the server's own assessment of which controller is the active controller; and query the primary controller and/or the secondary controller for the other servers' own assessments. Each of the servers in the group may also use at least a majority of the number of servers in the server cluster, excluding the primary and secondary controllers, as a minimum number of servers in a quorum.
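The quorum-based identification described above can be illustrated with a minimal Python sketch. It is not the claimed implementation. The names `Controller`, `notify`, `query`, `quorum_size`, and `consensus_active_controller` are hypothetical, the controller is modeled as an in-memory store of reported assessments, and the sketch assumes that a quorum is a strict majority of the cluster's servers and that all reported assessments are tallied together.

```python
from collections import Counter
from typing import Dict, Optional

PRIMARY = "primary"
SECONDARY = "secondary"


class Controller:
    """Hypothetical stand-in for a controller that records the
    assessments servers report to it and returns them on request."""

    def __init__(self) -> None:
        self._assessments: Dict[str, str] = {}

    def notify(self, server_id: str, assessment: str) -> None:
        # A server reports its own assessment of the active controller.
        self._assessments[server_id] = assessment

    def query(self) -> Dict[str, str]:
        # A server asks for the assessments reported so far.
        return dict(self._assessments)


def quorum_size(num_servers: int) -> int:
    """Smallest strict majority of the servers in the cluster
    (the primary and secondary controllers are not counted)."""
    return num_servers // 2 + 1


def consensus_active_controller(
    assessments: Dict[str, str], num_servers: int
) -> Optional[str]:
    """Return the controller named by at least a quorum of servers,
    or None if neither controller has reached a quorum yet."""
    votes = Counter(assessments.values())
    needed = quorum_size(num_servers)
    for controller in (PRIMARY, SECONDARY):
        if votes[controller] >= needed:
            return controller
    return None
```

In this sketch, a server would call `notify` with its own assessment, call `query` to read the assessments of the other servers, and pass the result to `consensus_active_controller` along with the number of servers in the cluster. Because the servers only read and write assessments through a controller, no message passes between the primary and secondary controllers.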