1. Field of the Invention
This invention relates to negotiated takeovers in high availability clusters that enable high availability to be maintained after a node suffers a soft failure.
2. Related Art
File servers create a critical link to information that is accessed by system users. Data is the lifeblood of every corporation, and with the explosive growth of the Internet great emphasis has been placed on the ability of systems to deliver data to users quickly and efficiently. A major focus of these efforts is concern regarding how information can be provided when the system providing it suffers a failure.
Filer failures come in two basic varieties that are best described as "Hard Failures" and "Soft Failures." For example, when a node in a high availability cluster becomes unable to communicate with other nodes in the cluster, it is presumed to have suffered a hard failure; this is often the case when a filer loses power.
Additionally, when a filer in a high availability cluster loses the ability to read a portion of a disk that it should be able to read, this is considered a Soft Failure as the filer is only partially impaired and is generally able to communicate with other nodes in the cluster.
The problem with the current state of the art is that soft failures are ignored by the cluster failover logic, so a filer that has suffered a soft failure continues to operate in whatever capacity it is able. This can be devastating to the overall performance of the filer cluster; the approach does not make the most efficient use of available resources and severely impacts information delivery.
One known method of effecting a takeover in a high availability cluster occurs when a multi-node system is utilizing a protocol transmitted between the nodes that identifies that each node is still functioning. When this heartbeat-like message ceases from a node, the other nodes know that the node without a heartbeat has died. Consequently, one or more nodes in the cluster may take over some or all of the affected node's tasks.
This method of takeover is widely available and quite effective, but it suffers from a severe drawback: it is oblivious to soft failures. For example, in a 2-node cluster, one of the nodes may be able to access only a portion of its designated storage areas due to a cabling problem. The impaired node, however, may still be able to send the heartbeat message to the other node, in effect fooling the other node into believing the affected node is fully functional when in fact it has suffered a soft failure and should be taken over.
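The heartbeat scheme described above, and its blind spot, can be illustrated with a minimal sketch. The class name `HeartbeatMonitor`, the `timeout` value, and the node names are all hypothetical and chosen only for illustration; the sketch is not taken from any particular cluster implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat time per node (hypothetical helper)."""

    # Assumed parameter: seconds of silence before a node is presumed dead.
    timeout: float
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        """Note that a heartbeat arrived from `node` at time `now`."""
        self.last_seen[node] = now

    def dead_nodes(self, now: float) -> List[str]:
        """Return nodes whose heartbeat has been silent longer than `timeout`."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record("filer1", now=0.0)
monitor.record("filer2", now=0.0)

# filer2 has lost access to part of its storage (a soft failure) but keeps
# sending heartbeats; filer1 has lost power and has gone silent.
monitor.record("filer2", now=4.0)

presumed_dead = monitor.dead_nodes(now=5.0)
```

In this sketch only the silent node is reported. The monitor has no information about disk access, so the node with the soft failure is still treated as fully functional.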
Utilizing certain novel techniques, a filer impaired by a soft failure can self-diagnose or assist other filers in collectively diagnosing its operation. Through this diagnosis the filer can determine whether the problem is with some other component of the system or with itself. At this point it may continue operation in whatever capacity it is able, or it may negotiate a shutdown and takeover in a controlled manner with one or more other filers.
For example, filer 1 in a 2-node high availability cluster may determine it cannot read disk 1 when it should be able to. It then asks filer 2 if it is able to read disk 1 knowing that filer 2 should be able to read disk 1. If filer 2 informs filer 1 that it can read disk 1 then filer 1 knows it is impaired and can take appropriate action. If filer 2 informs filer 1 that it is also unable to read disk 1, filer 1 can conclude that the problem is elsewhere and can take appropriate action. Additionally, filer 2 can take note that access to disk 1 is impaired but is not attributable to its operation.
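The comparison in the preceding example reduces to a small decision table. The following is a minimal sketch under stated assumptions: the names `Diagnosis` and `diagnose` are hypothetical, and the two boolean inputs stand in for whatever mechanism the filers actually use to test and report disk access.

```python
from enum import Enum


class Diagnosis(Enum):
    HEALTHY = "healthy"                      # the local filer can read the disk
    SELF_IMPAIRED = "self_impaired"          # only the local filer cannot read it
    PROBLEM_ELSEWHERE = "problem_elsewhere"  # neither filer can read it


def diagnose(local_can_read: bool, peer_can_read: bool) -> Diagnosis:
    """Classify a disk-access problem by comparing local and peer results.

    Both filers are assumed to be configured so that they should be able
    to read the disk in question.
    """
    if local_can_read:
        return Diagnosis.HEALTHY
    if peer_can_read:
        # The peer reads the disk fine, so the fault lies with the local filer.
        return Diagnosis.SELF_IMPAIRED
    # Neither filer can read the disk, so the fault lies elsewhere,
    # for example in the disk itself.
    return Diagnosis.PROBLEM_ELSEWHERE


# Filer 1 cannot read disk 1, but filer 2 reports that it can.
result = diagnose(local_can_read=False, peer_can_read=True)
```

Here `result` is `Diagnosis.SELF_IMPAIRED`, which corresponds to the case in which filer 1 knows it is impaired and can take appropriate action.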
In general, a takeover of one node by another is an all-or-nothing process, and in the example above the appropriate action taken may include requesting that filer 2 take over while filer 1 shuts down until it is again fully functional. The invention, however, supports the concept that under certain circumstances partial functionality may be passed to create load sharing based on criteria designed to create optimal utilization of resources.
This could occur when both nodes in a 2-node cluster have partially failed and some functionality is better than none at all. Both nodes could remain online, or one, being more impaired than the other, could shut down, allowing the remaining node to take over in whatever capacity it is able through a negotiated takeover process. This would allow the offline node time to be restored to a fully functional capacity, and then a negotiated takeover could occur to bring the restored node online. The process would then be repeated for the other impaired node, resulting in a fully functional cluster and the best possible information availability while the process is being executed.
Accordingly, it would be advantageous to provide a technique for takeover of a node in a high availability file server cluster after the node has suffered a Soft Failure so as to maintain high availability of information and use available resources to their maximum potential.
Thus, the invention includes a system and method for at least one node in a multi-node high availability cluster to declare itself impaired and request that at least one other node take over some or all of its functions. This situation may occur when a node suffering a soft failure notifies the other nodes in a cluster that it is in trouble and is requesting help from the other nodes. The other nodes can assist the affected node with a diagnosis of the problem through collective intelligence and comparison diagnostics, or the affected node can self-diagnose the problem.
Following this analysis stage, an assisting node determines whether it is impaired or was recently impaired and is recovering from a failure. If the assisting node determines it is not impaired or recovering from recent impairment, it may offer to take over the affected node's functions. The takeover process commences with the assisting node requesting that the impaired node shut down, and a takeover timer is started. This gives the impaired node a predetermined time period in which to gracefully shut down, and once it has shut down the assisting node takes over. If the affected node has not shut down at the expiration of the takeover timer, the assisting node sends kill messages to the affected node that force it to shut down. The assisting node then takes over the functions of the affected node.
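The takeover sequence just described can be sketched as a simple simulation. Everything named here is an assumption made for illustration: the `Node` fields, the `shutdown_after` value that models how long a graceful shutdown takes, and the event names in the returned log. The sketch also assumes that an impaired or recovering assisting node simply declines, and it does not model real messaging or timers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    impaired: bool = False
    recovering: bool = False
    # Hypothetical: seconds this node needs for a graceful shutdown;
    # None models a node that never completes the shutdown on its own.
    shutdown_after: Optional[float] = None
    running: bool = True


def negotiated_takeover(assistant: Node, affected: Node, timer: float) -> List[str]:
    """Simulate the negotiated takeover and return the ordered event log."""
    events: List[str] = []

    # The assisting node first checks its own health.
    if assistant.impaired or assistant.recovering:
        events.append("decline")
        return events

    events.append("offer_takeover")
    events.append("request_shutdown")  # the takeover timer starts here

    graceful = (
        affected.shutdown_after is not None
        and affected.shutdown_after <= timer
    )
    if graceful:
        events.append("graceful_shutdown")
    else:
        # Timer expired with the affected node still up: force it down.
        events.append("timer_expired")
        events.append("send_kill")
    affected.running = False

    events.append("take_over")
    return events


healthy = Node("filer2")
cooperative = Node("filer1", impaired=True, shutdown_after=2.0)
hung = Node("filer1", impaired=True, shutdown_after=None)

graceful_log = negotiated_takeover(healthy, cooperative, timer=5.0)
forced_log = negotiated_takeover(healthy, hung, timer=5.0)
```

In both runs the log ends with `take_over`; the runs differ only in whether the affected node shut down within the timer or had to be killed.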