In a single computer network, clustering application software allows resources, or interfaces, to be executed by more than one server, or node, within the network. Clustered servers increase efficiency and play a vital role in handling complex business transactions.
A transition occurs when a resource is made available or unavailable on a given node. A transition is classified as either a failover or a failback. During failover, an online request makes the resource available on a node. During failback, an offline request makes the resource unavailable on a node.
An online request generally occurs when one node in a cluster goes down and the resource is needed on the surviving node. Failover enables the resources owned by the failed node to be taken over by the surviving node. Failover is complete when all of the failed node's resources are online on the surviving node.
An offline request generally occurs when a server node that has failed is restored. When the failed server node becomes active again, failback brings all the resources that were transitioned during failover back to the original node. Failback is complete when all the transitioned resources are restored to the original node via an online request and removed from the remaining node via an offline request.
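By way of illustration only, the online and offline requests described above may be sketched for a two-node cluster as follows. The `Node` class, the `failover` and `failback` functions, and the resource names are hypothetical and are introduced solely for this example; they are not taken from any particular clustering product.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster node that tracks which resources are online on it."""
    name: str
    up: bool = True
    online: set = field(default_factory=set)

    def online_request(self, resource: str) -> None:
        """Make a resource available on this node."""
        if not self.up:
            raise RuntimeError(f"{self.name} is down")
        self.online.add(resource)

    def offline_request(self, resource: str) -> None:
        """Make a resource unavailable on this node."""
        self.online.discard(resource)


def failover(failed: Node, survivor: Node, owned: set) -> None:
    """Bring every resource owned by the failed node online on the survivor."""
    failed.up = False
    failed.online.clear()  # nothing is reachable on a node that is down
    for resource in sorted(owned):
        survivor.online_request(resource)


def failback(original: Node, survivor: Node, owned: set) -> None:
    """Return the transitioned resources to the restored original node."""
    original.up = True
    for resource in sorted(owned):
        survivor.offline_request(resource)  # remove from the remaining node
        original.online_request(resource)   # restore on the original node


# Each node starts out running its own resources.
node_a, node_b = Node("A"), Node("B")
owned_by_a, owned_by_b = {"disk1", "ip1"}, {"disk2"}
for r in owned_by_a:
    node_a.online_request(r)
for r in owned_by_b:
    node_b.online_request(r)

failover(node_a, node_b, owned_by_a)   # node A goes down
after_failover = set(node_b.online)    # node B now serves A's resources as well
failback(node_a, node_b, owned_by_a)   # node A is restored
```

In this sketch, failover is complete once every resource in `owned_by_a` is online on node B, and failback is complete once those resources are online on node A again and offline on node B.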
Applications that are not cluster-aware will not fail over to the surviving node in the case of a server failure. All client jobs running on the failed node stop, and access to the resources on that node is lost. The application must wait for the server to come back up before the jobs can resume, which is a time-consuming and inefficient process.
Cluster-aware applications do not face this problem. If a node in the cluster goes down, the application fails over to the second node in the cluster and the client jobs continue without interruption. The surviving cluster node picks up where the failed server left off. When the failed node becomes available again, the resources transitioned during failover are transitioned back to the original node during failback. Clients connected to the clustered servers need not know that a server failure occurred.
One prior art clustering system to which the method of the present invention generally relates is described in U.S. Pat. No. 5,964,886, entitled HIGHLY AVAILABLE CLUSTER VIRTUAL DISK SYSTEM. In this system, a cluster implements a virtual disk system that provides each node of the cluster access to each storage device of the cluster. The virtual disk system provides high availability such that a storage device may be accessed and data access requests are reliably completed even in the presence of a failure.
The method of the present invention makes use of a clustering system using a virtual disk system. Rather than implementing the clustering system itself, the method of the present invention builds upon such a system, adding an additional type of resource capable of failover and failback. Thus, when one node in the cluster goes down or becomes inactive for any reason, the method of the present invention enables all the resources owned by that node to be taken over by the other node in the cluster.
Another prior art clustering system to which the method of the present invention generally relates is described in U.S. Pat. No. 5,852,724, entitled SYSTEM AND METHOD FOR “N” PRIMARY SERVERS TO FAIL OVER TO “1” SECONDARY SERVER. This invention is directed toward a system and method for server back-up. A set of primary servers and at least one secondary server are coupled to a set of networks. Upon detecting a primary server status indicating that a primary server is inoperative, the secondary server starts a duplicate set of the services corresponding to the now inoperative primary server.
The method of the present invention does not make use of a duplicate set of the primary services. Instead, the services, called resources, are transitioned immediately upon server failure. Before the failure, each server in the method of the present invention is capable of running its own resources. When one server goes down, all of the failed server's resources are transitioned to the remaining node. In the method of the present invention, groups of resources are brought online on the remaining server instead of using duplicate sets of the resources, so that the resources become available on the remaining node as soon as a server failure occurs.
Still another prior art clustering system to which the method of the present invention generally relates is detailed in U.S. Pat. No. 6,134,673, entitled METHOD FOR CLUSTERING SOFTWARE APPLICATIONS. This prior art method allows for fault tolerant execution of an application program in a server network having a first and second server, wherein the method includes: executing the application program in the first server; storing an object which represents the program in a cluster network database, wherein the object contains information pertaining to the program; detecting a failure of the first server; and executing the application program in the second server upon detection of the failure of the first server, in accordance with the information in the object.
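By way of illustration only, the four steps recited above (executing, storing an object, detecting a failure, and executing on the second server) may be sketched as follows. This is not the implementation of U.S. Pat. No. 6,134,673; the dictionary standing in for the cluster network database, the `args` field of the stored object, and the `is_alive` callback are all assumptions made for this example.

```python
# Hypothetical stand-in for the cluster network database.
cluster_db = {}


def register(program: str, info: dict) -> None:
    """Store an object that represents the program and its launch information."""
    cluster_db[program] = dict(info)


def run(server: str, program: str, info: dict) -> str:
    """Pretend to execute the program on a server; returns a status string."""
    return f"{program} running on {server} with {info['args']}"


def fault_tolerant_execute(program, info, first, second, is_alive):
    """Run on the first server; on failure, rerun on the second from the stored object."""
    register(program, info)                  # store the object in the database
    status = run(first, program, info)       # execute in the first server
    if not is_alive(first):                  # detect a failure of the first server
        # Execute in the second server in accordance with the stored object.
        status = run(second, program, cluster_db[program])
    return status


# Assumed scenario: server1 fails after the program has been started.
result = fault_tolerant_execute(
    "payroll", {"args": "--batch"}, "server1", "server2",
    is_alive=lambda server: server != "server1",
)
```

The point of the sketch is that the second server needs only the stored object, not any state from the failed server, to restart the program.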
The method of the present invention builds upon the above-mentioned clustering software. U.S. Pat. No. 6,134,673 describes a clustered server capable of detecting a failure in one server and executing the same application program in the second server. The method of the present invention is different because the application on each server is capable of running independently prior to a failure. In the event of a failure, resources from the failed server are transitioned to the remaining server node. Because the resources remain available to the client applications, clients connected to the clustered server need not know that a server failure occurred.
The concept of cluster-awareness is not a new one. However, previous implementations of cluster-aware applications require both servers to have identical configurations of the software being executed before a server failure occurs. In the event of a server failure, the applications are executed on the remaining server node. If no server failure occurs, however, only one copy of the application runs, and the other copy is never utilized. Thus, the need arises to provide transitions from one server node to another while actively running the application on both servers.
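By way of illustration only, the difference between a standby configuration and one in which both servers actively run the application may be sketched as follows. The `placement` function, the node names, and the resource group names are hypothetical and serve only to show which nodes carry work before and after a failure.

```python
def placement(groups: dict, failed: str = None) -> dict:
    """Map each surviving node to the resource groups it serves.

    `groups` maps a node name to the resource groups that node owns. If
    `failed` names a node, its groups are transitioned to the first survivor.
    """
    survivors = [node for node in groups if node != failed]
    result = {node: list(groups[node]) for node in survivors}
    if failed is not None:
        result[survivors[0]].extend(groups[failed])
    return result


def idle_nodes(assignment: dict) -> list:
    """Return the nodes that currently serve no resource group."""
    return [node for node, served in assignment.items() if not served]


# Standby arrangement: the second node holds an unused copy until a failure.
standby = {"node1": ["app"], "node2": []}
# Both nodes run their own resource groups before any failure.
active_active = {"node1": ["appA"], "node2": ["appB"]}

standby_idle = idle_nodes(placement(standby))
active_idle = idle_nodes(placement(active_active))
after_failure = placement(active_active, failed="node1")
```

Under these assumptions, the standby arrangement leaves `node2` idle for as long as no failure occurs, whereas the arrangement with both nodes active has no idle node and, after `node1` fails, `node2` serves both resource groups.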