High-availability clusters (HA Clusters) are computer clusters that are implemented primarily for the purpose of improving the availability of the services that the cluster provides. Computer clusters generally have secondary, redundant components which can be used to provide service when a primary component fails. HA Clusters improve availability by automatically switching from one or more primary components to one or more redundant components when appropriate, a process also known as failing over.
VERITAS CLUSTER SERVER (VCS) software, for example, can be used to reduce application downtime in HA Clusters. It provides application cluster capabilities to systems running databases, file sharing on a network, electronic commerce websites or other applications. VCS software is available from Symantec Corp. of Cupertino, Calif. MICROSOFT CLUSTER SERVER (MSCS) software also provides cluster capabilities to increase the availability of applications. MSCS software is available from Microsoft Corp. of Redmond, Wash. Both VCS and MSCS software are cluster drivers.
Generally, in order for two redundant storage systems in a cluster to be useful in protecting the cluster against a site disaster, one of the two storage systems must include a copy of the data in the other storage system. Replication software is typically used to copy the data on a primary storage system to a secondary storage system. For example, the SRDF replication software that runs on SYMMETRIX data storage systems can be used to copy data from one SYMMETRIX data storage system to another. The SRDF family of replication software and the SYMMETRIX family of data storage systems are both available from EMC Corp. of Hopkinton, Mass.
FIG. 1 illustrates a typical implementation of a cluster 100 using the SRDF family of replication software and an MSCS cluster driver. Cluster 100 includes components in two different locations. Each location includes a data volume 120 that provides data to the co-located server 140 through a switch 160. The data on primary data volume 120-1 is replicated on secondary data volume 120-2 via link 194. Cluster drivers run on each of the servers 140-1, 140-2. The servers communicate via link 192. When the primary server 140-1 is unable to run an application, the secondary server 140-2 causes the application to fail over to itself. The SRDF/CE software, which runs on each server, makes sure that the data volumes are ready for use.
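By way of illustration only, the arrangement of FIG. 1 may be sketched in Python as follows. The sketch is a simplified model under stated assumptions and is not the SRDF, SRDF/CE, or MSCS implementation; the names DataVolume, Server, replicate, and fail_over are hypothetical and are introduced here solely for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DataVolume:
    """Hypothetical stand-in for a data volume 120."""
    name: str
    blocks: dict = field(default_factory=dict)
    ready: bool = False


@dataclass
class Server:
    """Hypothetical stand-in for a server 140 and its co-located volume."""
    name: str
    volume: DataVolume
    healthy: bool = True


def replicate(primary: DataVolume, secondary: DataVolume) -> None:
    """Copy the primary volume's data to the secondary volume.

    Assumption: a full copy stands in for replication over link 194.
    """
    secondary.blocks = dict(primary.blocks)


def fail_over(primary: Server, secondary: Server) -> Server:
    """Return the server on which the application should run."""
    if primary.healthy:
        return primary
    # Assumption: setting this flag stands in for the cluster-enabling
    # software making the secondary data volume ready for use.
    secondary.volume.ready = True
    return secondary
```

In this simplified model, fail_over treats every failure of the primary server identically and consults no information about the state of the data volumes beyond a single readiness flag.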
The inventor of the present invention recognized some limitations of cluster 100. For instance, all fail overs from one location to another are treated in the same manner, whether they are induced by a server failure or routine maintenance, on the one hand, or by a data volume failure or a data volume access failure, on the other hand. Moreover, each cluster driver requires a different implementation of the cluster-enabling portion of the replication software. The cluster-enabling portion of the replication software only has access to limited information on the state of data volumes in cluster 100. Additionally, fail overs from one location to another are not handled optimally. Human intervention may be required to restart an application on the server at the surviving location. Finally, it is uncertain which location continues operations after both link 192 and link 194 fail.