Information drives business. A disaster affecting a data center can cause days or even weeks of unplanned downtime and data loss that could threaten an organization's productivity. For businesses that increasingly depend on data and information for their day-to-day operations, this unplanned downtime can also hurt their reputations and bottom lines. Businesses are becoming increasingly aware of these costs and are taking measures to plan for and recover from disasters.
Two areas of concern when a failure occurs, as well as during the subsequent recovery, are preventing data loss and maintaining data consistency between primary and secondary storage areas. One simple strategy includes backing up data onto a storage medium such as a tape, with copies stored in an offsite vault. Duplicate copies of backup tapes may be stored onsite and offsite. More complex solutions include replicating data from local computer systems to backup local computer systems and/or to computer systems at remote sites.
Not only can the loss of data be critical, but the failure of hardware and/or software can also cause substantial disruption. Clustering is a strategy wherein computer systems and storage devices are interconnected, typically at high speeds within a local data center, for the purpose of improving reliability, availability, serviceability, and/or performance via load balancing. Redundant interconnections between the computer systems are typically included as well, and the collection of computer systems, storage devices, and redundant interconnections is referred to herein as a cluster. The cluster appears to users as a single highly available system.
Different types of clusters may be established to perform independent tasks, to manage diverse hardware architectures performing similar tasks, or when local and backup computer systems are far apart physically.
Often, computer systems within a cluster use a common pool of storage devices, and the purpose of the cluster is to provide an alternative processing resource for the data on the shared storage devices in the event of failure of one of the computer systems. Only one of the computer systems in the cluster generally provides processing resources with respect to a particular software application. The computer system currently providing processing resources in the cluster for a particular software application is referred to herein as an active node, and other computer systems in the cluster are referred to herein as backup nodes. The terms “active node” and “backup node” are used in the context of a particular software application, such that an active node for one application may serve as a backup node for another application, and a backup node for a third application may serve as an active node for yet another application.
Each clustered computer system typically runs special software to coordinate the activities of the computer systems in the cluster. This software is referred to herein as a cluster management application. A cluster management application may monitor the health of sites in a distributed system and restart an application on another node when the node running the application fails. Typically, cluster management functions are limited to such clustering operations as monitoring, starting, and stopping resources.
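By way of illustration only, the monitoring, starting, and stopping operations described above can be sketched as follows. The class name `Cluster`, its fields, and the policy of restarting on the first healthy node are hypothetical choices made for this sketch and do not describe any particular cluster management product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Cluster:
    """Hypothetical model: node health plus the active node for each application."""
    healthy: Dict[str, bool]                              # node name -> is the node up?
    active: Dict[str, str] = field(default_factory=dict)  # application -> active node

    def start(self, app: str, node: str) -> None:
        """Start an application on a node, refusing nodes that are offline."""
        if not self.healthy.get(node, False):
            raise RuntimeError(f"cannot start {app} on offline node {node}")
        self.active[app] = node

    def stop(self, app: str) -> None:
        """Stop an application; it then has no active node."""
        self.active.pop(app, None)

    def monitor(self) -> List[Tuple[str, str, str]]:
        """Restart each application whose active node has failed.

        Returns (application, failed node, new active node) for every move.
        Assumption: the first healthy node is an acceptable backup node.
        """
        moves = []
        for app, node in list(self.active.items()):
            if self.healthy.get(node, False):
                continue  # active node is still up; nothing to do
            backups = [n for n, up in self.healthy.items() if up]
            if not backups:
                self.stop(app)  # no backup node available; application stays down
                continue
            self.start(app, backups[0])
            moves.append((app, node, backups[0]))
        return moves
```

In this sketch, a node that is the active node for one application can simultaneously be a backup node for another, because the active node is tracked per application rather than per cluster.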
In many situations, disaster recovery requires the ability to move a software application and associated data to an alternate site for an extended period, or even permanently, as a result of an event, such as a fire, that destroys a site. For these more complicated situations, strategies and products to reduce or eliminate the threat of data loss and minimize downtime in the face of a site-wide disaster are becoming increasingly available.
Replication facilities exist that replicate data in real time to a disaster-safe location. Data are continuously replicated from a primary node, which may correspond to a computer system in control of a storage device, to a secondary node. The nodes to which data are copied may reside in local backup clusters or in remote “failover” sites, which can take over when another site fails. Replication allows persistent availability of data at all sites.
The terms “primary node” and “secondary node” are used in the context of a particular software application, such that a primary node for one application may serve as a secondary node for another application. Similarly, a secondary node for one application may serve as a primary node for another application.
The term “application group” is used to describe both an application and the corresponding data. If a primary application group on one cluster becomes unavailable for any reason, replication enables both the application and the data to be immediately available using the secondary application group in another cluster or site.
To accommodate the variety of business needs, some replication facilities provide remote mirroring of data and replication of data over a wide area or distributed network such as the Internet. However, different types of storage typically require different replication methods. Replication facilities are available for a variety of storage solutions, such as database replication products and file system replication products, although typically a different replication facility is required for each type of storage solution.
Replication facilities provide such functionality as enabling a primary and secondary node to reverse roles when both are functioning properly. Reversing roles involves such replication operations as stopping the application controlling the replicated data, demoting the primary node to a secondary node, promoting the original secondary node to a primary node, and restarting the application at the new primary node. Another example of functionality of a replication facility involves determining when a primary node is down, promoting the secondary node to a primary node, enabling transaction logging, and starting the application that controls the replicated data on the new primary node. In addition, when the former primary node recovers from failure, the replication facility can prevent the application from starting at the former primary node, since the application group is already running at the newly promoted node (the former secondary node). The transaction log can then be used to synchronize data at the former and new primary nodes.
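By way of illustration only, the two sequences described above (a planned role reversal and a takeover after a primary node failure) can be sketched as follows. The class `ReplicationPair` and its method names are hypothetical and are used solely to make the order of operations concrete; they do not describe the interface of any existing replication facility.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReplicationPair:
    """Hypothetical model of one application group replicated between two nodes."""
    primary: str
    secondary: str
    app_node: Optional[str] = None      # node on which the application is running
    transaction_logging: bool = False
    steps: List[str] = field(default_factory=list)  # record of operations performed

    def reverse_roles(self) -> None:
        """Planned role reversal; both nodes are assumed to be functioning."""
        self.steps.append(f"stop application on {self.primary}")
        self.app_node = None
        self.steps.append(f"demote {self.primary} to secondary")
        self.steps.append(f"promote {self.secondary} to primary")
        self.primary, self.secondary = self.secondary, self.primary
        self.steps.append(f"restart application on {self.primary}")
        self.app_node = self.primary

    def take_over(self) -> None:
        """Unplanned takeover; the current primary is assumed to be down."""
        failed = self.primary
        self.steps.append(f"promote {self.secondary} to primary")
        self.primary, self.secondary = self.secondary, failed
        self.steps.append("enable transaction logging")
        self.transaction_logging = True  # log changes for later resynchronization
        self.steps.append(f"start application on {self.primary}")
        self.app_node = self.primary

    def may_start_application(self, node: str) -> bool:
        """Allow a start only on the current primary, and only if the
        application is not already running on a different node."""
        return node == self.primary and self.app_node in (None, node)
```

For example, after `take_over()` is called on a pair whose primary was node "A", `may_start_application("A")` returns `False`, which corresponds to preventing the application from starting at the recovered former primary node.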
Both clustering and replicating data affect an application group. Clustering activities may involve starting and stopping a node or an application. Starting and stopping nodes and applications affects primary and secondary application group relationships, active and backup node relationships for cluster management, and primary and secondary node replication relationships. Replicating data from one node to another requires proper knowledge of whether nodes are online or offline to properly determine changes in the direction in which data should be replicated or to trigger taking over replication at one node by another node.
Currently, most cluster management applications and replication facilities are sold as independent products, and existing products do not coordinate clustering and replicating data. Situations that may leave data in an inconsistent state, such as continuing to replicate data to an offline node, may arise without coordination between cluster management applications and replication facilities.
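By way of illustration only, the kind of coordination that is absent from independent products can be sketched as a decision that takes node status, as known to a cluster management application, and yields a replication action. The names `Action` and `decide` and the four outcomes below are assumptions made for this sketch; they are not drawn from any existing product.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical replication actions chosen from cluster node status."""
    REPLICATE = "replicate from primary to secondary"
    PAUSE = "pause replication; secondary node is offline"
    TAKE_OVER = "promote secondary; primary node is offline"
    HALT = "no node is online"


def decide(primary_online: bool, secondary_online: bool) -> Action:
    """Choose a replication action from the online status of both nodes."""
    if primary_online and secondary_online:
        return Action.REPLICATE
    if primary_online:
        # Avoid the inconsistent state of replicating to an offline node.
        return Action.PAUSE
    if secondary_online:
        return Action.TAKE_OVER
    return Action.HALT
```

Without access to node status, a replication facility has no basis for choosing `PAUSE` over `REPLICATE`, which is the situation described above in which data continue to be replicated to an offline node.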
What is needed is a framework for a management system that allows automated management of clusters and data replication. The management system should coordinate the management of clusters and replication so that the data produced by both activities are consistent. Cluster and replication activities should not interfere with each other, and the data should be recoverable in the event of disaster. Furthermore, the system should ensure that hardware and software failure can be detected and a backup hardware or software system can take over, while maintaining the replication of data.