Information drives business. A hardware or software failure affecting a data center can cause days or even weeks of unplanned downtime and data loss that could threaten an organization's productivity. For businesses that increasingly depend on data and information for their day-to-day operations, this unplanned downtime can also hurt their reputations and bottom lines. Businesses are becoming increasingly aware of these costs and are taking measures to plan for and recover from hardware and software failure and disasters affecting entire data centers.
One strategy to recover from failure of hardware and/or software is clustering, wherein computer systems and storage devices are interconnected, typically at high speeds, within a local data center. Clustering is used for various purposes, including improving reliability, availability, serviceability, and/or performance via load balancing. Redundant interconnections between the computer systems are typically included, and the collection of computer systems, storage devices, and redundant interconnections is referred to herein as a cluster. The cluster appears to users as a single, highly available system. Different types of clusters may be established to perform independent tasks, to manage diverse hardware architectures performing similar tasks, or to accommodate local and backup computer systems that are physically far apart.
Often, computer systems within a cluster use a common pool of storage devices, and the purpose of the cluster is to provide an alternative processing resource for the data on the shared storage devices in the event of failure of one of the computer systems. In some clustering environments, only one of the computer systems in the cluster provides processing resources with respect to a particular software application. The computer system currently providing processing resources in the cluster for a particular software application is referred to herein as the primary node, and other computer systems in the cluster are referred to herein as backup, or secondary, nodes.
Each clustered computer system typically runs special software to coordinate the activities of the computer systems in the cluster. This software is referred to herein as a cluster manager. A cluster manager may monitor the “health” of sites in a distributed system and restart an application on another node when the node running the application fails. Typically, cluster management functions are limited to such clustering operations as monitoring, starting, and stopping resources. Communication between nodes in a cluster is typically limited to messages to check the “heartbeat” of other nodes in the cluster and to ensure proper operation of the cluster.
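The heartbeat monitoring and restart behavior described above can be illustrated with a minimal sketch. It is not drawn from any particular cluster manager; the class names, the `timeout` value, and the rule of choosing the first healthy backup in name order are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_heartbeat: float  # time at which the most recent heartbeat was received


@dataclass
class ClusterManager:
    nodes: dict           # maps node name -> Node
    primary: str          # name of the node currently running the application
    timeout: float = 5.0  # hypothetical: seconds of silence before a node is treated as failed

    def record_heartbeat(self, name, now):
        """Note that a heartbeat message arrived from the named node."""
        self.nodes[name].last_heartbeat = now

    def is_alive(self, name, now):
        return now - self.nodes[name].last_heartbeat <= self.timeout

    def check(self, now):
        """Return the node that should run the application, failing over if needed."""
        if self.is_alive(self.primary, now):
            return self.primary
        # The primary is silent: pick the first healthy backup node (name order is an assumption).
        for name in sorted(self.nodes):
            if name != self.primary and self.is_alive(name, now):
                self.primary = name  # stands in for restarting the application on that node
                return name
        raise RuntimeError("no healthy node available")


mgr = ClusterManager(nodes={"a": Node("a", 0.0), "b": Node("b", 0.0)}, primary="a")
mgr.record_heartbeat("b", 8.0)
print(mgr.check(now=10.0))  # "a" has been silent for 10 s, so the application moves to "b"
```

Note that the only information exchanged in this sketch is the heartbeat itself, which mirrors the limited inter-node communication described above.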
Clustering and storage technologies have grown substantially in recent years, and changes in one technology sometimes require changes in the other for interoperability. Most storage devices in use today are not specially adapted to operate in a clustering environment, and configuration data about the storage devices are typically maintained by host computer systems acting as servers for the storage devices. In some environments, configuration data about storage resources are maintained in files or databases on the host computer system. If a server for a given storage resource fails, configuration data about the storage resource can become inaccessible to other nodes in the cluster. If the failed server had changed the configuration of the storage resource before failing, a new node resuming operations of the failed node would be unaware of the configuration change and may be unable to communicate properly with the reconfigured storage resource.
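The failure mode described above can be made concrete with a short sketch. The `Host` class, the node names, the resource name `vol0`, and the `volume_size_gb` setting are hypothetical and are used only to show how a configuration change recorded on a single host is lost to the node that takes over.

```python
class Host:
    """A host computer system that keeps storage configuration data locally."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.local_config = {}  # configuration data held only on this host

    def read_config(self, resource):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return self.local_config[resource]


primary = Host("node1")
backup = Host("node2")

# Both hosts start with the same view of the storage resource.
primary.local_config["vol0"] = {"volume_size_gb": 100}
backup.local_config["vol0"] = {"volume_size_gb": 100}

# The primary reconfigures the resource and records the change only locally,
# then fails before any other node learns of it.
primary.local_config["vol0"]["volume_size_gb"] = 200
primary.up = False

# The backup cannot read the failed host's configuration data...
try:
    recovered = primary.read_config("vol0")
except ConnectionError:
    recovered = None

# ...and its own copy no longer matches the reconfigured resource.
stale = backup.read_config("vol0")
print(recovered, stale)
```

In this sketch the backup node resumes with a 100 GB view of a resource that was reconfigured to 200 GB, which is the kind of mismatch that can prevent proper communication with the storage resource.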
What is needed is a system that enables other nodes in a cluster to resume operations of a failed node. These operations should include storage management services that allow configuration changes to be made dynamically to storage resources. Storage configuration information should be made available to some or all nodes in a cluster in as close to real time as possible after a storage configuration change is made. The solution should impose minimal or no overhead on operation of the nodes. If a node that has made a resource configuration change fails, the resource configuration change should be made available to another node resuming operations of the failed node.