High-availability clusters, also known as failover clusters, are groups of computers that support server applications which can be reliably utilized with a minimum of downtime. High-availability clusters operate by harnessing redundant computers, or nodes, in groups or clusters that provide continued service when system components fail. Without a high-availability cluster, if a server executing a specific application crashes, the application may be unavailable until the crashed server is fixed. High-availability clusters remedy this situation by detecting hardware and/or software faults and immediately restarting the application on another computer system without requiring administrative intervention, a process known as failover. As part of this process, clustering software may configure the other computer system before starting the application on it. For example, appropriate file systems may need to be imported and mounted, network hardware may have to be configured, and some supporting applications may need to be running as well.
High-availability clusters are often used for critical databases, file sharing on a network, business applications, and customer services, such as electronic commerce websites. High-availability cluster implementations build redundancy into a cluster to eliminate single points of failure, including multiple network connections and data storage which is redundantly connected via storage area networks. High-availability clusters typically use a heartbeat private network connection to monitor the health and status of each node in the cluster. The term logical host or cluster logical host is used to describe the network address which is used to access services provided by a high-availability cluster. This logical host identity is not tied to a single cluster node. The logical host identity is actually a network address or hostname that is linked with the service(s) provided by the cluster. The term “logical host” may refer to a server application identified by the logical host identity. If a cluster node with a running database goes down, the database will be restarted on another cluster node, and the network address used to access the database will reference the new node so that the users can access the database again.
Data for a server application may be protected by executing a backup program that creates a backup of the data, and by executing a restore program that restores the data from a previous backup. Two key issues need to be addressed for a backup and restore application to protect data within a high-availability cluster environment. First, a backup and restore application needs to determine the unique owning host of each file system path name so that the backup and restore application has a consistent view of data no matter which high-availability cluster node is executing a server application. Second, if a server application is configured to be periodically backed up, then the backup and restore application is responsible for directing the backup program to the correct high-availability cluster node. Because server applications may fail over between high-availability cluster nodes, the backup and restore application's operations for a specific server application do not necessarily target the same high-availability cluster node all of the time.
The resolution for the above two issues is implemented by a software framework that provides a platform independent solution for various high-availability cluster software on various platforms. The software framework uses a map object that queries the high-availability cluster configuration to provide a generally static “Internet Protocol address to file system path name” mapping. The map object is a program specific to the platform and high-availability cluster, but the software framework normalizes the output to be generic, such that the map object can be processed in a platform independent way. This normalization abstracts the logic of different high-availability clusters for the backup and restore application, and allows the decoupling of the backup and restore application from high-availability cluster product internals.
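The normalization step described above may be illustrated by the following minimal sketch. The two raw output formats, the cluster type names, and the function names are hypothetical and are not taken from any particular high-availability cluster product; the sketch also assumes IPv4 addresses, since one of the illustrative formats uses a colon as a separator. It shows only how differently shaped, platform-specific output could be reduced to one generic "Internet Protocol address to file system path names" mapping.

```python
from typing import Callable, Dict, List

# Generic, platform-independent form: IP address -> owned file system paths.
Mapping = Dict[str, List[str]]


def parse_colon_format(raw: str) -> Mapping:
    """Hypothetical cluster A: one 'ip:path1,path2' record per line."""
    mapping: Mapping = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        ip, _, paths = line.partition(":")
        mapping.setdefault(ip, []).extend(p for p in paths.split(",") if p)
    return mapping


def parse_pair_format(raw: str) -> Mapping:
    """Hypothetical cluster B: one 'path ip' pair per line."""
    mapping: Mapping = {}
    for line in raw.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        path, ip = parts
        mapping.setdefault(ip, []).append(path)
    return mapping


# Registry of platform-specific parsers (names are assumptions).
PARSERS: Dict[str, Callable[[str], Mapping]] = {
    "cluster_a": parse_colon_format,
    "cluster_b": parse_pair_format,
}


def normalize(cluster_type: str, raw: str) -> Mapping:
    """Reduce platform-specific map output to the generic mapping,
    with duplicate paths removed and entries in sorted order."""
    mapping = PARSERS[cluster_type](raw)
    return {ip: sorted(set(paths)) for ip, paths in sorted(mapping.items())}
```

Under these assumptions, the same cluster configuration reported in either raw format yields an identical normalized mapping, so a caller never needs to know which cluster product produced it.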
Using a software framework to address the first key issue of applying a backup and restore application to high-availability clusters is relatively straightforward. The output of the map object is presented in a normalized format in which an Internet Protocol address and its owning file system path names are grouped together. Therefore, the ownership of each file system path is uniquely determined. The mapping from an Internet Protocol address to its owning file system path names is cached in an internal data structure to reduce the number of subsequent executions of the map object, because some operating system or high-availability cluster commands called by the map object are considered expensive, especially for high-availability clusters with a large number of server applications. Because the mapping from an Internet Protocol address to its owning file system path names is rather stable in usual situations (that is, the cluster configuration does not change frequently in a production environment), it is reasonable to set the cache timeout value to 30 days or even 365 days.
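The caching behavior described above may be sketched as follows. The class name, the method names, and the longest-prefix rule used to resolve a path to its owning address are assumptions made for illustration. The sketch keeps the cache in memory only and measures its age with a monotonic clock, which suits a long-running process but would not survive a restart.

```python
import time
from typing import Callable, Dict, List, Optional

Mapping = Dict[str, List[str]]  # IP address -> owned file system paths


class MapCache:
    """Caches the map object's output and refreshes it after a timeout."""

    def __init__(self, run_map_object: Callable[[], Mapping],
                 timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._run = run_map_object      # the expensive call being avoided
        self._timeout = timeout_seconds
        self._clock = clock             # injectable so tests can fake time
        self._mapping: Optional[Mapping] = None
        self._built_at = 0.0

    def get(self) -> Mapping:
        """Return the cached mapping, rebuilding it if absent or expired."""
        now = self._clock()
        if self._mapping is None or now - self._built_at >= self._timeout:
            self._mapping = self._run()
            self._built_at = now
        return self._mapping

    def owner_of(self, path: str) -> Optional[str]:
        """Return the IP address owning `path`, or None if unowned.
        Assumption: the longest matching path prefix determines the owner."""
        best_ip, best_len = None, -1
        for ip, roots in self.get().items():
            for root in roots:
                r = root.rstrip("/") or "/"
                under = path == r or path.startswith(r.rstrip("/") + "/")
                if under and len(r) > best_len:
                    best_ip, best_len = ip, len(r)
        return best_ip


THIRTY_DAYS = 30 * 24 * 3600  # timeout value mentioned in the text, in seconds
```

With a timeout of `THIRTY_DAYS`, repeated ownership lookups within that period execute the map object once; the first lookup after the period has elapsed executes it again.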
Using a software framework to address the second key issue of applying a backup and restore application to high-availability clusters is more challenging. When the task dispatcher of the backup and restore application receives a task request, the task dispatcher needs to direct the backup or restore task request to the correct node. While directing the task request is not an issue for a standalone environment, the challenge for a high-availability cluster environment is the uncertainty of which node is executing the target application at the moment. The backup and restore application may itself be executed in a high-availability cluster environment. Furthermore, the method to direct a local task request and the method to direct a remote task request are different. Therefore, the key to resolving the second issue is determining whether or not the destination of the task request is the local node that is executing the task dispatcher. The prior art solution for checking whether the task request destination is the local node is based on the output of the map object, which is cached in an internal data structure, to obtain the list of logical hosts executing on the local node. The subsequent task requests may or may not call the map object to update the cache, depending on when the cache was built relative to a defined time frame.
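The local-versus-remote check described above may be sketched as follows. The `Task` and `Dispatcher` names and the returned strings are hypothetical. The sketch assumes that the keys of the cached mapping are the addresses of the logical hosts executing on the local node, which is one possible reading of the prior art approach rather than a confirmed detail.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Task:
    kind: str         # "backup" or "restore"
    destination: str  # logical host address the task is aimed at


class Dispatcher:
    """Routes a task locally or remotely based on a cached mapping."""

    def __init__(self, cached_mapping: Dict[str, List[str]]) -> None:
        # Assumption: keys are the logical hosts on the local node,
        # as of the moment the cache was built.
        self._cached = cached_mapping

    def is_local(self, destination: str) -> bool:
        """True if the cached view places the destination on this node."""
        return destination in self._cached

    def dispatch(self, task: Task) -> str:
        # The two branches stand in for the distinct local and remote
        # methods of directing a task request.
        if self.is_local(task.destination):
            return f"local:{task.kind}:{task.destination}"
        return f"remote:{task.kind}:{task.destination}"
```

In this sketch the routing decision is only as current as the cached mapping, so a logical host that has failed over since the cache was built would still be judged by its earlier location until the cache is refreshed.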