1. Field of the Invention
The present invention relates to computer clusters and methods for cluster application recovery. More particularly, the invention concerns a recovery technique for improving cluster application availability during cluster recovery processing.
2. Description of the Prior Art
By way of background, managed data processing clusters are commonly used to implement the server tier in a client-server architecture. Instead of a single server providing application services to clients, application service functions are shared by an interconnected network of nodes (server cluster) operating cooperatively under the control of cluster management software. Responsibilities of the cluster management software commonly include the coordination of cluster group membership changes, fault monitoring and detection, and providing the server node application layers with distributed synchronization points. These cluster support functions allow the servers to implement a cohesive cluster application tier that provides a clustered service. Clustered services are advantageous because plural nodes can share application workloads and thus improve data processing performance as well as application availability. Exemplary applications that can run in a server cluster include network file systems, distributed databases, web servers, email servers, and many others.
Cluster architectures tend to use either a symmetric model wherein every node can service any application request, or they use an asymmetric/partitioned model wherein the application space is statically or dynamically partitioned across the cluster. According to the symmetric model, every node is homogeneous relative to the application services that the cluster provides, and there is no partitioning of the application space. Every node can process any request from clients of the clustered application. According to the partitioned model, there is static or dynamic partitioning of the application space (sometimes referred to as N-way logical partitioning), with each node servicing requests for the partition(s) that it owns.
Regardless of whether a cluster follows the symmetric or partitioned model, the loss of a cluster node will not ordinarily bring down its applications or application partitions because the cluster management software can transfer the lost server's functions to another node. Nonetheless, the failure of a cluster node (or of a communication link between nodes) is disruptive to cluster operations. When such failures occur, a process known as cluster recovery is initiated in order to restore the application functionality that was lost as a result of the failure. Unless the cluster architecture is fault tolerant, the cluster recovery procedure will nearly always result in a temporary interruption of an entire clustered application, spanning the time period from fault detection until cluster recovery and application recovery complete. This cessation of application processing adversely affects application clients, including those connected to surviving nodes of the cluster. As such, near-continuous or even continuous application availability requirements are increasingly being placed on the recovery characteristics of cluster architecture-based products.
In general, the total duration of cluster recovery stems from some or all of the following activities associated with the recovery procedure:
1) Failure detection and validation;
2) Cluster recovery via synchronized cluster membership view updates;
3) Fencing of failed nodes (to halt application I/O operations);
4) Application partition failover (for logical partitioned architectures only);
5) Recovery of write-ahead logs; and
6) Application request re-routing.
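The foregoing sequence can be sketched as a simple ordered list of recovery activities. The function and step names below are hypothetical labels chosen for illustration, not the API of any real cluster manager; the sketch merely shows that partition failover applies only under the logically partitioned model:

```python
# Illustrative sketch of the recovery sequence enumerated above.
# All names are hypothetical; real cluster managers differ in detail.

def recovery_steps(partitioned):
    """Return the ordered recovery activities for a given cluster model."""
    steps = [
        "failure detection and validation",
        "synchronized membership view update",
        "fencing of failed nodes",
    ]
    if partitioned:
        # Partition failover arises only in logically partitioned clusters.
        steps.append("application partition failover")
    steps += [
        "write-ahead log recovery",
        "application request re-routing",
    ]
    return steps
```

Under the symmetric model the failover step simply drops out of the sequence, which is one reason (discussed below) that the symmetric model recovers more cheaply.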
That the foregoing recovery steps should result in cluster application disruption for the entire cluster recovery period is a direct result of the way traditional cluster management systems and cluster applications work. In particular, the integrity of cluster application transactional processing is premised on the cluster management software guaranteeing the integrity of the cluster and the application data. Because cluster integrity cannot be guaranteed in its entirety during cluster recovery, and because data integrity cannot be guaranteed until after fencing, failover, and write-ahead log recovery, traditional clustered application systems choose to pause all transaction activity during the total recovery period. Consistent with this design approach, most of the effort to improve cluster recovery to date has focused on reducing the duration of the individual steps that contribute to the total recovery time.
With respect to failure detection and validation, this time period can be reduced by implementing multiple redundant monitoring topologies that provide multiple data points for fault detection. For example, dual-ring or triple-ring heartbeat-based monitoring topologies (which require or exploit dual networks, for instance) can reduce failure detection time markedly. However, this approach has no impact on cluster or application recovery processing itself, and the added redundancy increases the cost of the clustered application.
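The multi-data-point idea can be sketched as a quorum check over independent monitoring rings. This is an assumed, minimal model (the function name and ring labels are hypothetical): a node is declared failed only once enough rings independently report missed heartbeats, which shortens validation relative to waiting out a single long timeout on one ring:

```python
# Sketch of multi-ring heartbeat fault validation (hypothetical API).
# A node is considered failed only when a quorum of independent
# monitoring rings report missed heartbeats from it.

def node_failed(missed_by_ring, quorum):
    """missed_by_ring: mapping ring_id -> True if that ring has missed
    heartbeats from the node; quorum: number of rings required to agree."""
    return sum(missed_by_ring.values()) >= quorum
```

In a dual-ring deployment, for example, both rings must agree before failure validation completes, reducing false positives from a single congested network.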
With respect to cluster membership view updates (during cluster recovery), there is not much that can be done insofar as cluster management architectures are typically designed to serialize cluster recovery protocols and intra-cluster messaging protocols (the former pertaining to cluster recovery; the latter arising from application activity). As a result, no application activity can take place until the high priority cluster recovery protocol concludes. This by definition forces a cluster-wide pause or disruption in service.
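The effect of this serialization can be illustrated with a single shared protocol lock. This is a deliberately simplified sketch under assumed names, not a real cluster-manager implementation: because the recovery protocol holds the lock for the full membership-update phase, application messaging necessarily stalls behind it:

```python
import threading

# Sketch of why a serialized recovery protocol pauses the application
# tier: recovery and application messaging contend on one protocol lock,
# and recovery holds it end-to-end. All names are illustrative.

protocol_lock = threading.Lock()

def run_recovery_protocol(update_view):
    with protocol_lock:          # recovery holds the lock for the whole phase ...
        update_view()

def send_application_message(deliver):
    with protocol_lock:          # ... so application traffic must wait behind it
        deliver()
```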
With respect to the fencing of failed nodes and application partition failover, there is no associated cost if the cluster implements a symmetric application architecture because client requests can simply be directed to another node. In the partitioned model, however, each node services requests for the partition(s) that it owns, so the cost of application recovery always includes the cost of fencing and partition failover; the partitioned model thus bears an increased application recovery cost in comparison to the symmetric model. Synchronous logging (as opposed to asynchronous write-ahead logging) or aggressive buffer cache flushing can help reduce the failover cost, but both solutions degrade steady-state performance.
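The contrast between the two models can be sketched as two routing functions (hypothetical names and data structures, for illustration only): under the symmetric model any live node can serve a request immediately, while under the partitioned model a request for an affected partition cannot be served until fencing and failover complete:

```python
# Sketch contrasting recovery cost in the two cluster models.
# Node and partition names are illustrative.

def route_symmetric(live_nodes):
    # Every node is homogeneous: any survivor can serve the request.
    return live_nodes[0]

def route_partitioned(partition, owners, live_nodes):
    # Each partition has exactly one owning node.
    owner = owners[partition]
    if owner in live_nodes:
        return owner
    # Owner is unreachable: fencing and failover must complete first.
    raise RuntimeError("partition must be fenced and failed over first")
```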
With respect to log-based recovery and application request re-routing, many cluster systems use a journaled/log architecture (e.g., databases, file systems) that determines the inherent log-based recovery characteristics as well as the continuity of application transactions. Typically, each node in a static or dynamic partitioning model uses a single write-ahead log (WAL) for all application partitions served by that node. In order to failover a partition from a failed node to a live node, the write-ahead log on the live node must first be truncated, which entails flushing the buffer cache as well as writing out the log pages to disk. Using a log architecture that maps the write-ahead log one-to-one to a logical partition of the application (as opposed to mapping it one-to-one with a node in the cluster) would provide greater transactional isolation between unaffected application partitions and affected partitions. As a result, there would be greater transactional continuity on unaffected partitions and shorter log-based recovery time for affected partitions. As used herein, the term “unaffected partition” refers to any partition that runs on a live (non-failed) node. In contrast, an “affected partition” is a partition that was being serviced by a node that has become unreachable (e.g., due to a fault, scheduled maintenance, or any other reason). The failover of an affected partition to a live node whose unaffected partition(s) have their own write-ahead log mappings will not affect such logs. A new write-ahead log will simply be created for the partition being failed over to the live node. However, implementing this type of log architecture would require a major re-write of many cluster application products and may not be practical. Nor would such an architecture scale well with a large number of partitions (in terms of storage space needed).
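The per-partition log mapping discussed above can be sketched with minimal data structures (all hypothetical, for illustration): because each partition owns its own write-ahead log, failing a partition over to a live node creates a fresh log for the incoming partition and leaves the live node's existing logs, and hence its unaffected partitions' transactions, untouched:

```python
# Sketch of a one-WAL-per-partition mapping (the alternative log
# architecture discussed above). Classes and names are illustrative.

class Node:
    def __init__(self, name):
        self.name = name
        self.wals = {}            # partition_id -> WAL (list of log records)

def failover(partition_id, failed, live):
    """Move a partition from a failed node to a live node."""
    # Only the affected partition's log must be replayed ...
    records_to_replay = failed.wals.pop(partition_id, [])
    # ... and the incoming partition gets a new, empty WAL on the live
    # node, leaving the live node's other WALs untouched.
    live.wals[partition_id] = []
    return records_to_replay
```

In the per-node mapping the live node's single shared WAL would instead have to be truncated (buffer cache flushed, log pages written out) before the failover could proceed.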
There are storage appliances that use hardware architectures with built-in redundant access to the write-ahead log buffer in memory and the write-ahead log on disk. These systems follow fault-tolerance principles rather than recovery-based models for high availability, using a synchronous log replication scheme between pairs of nodes. This allows a sibling node to take over from where a failed node left off. However, although synchronous log replication works very well in an active-active high availability solution, it is difficult to generalize the model to arbitrary clusters without pairing nodes for synchronous log replication, which adds significantly to cost as well as complexity.
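The pairwise scheme can be sketched as follows (class and function names are assumptions for illustration): every log append is mirrored to the sibling before it is acknowledged, so on failure the sibling already holds an identical log and can resume where the failed node left off:

```python
# Sketch of pairwise synchronous WAL replication, as in the
# fault-tolerant appliances described above. Illustrative only.

class PairedNode:
    def __init__(self):
        self.wal = []
        self.sibling = None

def pair(a, b):
    """Statically pair two nodes for synchronous replication."""
    a.sibling = b
    b.sibling = a

def append_record(node, record):
    node.wal.append(record)
    node.sibling.wal.append(record)   # synchronous mirror to the sibling
    return True                        # acknowledged only after both copies exist
```

The static pairing is exactly what makes the scheme hard to generalize: every node needs a dedicated sibling with redundant access to its log, doubling the log-storage and replication cost.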