Fault tolerance, including the ability to recover from failures, is essential to the efficient operation of many computer systems and system components. ‘Failover’ recovery is a backup operational mode in which the functions of a system component (such as a processor, storage device, server or database) are automatically taken over by secondary system components when the primary component suffers a failure or becomes unavailable for other reasons.
In the past, when all data storage was attached to individual server computers in very basic point-to-point configurations, any failure of a single server could make its data inaccessible until the server recovered. More recently, developments such as storage area networks (SANs) have enabled any-to-any connections between servers and data storage systems. A failed path between a server and a storage system may result from the failure of any component in the path, but redundant components and multiple connection paths are typically provided within a storage network to ensure that connectivity remains possible when one or more components or paths fail. Automatic failover recovery enables normal functions to be maintained despite the inevitability of failures affecting components of a computer system.
One possible failover recovery scheme for dealing with server failures is to employ server redundancy, with a secondary server having full access to the state information of a primary server so that the secondary server can continue processing commands when the primary server fails. The secondary server is made aware of any reservations that a communication-initiator may hold on resources that were initially accessible via the primary server. However, there is a significant overhead associated with maintaining detailed server state information at other servers.
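The state-mirroring scheme and its overhead can be illustrated with a minimal sketch. The class name ReplicatedServer, its attributes and the message counter are hypothetical names chosen for illustration; a real system would replicate far more state than a reservation table, and would do so over a network.

```python
class ReplicatedServer:
    """Hypothetical server that mirrors its reservation table to secondaries."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.reservations = {}   # resource -> initiator holding the reservation
        self.replicas = []       # secondary servers that mirror this server's state
        self.sync_messages = 0   # replication messages sent (the overhead)

    def reserve(self, resource: str, initiator: str) -> None:
        """Record a reservation locally and copy it to every secondary."""
        self.reservations[resource] = initiator
        for replica in self.replicas:
            replica.reservations[resource] = initiator
            self.sync_messages += 1   # one extra message per secondary, per change


primary = ReplicatedServer("primary")
secondary = ReplicatedServer("secondary")
primary.replicas.append(secondary)

primary.reserve("disk0", "client-1")
primary.reserve("disk1", "client-2")

# The secondary already knows every reservation, so it could continue
# processing commands if the primary failed, at the cost of one
# replication message for every state change.
print(secondary.reservations, primary.sync_messages)
```

In this sketch every state change on the primary costs one additional message per secondary, which is the kind of overhead the scheme incurs even when no failure ever occurs.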
A failover recovery solution could entail a secondary server using an IP address take-over mechanism so that all future commands targeted at a failed primary server will be received and handled by the secondary server. Instead of maintaining detailed state information for the primary server at a secondary server, any pending command that was not completed can, in some environments, be allowed to time out. Such a solution would typically require a status-checking mechanism, such as a ‘heartbeat’ mechanism, for the secondary server to detect a failure of the primary server, in addition to the overhead of the IP address take-over mechanism. As well as incurring these overheads, such a solution would not automatically deal with dangling reservations (described below), and so reservation information would have to be saved persistently by the primary server to enable that information to be retrieved during recovery of a failed server. In a simple implementation, each server could have a backup server that performs heartbeats and is able to perform IP address take-over operations, but doubling the number of servers for redundancy is an expensive option.
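The combination of heartbeat-based failure detection and address take-over can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the names HeartbeatMonitor, SecondaryServer and timeout_s are hypothetical, timestamps are passed in explicitly instead of being read from a clock, and adding an address to a set stands in for a real IP address take-over.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical failure detector run by the secondary server."""
    timeout_s: float           # assumed: silence longer than this means failure
    last_seen_s: float = 0.0   # time of the most recent heartbeat received

    def record_heartbeat(self, now_s: float) -> None:
        self.last_seen_s = now_s

    def primary_failed(self, now_s: float) -> bool:
        return (now_s - self.last_seen_s) > self.timeout_s


@dataclass
class SecondaryServer:
    own_address: str
    monitor: HeartbeatMonitor
    served_addresses: set = field(default_factory=set)

    def __post_init__(self) -> None:
        self.served_addresses.add(self.own_address)

    def check_primary(self, primary_address: str, now_s: float) -> bool:
        """Take over the primary's address once its heartbeats have stopped."""
        already_taken = primary_address in self.served_addresses
        if self.monitor.primary_failed(now_s) and not already_taken:
            # Stand-in for a real IP address take-over.
            self.served_addresses.add(primary_address)
            return True
        return False


monitor = HeartbeatMonitor(timeout_s=3.0)
secondary = SecondaryServer("10.0.0.2", monitor)

monitor.record_heartbeat(now_s=1.0)
print(secondary.check_primary("10.0.0.1", now_s=2.0))   # heartbeat is recent
print(secondary.check_primary("10.0.0.1", now_s=5.5))   # heartbeat is overdue
```

Note that nothing in this sketch carries over the reservations held at the failed primary. After take-over the secondary answers on the primary's address but has no knowledge of which resources were reserved, which is why reservation information would have to be saved persistently elsewhere.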
A dangling reservation exists when a communication-initiator client has reserved a resource (such as a storage device) for exclusive use, but the initiator is no longer able to access the resource due to failure of the server that executed the reservation. The initiator client is unable to cancel the reservation and this could render the reserved resource unusable by any clients—unless another server has some mechanism for taking over management of existing reservations.
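A dangling reservation can be reproduced in a small model. The sketch below is illustrative only: Server, StorageDevice and ReservationError are hypothetical names, and the model assumes that a reservation can be released only through the server that executed it.

```python
class ReservationError(Exception):
    """Raised when a reserve or release request cannot be honoured."""


class Server:
    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True


class StorageDevice:
    def __init__(self) -> None:
        self.holder = None   # (client, server) while reserved, otherwise None

    def reserve(self, client: str, via: Server) -> None:
        if not via.alive:
            raise ReservationError(f"{via.name} is unavailable")
        if self.holder is not None:
            raise ReservationError("device is already reserved")
        self.holder = (client, via)

    def release(self, client: str, via: Server) -> None:
        # Modelling assumption: only the reserving client, acting through
        # the server that executed the reservation, may release it.
        if not via.alive:
            raise ReservationError(f"{via.name} is unavailable")
        if self.holder != (client, via):
            raise ReservationError("reservation not held via this server")
        self.holder = None

    def is_dangling(self) -> bool:
        """True when a reservation exists but its server has failed."""
        return self.holder is not None and not self.holder[1].alive


server_a, server_b = Server("A"), Server("B")
device = StorageDevice()

device.reserve("client-1", via=server_a)
server_a.alive = False            # the server that executed the reservation fails

print(device.is_dangling())       # the reservation is now dangling
try:
    device.reserve("client-2", via=server_b)
except ReservationError as err:
    print("client-2 blocked:", err)
```

In this model the device stays unusable until server A recovers, because server B has no mechanism for taking over or clearing the reservation that server A executed.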