1. Field
The present disclosure relates to methods and systems for managing failover events in computer servers.
2. Description of Related Art
Failover is a backup operational mode in which the functions of a system component (for example, a processor, server, network, or database) are assumed by a secondary system component when the primary component becomes unavailable through either failure or scheduled down time. Failover protocols and components are often implemented for mission-critical systems that must be constantly available, or for busy commercial systems in which down time causes substantial lost income or added expense. A failover procedure transfers tasks to a standby system component so that the failover is not discernible, or is only minimally discernible, to the end user. A failover event is triggered when a failure of a primary component is detected, causing the failover protocol to shift tasks to the designated backup component. Automatic failover procedures thereby maintain normal system functions despite unplanned equipment failures.
Modern server systems, such as storage area networks (SANs), enable any-to-any connectivity between servers and data storage systems but require redundant sets of all the components involved. Multiple connection paths, each with redundant components, are used to help ensure that the connection remains viable even if one or more paths fail. Multiple redundancies increase the chances that a failover event will not be disruptive, but also increase system capital and operating costs. In systems with fewer redundancies, or where it is deemed undesirable to maintain more than one active component at any given instant, rapid failure detection is key to triggering a seamless failover response. Current failover systems rely on the backup component monitoring the “heartbeat” of the primary component; that is, the components communicate by exchanging packets called “heartbeats” at regular intervals, for example at approximately 2 Hz (twice per second). If the primary component fails, the backup component will not receive the expected heartbeat packet, and the missed heartbeat immediately triggers a transfer of the primary role to the backup component. This arrangement prevents file corruption caused by multiple competing primary components. Notwithstanding the advantages of present failover systems, it is desirable to provide an improved failover system and procedures.
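By way of illustration only, the heartbeat monitoring described above may be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation of any particular failover product: the class name `BackupNode`, its fields, and the `missed_allowed` tolerance are hypothetical names introduced here, and timestamps are passed in explicitly rather than read from a real clock or network. Only the 2 Hz heartbeat rate is taken from the description.

```python
from dataclasses import dataclass

# 2 Hz heartbeat rate from the description: one packet every 0.5 seconds.
HEARTBEAT_INTERVAL_S = 0.5


@dataclass
class BackupNode:
    """Hypothetical backup component that watches the primary's heartbeats."""

    interval_s: float = HEARTBEAT_INTERVAL_S
    # Assumption: 0 means a single overdue heartbeat triggers failover.
    missed_allowed: int = 0
    last_heartbeat_s: float = 0.0
    is_primary: bool = False

    def on_heartbeat(self, now_s: float) -> None:
        """Record the arrival time of a heartbeat packet from the primary."""
        self.last_heartbeat_s = now_s

    def check(self, now_s: float) -> bool:
        """Assume the primary role if the expected heartbeat is overdue."""
        deadline_s = self.last_heartbeat_s + self.interval_s * (self.missed_allowed + 1)
        if not self.is_primary and now_s > deadline_s:
            self.is_primary = True  # failover: backup takes over the primary role
        return self.is_primary


# Usage: heartbeats arrive at t=0.0 s and t=0.5 s, then the primary goes silent.
backup = BackupNode()
backup.on_heartbeat(0.0)
backup.on_heartbeat(0.5)
print(backup.check(0.9))  # False: the next heartbeat is not yet overdue
print(backup.check(1.2))  # True: the heartbeat expected by t=1.0 s never arrived
```

In this sketch the detection latency is bounded by the heartbeat interval multiplied by the number of tolerated misses plus one, which illustrates why the heartbeat rate governs how quickly a failover can be triggered.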