Storage arrays, i.e., storage systems having a plurality of storage devices (e.g., solid state drives (SSDs), magnetic disks, and/or optical drives), are becoming more widely used. As compared to a storage system with only a single storage device, storage arrays provide the benefits of better data reliability (e.g., one storage device may fail, and data can still be retrieved from the remaining storage devices) and higher read/write throughput (e.g., multiple storage devices can service read/write requests in parallel).
Along with such benefits, storage arrays also come with some overhead. Part of the overhead is the need for a controller to facilitate access to the plurality of storage devices (e.g., determining to which one or more of the storage devices a data block should be written, determining to which one or more of the storage devices a read request should be sent for processing, etc.). A storage system, of course, is only as robust as its weakest link. If its controller fails, the added redundancy provided by the multiple storage devices is of little use. Therefore, it is not uncommon for a storage system to include multiple controllers. Typically, only one of the controllers is active (e.g., is responsible for facilitating access to the plurality of storage devices) at any time. A controller that is active may be called an active controller, whereas a controller that is currently inactive, but may become active upon failure of the active controller, may be called a standby controller. A controller that has experienced a failure and is temporarily unable either to service any requests or to become active (e.g., is restarting as part of a software update) may be called a failed controller.
There are currently two main approaches for controlling which controller should be the active controller and which one should be the standby controller. A first approach employs a direct link between two controllers to transmit “heartbeats” from the active controller to the standby controller. Heartbeats may refer to periodically transmitted pulses, not unlike the human heartbeat. The standby controller expects to receive a heartbeat from the active controller once every time period H. Typical values of H are in the range of 1 to 2 seconds. Upon failing to detect one or more successive heartbeats (typically 5 to 10 heartbeats) within an expected timeframe (e.g., 5 to 20 seconds), the standby controller assumes the active controller has failed, and the standby controller becomes the active controller (e.g., starts servicing any requests to the storage system).
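By way of illustration only, the heartbeat-based takeover of the first approach may be sketched as follows. The class name `StandbyController`, its fields, and the choice to measure silence from the most recently received heartbeat are assumptions made for this sketch and are not taken from any particular storage system. Timestamps are passed in explicitly so the logic can be followed without a real clock.

```python
from dataclasses import dataclass


@dataclass
class StandbyController:
    """Hypothetical standby controller that watches for heartbeats."""
    heartbeat_period: float = 1.0   # H, in seconds (assumed value)
    missed_threshold: int = 5       # successive missed heartbeats tolerated (assumed value)
    last_heartbeat: float = 0.0     # time the last heartbeat was received
    role: str = "standby"

    def on_heartbeat(self, now: float) -> None:
        # Record the arrival time of a heartbeat from the active controller.
        self.last_heartbeat = now

    def poll(self, now: float) -> str:
        # Take over once the silence spans missed_threshold heartbeat periods.
        timeout = self.heartbeat_period * self.missed_threshold
        if self.role == "standby" and now - self.last_heartbeat >= timeout:
            self.role = "active"
        return self.role


standby = StandbyController(heartbeat_period=1.0, missed_threshold=5)
standby.on_heartbeat(now=10.0)
print(standby.poll(now=14.0))  # standby: only 4 seconds of silence
print(standby.poll(now=15.0))  # active: 5 seconds of silence reached
```

In this sketch, H = 1 second and a threshold of 5 missed heartbeats give a 5 second window, which falls at the low end of the 5 to 20 second timeframe mentioned above.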
Such an approach, however, has some drawbacks. As one drawback, there is a lag between the time the active controller fails (and stops servicing requests) and the time the standby controller detects the failure of the active controller (and starts servicing requests). During this time lag (which may be equal to or greater than time H), the storage system is unable to service any requests.
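The size of this lag may be estimated with a short calculation. The sketch below assumes that the standby controller declares a failure once a fixed number of heartbeat periods has elapsed since the last heartbeat it received, and that transmission and polling delays are negligible. The function name `detection_lag_bounds` is hypothetical. Under these assumptions, the lag is longest when the active controller fails immediately after sending a heartbeat, and shortest when it fails just before the next heartbeat was due.

```python
def detection_lag_bounds(heartbeat_period: float, missed_threshold: int):
    """Return (shortest, longest) lag in seconds under the stated assumptions."""
    timeout = heartbeat_period * missed_threshold
    # Failure just before the next heartbeat: almost one period already elapsed.
    shortest = timeout - heartbeat_period
    # Failure right after a heartbeat: the full timeout must still elapse.
    longest = timeout
    return shortest, longest


print(detection_lag_bounds(1.0, 5))    # (4.0, 5.0)
print(detection_lag_bounds(2.0, 10))   # (18.0, 20.0)
```

With the typical values given for the first approach (H of 1 to 2 seconds and 5 to 10 missed heartbeats), this estimate places the outage at roughly 4 to 20 seconds.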
In a second approach, an auxiliary device is used to monitor the operational state of the two controllers. Upon the auxiliary device detecting the failure of the active controller (e.g., a heartbeat signal of the active controller could be transmitted to the auxiliary device, and failure of the active controller could be indicated by the lack of heartbeats), the auxiliary device instructs the standby controller to become the active controller (i.e., activates the standby controller). The second approach has drawbacks similar to those of the first approach, as there is likewise a lag between the time that the active controller fails and the time that the auxiliary device activates the standby controller. During this time lag, the system is unable to service any requests.
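A minimal sketch of the second approach follows, again for illustration only. The class `AuxiliaryMonitor`, the controller names "A" and "B", and the 5 second timeout are all hypothetical. The sketch further assumes that both controllers report heartbeats to the auxiliary device, so that the device can confirm the standby controller is responsive before activating it; the description above requires this only of the active controller.

```python
class AuxiliaryMonitor:
    """Hypothetical auxiliary device that selects the active controller."""

    def __init__(self, controllers, timeout):
        self.timeout = timeout                       # seconds of silence treated as failure
        self.last_seen = {name: 0.0 for name in controllers}
        self.active = controllers[0]                 # first controller starts as active

    def on_heartbeat(self, name, now):
        # Record the arrival time of a heartbeat from the named controller.
        self.last_seen[name] = now

    def check(self, now):
        # If the active controller has gone silent, activate a responsive standby.
        if now - self.last_seen[self.active] >= self.timeout:
            for name, seen in self.last_seen.items():
                if name != self.active and now - seen < self.timeout:
                    self.active = name               # instruct the standby to take over
                    break
        return self.active


monitor = AuxiliaryMonitor(["A", "B"], timeout=5.0)
for t in range(0, 9):
    if t <= 3:
        monitor.on_heartbeat("A", float(t))          # controller A fails after t = 3
    monitor.on_heartbeat("B", float(t))
    print(t, monitor.check(float(t)))
```

In this run, controller A remains the active controller through t = 7 and controller B is activated at t = 8, five seconds after the last heartbeat from A. No controller services requests during those five seconds, which is the lag described above.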