In a computer-driven business world where any downtime can result in lost profits and even loss of the business itself, data availability has become a primary motivation for designing fault-tolerant storage servers in which every component is protected against failure using redundancy. This includes building storage servers not just with dual cores, a redundant array of inexpensive disks (“RAID”), redundant power supplies, and redundant network and other ports, but also with dual controllers (i.e., dual redundant controllers). Dual controllers imply dual, redundant memory and processor resources. Initiatives such as the storage bridge bay (“SBB”) specification, which provides for dual or multiple canisters, each housing a controller, within a single chassis, have accelerated the move towards these architectures. A widely held, but difficult-to-achieve, standard of availability for a system or product is known as “five 9s” (i.e., 99.999 percent) availability.
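To put the “five 9s” figure in concrete terms, the following minimal sketch converts an availability percentage into a yearly downtime budget. It assumes a 365-day year, and the function name `downtime_budget_minutes` is chosen here for illustration only. Under that assumption, 99.999 percent availability leaves roughly 5.26 minutes of total downtime per year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the maximum downtime per year, in minutes, for an availability target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


# Compare three, four, and five 9s of availability.
for nines, target in [(3, 99.9), (4, 99.99), (5, 99.999)]:
    budget = downtime_budget_minutes(target)
    print(f"{nines} nines ({target}%): {budget:.2f} minutes of downtime per year")
```

Because the budget covers every outage in the year combined, each individual failover has to complete in a small fraction of it.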
Even though it is difficult to achieve “five 9s” availability in storage servers, it can be nearly achieved using active-passive or active-active dual redundant controllers with minimal failover time. Upon failure of any component in one controller (e.g., the primary controller), the other controller (e.g., the secondary controller) is designed to take over the input/output (“I/O”) operations without interruption. The failover mechanism is therefore designed to complete in as little time as possible. To achieve this goal, all of the underlying hardware devices and software modules (or layers) are designed to be ready and to perform seamlessly with minimal delay. Additionally, on failover, the secondary controller must be able to recover data and/or metadata quickly without any loss or corruption.
Conventional failover operations involve various steps that are performed one after the other (i.e., sequentially). These steps include making the disk subsystem or RAID subsystem ready for the secondary controller, mounting the block devices or file system, exposing the block devices or file system to the end user application, and ensuring that the network connectivity to the storage device is fully functional. There are dependencies between the various layers (or modules) in the storage server software stack that execute the steps above. Because the steps are performed sequentially, the failover time can be quite lengthy depending upon the number of disks, number of block devices, number of network ports, etc. Additionally, if all of the layers in the storage server software stack are not ready within the configured timeout periods, the I/O operations issued by the client application will fail. Efficient, dependable, and highly-available storage servers are therefore designed in consideration of the dependencies between layers of the storage server software stack, such that data availability to the client is not disrupted and the timeout values need not be increased.
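The sequential behavior described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not an actual storage server implementation: the function names, the per-disk, per-device, and per-port costs, and the 30-second client timeout are all hypothetical values chosen for illustration. It shows how the total failover time grows with the configuration size and how a large configuration can exceed the client timeout.

```python
from typing import List, Tuple

# (layer name, simulated seconds for that layer to become ready)
Step = Tuple[str, float]


def example_steps(num_disks: int, num_block_devices: int, num_ports: int) -> List[Step]:
    """Build the failover steps in dependency order (per-unit costs are invented)."""
    return [
        ("RAID subsystem ready", 0.5 * num_disks),
        ("block devices / file system mounted", 0.25 * num_block_devices),
        ("block devices / file system exposed", 0.25 * num_block_devices),
        ("network connectivity verified", 1.0 * num_ports),
    ]


def sequential_failover(steps: List[Step], client_timeout_s: float) -> float:
    """Run the steps one after the other and return the total failover time.

    Raises TimeoutError if the stack is not ready before the client timeout.
    """
    elapsed = 0.0
    for name, duration_s in steps:
        # Each layer starts only after the layer it depends on is ready.
        elapsed += duration_s
        if elapsed > client_timeout_s:
            raise TimeoutError(
                f"client I/O timed out during '{name}' "
                f"({elapsed:.1f}s > {client_timeout_s:.1f}s)"
            )
    return elapsed


CLIENT_TIMEOUT_S = 30.0  # hypothetical client-side I/O timeout

# A small configuration completes within the timeout.
print(sequential_failover(example_steps(4, 4, 2), CLIENT_TIMEOUT_S))

# A larger configuration exceeds the timeout partway through the sequence.
try:
    sequential_failover(example_steps(48, 32, 4), CLIENT_TIMEOUT_S)
except TimeoutError as err:
    print(err)
```

In this simulation the only ways to keep the large configuration within the timeout are to shorten the individual steps or to raise the timeout, which is the trade-off described in the paragraph above.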