A storage system is a computer that provides storage service relating to the organization of information on writeable persistent storage devices, such as memories, tapes or disks. The storage system is commonly deployed within a storage area network (SAN) or a network attached storage (NAS) environment. When used within a NAS environment, the storage system may be embodied as a file server including an operating system that implements a file system to logically organize the information as a hierarchical structure of directories and files on, e.g., the disks. Each “on-disk” file may be implemented as a set of data structures, e.g., disk blocks, configured to store information, such as the actual data for the file. A directory, on the other hand, may be implemented as a specially formatted file in which information about other files and directories is stored.
It is advantageous for the services and data provided by a storage system, such as a filer or storage appliance, to be available for access to the greatest degree possible. Accordingly, some computer storage arrangements provide a plurality of storage systems in a cluster, with the property that when a first storage system fails, a second storage system is available to take over and provide the services and the data otherwise provided by the first storage system. Specifically, the second storage system assumes the task of processing and handling any data access requests normally processed by the first storage system. One such example of a cluster configuration is described generally in U.S. patent application Ser. No. 09/933,866 entitled OPERATOR INITIATED GRACEFUL TAKEOVER IN A NODE CLUSTER by Samuel M. Cramer et al., now issued as U.S. Pat. No. 6,920,579 on Jul. 19, 2005, the contents of which are hereby incorporated by reference.
In the event of the failure of a storage system, the partner storage system may initiate a failover routine. This failover routine includes, inter alia, the assumption of data servicing operations directed to the disks formerly managed by the failed storage system. However, in certain situations, a storage system may suffer a temporary error condition that is quickly remedied by a reboot or reinitialization, with little corresponding downtime. Given such a temporary error condition (and reboot), it may not be advantageous for the surviving storage system to initiate a failover routine. For example, if the failover routine would require a longer period of time to execute than the failed storage system would need to correct the error condition, it would be more advantageous to allow the failed storage system to perform the corrective reboot and resume processing data access requests directed to the disks.
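By way of illustration only, the timing comparison described above may be sketched as follows. The function name and both parameters are hypothetical and do not appear in any actual implementation. The sketch also assumes that the surviving storage system has an estimate of each duration available, and it treats equal estimates as a reason to wait, which is an assumption made here for definiteness.

```python
def should_initiate_failover(failover_seconds: float, reboot_seconds: float) -> bool:
    """Decide whether the surviving storage system should take over.

    Hypothetical helper, for illustration only.
    failover_seconds: estimated time for the failover routine to complete.
    reboot_seconds:   estimated time for the failed storage system to
                      correct its error condition by rebooting.
    """
    if failover_seconds < 0 or reboot_seconds < 0:
        raise ValueError("time estimates must be non-negative")
    # Fail over only when doing so is expected to restore service sooner
    # than waiting for the reboot. A tie favors waiting (an assumption).
    return failover_seconds < reboot_seconds


# A 30-second failover beats a 120-second reboot, so take over.
print(should_initiate_failover(30.0, 120.0))
# A 120-second failover loses to a 30-second reboot, so wait.
print(should_initiate_failover(120.0, 30.0))
```

Such a comparison is only as good as its inputs: the surviving storage system can evaluate it meaningfully only if it knows whether, and how far, the failed storage system has progressed in its reboot.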
However, a noted disadvantage of prior clustered failover implementations is that the failed storage system is not designed to transmit its boot status (e.g., its progress in performing a booting or initialization routine) to a cluster partner or otherwise broadcast this information over a network. Because a cluster partner typically needs to know whether its failed partner is booting successfully in order to decide whether to initiate a failover procedure, the surviving storage system must simply “wait and see.” Under such a wait-and-see approach, however, data access operations may be lost while the surviving storage system waits to determine whether the failed storage system is booting properly, especially where the failed storage system is not booting properly.
Thus, a noted disadvantage of the prior art is the lack of a reliable mechanism for communicating the boot status of a failed storage system to its cluster partner while the failed storage system is booting. Without such a mechanism, the surviving storage system must either perform a potentially unnecessary failover procedure or wait to determine whether the booting storage system completes its boot procedure successfully.