The creation and storage of digitized data has proliferated in recent years. Accordingly, various storage systems that facilitate efficient and cost effective storage of large amounts of digital data are common today. For example, a cluster network environment of nodes may be implemented as a data storage system to facilitate the creation, storage, retrieval, and/or processing of digital data. Such a data storage system may be implemented using a variety of storage architectures, such as a network-attached storage (NAS) environment, a storage area network (SAN), a direct-attached storage environment, and combinations thereof. The foregoing data storage systems may comprise one or more data storage devices (e.g., disk drives, solid state drives (SSD), etc.) configured to store digital data within data volumes. For example, various data storage devices, or portions thereof, may be combined to form an aggregate, whereby such aggregates may provide storage space for volumes. In operation, various cluster and node components of the cluster network environment interact to provide storage services to clients using the aforementioned volumes.
Data storage systems often implement configurations adapted to facilitate robust data storage operation. For example, a high availability (HA) pair configuration, wherein nodes of the data storage system are paired (such pairing may include N-way pairing) to provide continued access to data store volumes in the event of a failure or malfunction of a node, in order to maintain availability of the stored data. In operation, a node of a HA pair takes over for a failed partner node of the HA pair by mounting the volumes belonging to that partner node. Accordingly, although data throughput may be impacted due to the takeover node providing access to its own volumes and those of the failed node, the volumes of the failed node and the data stored thereon nevertheless remain available to storage system clients.
A volume mount (i.e., access information regarding the data storage device(s) storing the volume and configure the filesystem so as to place the volume in a position to operate), as implemented by existing HA pair implementations, requires serial completion of a plurality of steps which require appreciable time to complete (e.g., on the order of 5-10 seconds), making the volumes of a HA pair failed node unavailable for an appreciable period of time. In particular, the existing volume mount process requires reading several random storage device blocks (e.g., blocks containing the volume information, such as may contain file system information, block allocation map information, directory information, etc.) serially and constructing various in-core data structures. As mounting takes place before a computer can use a data storage device (i.e., mounting makes the data storage device accessible through the computer's filesystem), the foregoing time to mount a takeover volume results in appreciable delay in the availability of that volume.
In takeover of a failed node's volumes, the HA pair takeover node of existing implementations must first determine that the HA pair partner node is indeed a failed node. This is because if the volume mount techniques utilized by these HA pair implementations are initiated by a takeover node while the partner node continues to operate and write to its volumes the data stored by those volumes would be corrupted (e.g., the blocks which are required to be read to mount the volume may be altered by the partner node while the putative takeover node is mounting to volume resulting in corrupted data if the volume were to be accessed with the out-of-date blocks). Once the HA pair takeover node has determined that its HA pair partner node is a failed node, the takeover node then begins the process of bringing the data storage device aggregate(s) of the HA pair failed node online (referred to as “onlining”). Until the aggregate has been onlined, the cache which maps the physical volume block numbers (PVBNs) used by the volumes to the disk block numbers (DBNs) of the storage devices is not available. Thus, the volumes of the failed node cannot be mounted until the aggregates containing those volumes has been onlined. Once the aggregate has been onlined, the PVBNs for the blocks which are required to be read to mount a volume may be utilized in pre-fetching those blocks and mounting the respective volumes.
To summarize the foregoing volume mount process implemented by a HA pair takeover node, the takeover node must determine that a partner node has failed, then the takeover node must online the aggregates of the failed node, and only then can the takeover node use the PVBNs of the aggregates to mount the volumes. As can be appreciated from the foregoing, a volume mount process is I/O bound and a relatively slow process making the takeover process lengthy. Such a lengthy volume mount process is not desirable or acceptable for some applications, such as tier-1 applications (e.g., mission critical applications requiring high levels of reliability or quality of service).
If all the blocks which are required to be read to mount a volume are already present in-core, then a volume mount process can be accomplished faster because the disk I/O bottleneck is avoided. Accordingly, techniques such as an “adaptive playlist,” as shown and described in U.S. Pat. No. 7,945,724 entitled “Non-Volatile Solid-State Memory Based Adaptive Playlist for Storage System Initialization Operations,” the disclosure of which is hereby incorporated herein by reference, provide pre-fetching of all the blocks required to mount volume. For example, an adaptive playlist technique maintains a per-volume metafile which contains list of PVBNs required to be read to mount the volumes. Before mounting a volume all the PVBNs are pre-fetched in memory doing parallel I/O to disks, thereby making the volume mount operation faster. The volume mount, however, needs to wait for all the blocks to be pre-fetched, which does not scale if there are hundreds of volumes to mount during takeover.
Another technique for providing all the blocks which are required to be read to mount a volume in-core implements mirroring. A mirror approach operates to mirror the blocks required to mount the volumes to the HA pair partner node periodically (e.g., at each consistency point (CP), wherein consistency points are checkpoints or a snapshot in time in the write cycle of the filesystem) so that the information of all of the blocks required to mount partner volumes is already present in the memory of the takeover node when a failure of a partner node occurs. This approach, however, is quite costly in terms of processing resources and communication bandwidth. In particular, a mirror approach may be too costly due to its implementation increasing CP path length and interconnect traffic to mirror blocks to a HA pair partner node. The mirror approach does not scale well because, as the number of volumes increases, more data needs to be mirrored consuming both CPU and memory at the HA pair partner nodes. Moreover, such mirror techniques do not work for N-way HA pairings where any node can takeover a sub-set of volumes belonging to the failed node. In a worst case scenario, if the node crashes in the middle of mirroring then volume mount during takeover would need to access the data from the disk blocks rather than the mirrored data in the takeover node memory (i.e., the partial mirror data would be corrupt).