Various forms of network data storage systems are known today, including network attached storage (NAS), storage area networks (SANs), and others. Network storage systems are commonly used for a variety of purposes, such as providing multiple users with access to shared data, backing up critical data (e.g., by data mirroring), etc. Conventional network storage systems include file systems that contain data sets, such as volumes and files (also referred to as containers, data storage units, logical units of data, logical unit numbers (LUNs), or virtual disks). A LUN may represent data on a physical disk or data of a virtual disk. The virtual disk or LUN may be physically represented as a regular file in the active file system, which in turn is treated specially by the operating system.
In conventional operating systems of network storage systems, LUNs are represented in a storage server as files contained in one or more volumes. To be externally visible to clients, LUNs must first go through an initialization process that prepares the LUNs for data access by the clients. This process may also be referred to as on-lining LUNs or a LUN on-lining procedure. The initialization process is performed whenever the storage server is booting, whether in a reboot or a failover context. A LUN is traditionally associated with computer hardware, such as a disk drive. Within this disk drive is a disk controller that runs firmware. The disk controller is an embedded system that controls all aspects of the drive's functionality, including bringing the drive to an “online” state when it is powered up and handling I/O requests from clients. Representing LUNs as files decouples the LUN from the actual computer hardware, such as a disk drive. This allows the storage server to support an arbitrarily large number of LUNs even when the system has far fewer disk drives. This ability to support a large number of LUNs exposes the limitations of the conventional methods of on-lining LUNs.
For the entire duration that the storage server is in the rebooting or failover state, the storage server is unable to serve data. The time that the storage server is unable to serve data (e.g., is off-line) is called outage time. Since initializing (e.g., on-lining) the LUNs is one component of the entire boot or failover process, the amount of time the storage server takes to complete the initialization of the LUNs directly contributes to the outage time of the storage server. Whenever the outage time exceeds input/output (I/O) timeout values established on the client side, the excessive outage time can lead to application failure, which can negatively impact any service level agreements tied to system availability.
A conventional method of initializing (e.g., on-lining) the LUNs iterates over each volume serially and, for each volume, begins bringing each LUN on-line in turn. Although part of the work for each LUN is issued asynchronously, the LUNs are in effect brought on-line serially, because a single administrative thread processes them one at a time. The operations performed for each LUN may include loading and reading information used by the file system layer, such as mappings of the LUNs and their corresponding file handles, by sending an asynchronous backdoor message to the file system layer. The backdoor message is a message type that provides an interface for subsystems in the operating system (e.g., Data ONTAP® software, available from Network Appliance of Sunnyvale, Calif.) to interact with the file system layer of the operating system.
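The conventional flow described above can be illustrated with a minimal sketch. It is not taken from any actual operating system: the class FileSystemLayer, its methods, the "handle:" string format, and the function online_luns_serially are all hypothetical names chosen for illustration, and the sketch assumes that each queued message is handled before the next LUN is started so that the example stays deterministic.

```python
from collections import deque


class FileSystemLayer:
    """Hypothetical stand-in for the file system layer (illustration only)."""

    def __init__(self):
        self.inbox = deque()  # queued backdoor messages, one per LUN
        self.loaded = {}      # LUN path -> file handle (made-up format)

    def send_backdoor_message(self, lun_path):
        # Asynchronous from the caller's point of view: the request is
        # only queued here, not executed.
        self.inbox.append(lun_path)

    def process_pending(self):
        # Handle every queued message by "loading" the LUN's file handle.
        while self.inbox:
            lun_path = self.inbox.popleft()
            self.loaded[lun_path] = f"handle:{lun_path}"


def online_luns_serially(volumes, fs_layer):
    """Model a single administrative thread on-lining LUNs one at a time.

    volumes maps a volume name to a list of LUN names. Returns the order
    in which the LUNs were brought on-line.
    """
    order = []
    for volume_name, lun_names in volumes.items():  # volumes, serially
        for lun_name in lun_names:                  # LUNs, serially
            lun_path = f"{volume_name}/{lun_name}"
            fs_layer.send_backdoor_message(lun_path)
            # Assumption of this sketch: the message is drained before the
            # next LUN starts, so no two LUNs are ever in flight together.
            fs_layer.process_pending()
            order.append(lun_path)
    return order


volumes = {"vol0": ["lun0", "lun1"], "vol1": ["lun0"]}
fs_layer = FileSystemLayer()
order = online_luns_serially(volumes, fs_layer)
```

Because a single loop drives every LUN, the total on-lining time in this model is the sum of the per-LUN times, regardless of how many processors or how much I/O bandwidth are idle.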
This conventional method leaves a substantial amount of available system resources (e.g., processing, memory, and input/output (I/O) capabilities) unused. In addition, the asynchronous step is severely hampered by the serialization in the processing of each LUN, because the serial steps of the on-lining process are far more time consuming than the asynchronous steps. All of the processing may therefore queue up behind the single administrative thread that processes each LUN of the volume and that can on-line only one LUN at a time.
In a data storage system with many volumes and LUNs, the conventional initialization process does not scale, and the serialization increases the overall latency to initialize (e.g., on-line) the LUNs. The increased latency to initialize the LUNs in turn increases the outage time seen by clients attempting to perform I/O operations. The clients' I/O operations may time out due to the excessive outage time on the storage server, which can ultimately lead to application failure.
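The scaling problem can be made concrete with a simple back-of-the-envelope model. The sketch below assumes that every LUN takes the same fixed time to on-line and that the remaining boot or failover work takes a fixed time as well; the function names and all numeric values are hypothetical and purely illustrative, not measurements of any real system.

```python
def serial_online_time_ms(num_volumes, luns_per_volume, ms_per_lun):
    """Total on-lining time when every LUN is processed one after another."""
    return num_volumes * luns_per_volume * ms_per_lun


def outage_exceeds_timeout(outage_ms, client_io_timeout_ms):
    """True if clients would see their I/O requests time out."""
    return outage_ms > client_io_timeout_ms


# Illustrative, made-up figures (integer milliseconds).
OTHER_BOOT_WORK_MS = 20_000    # boot/failover work other than LUN on-lining
CLIENT_IO_TIMEOUT_MS = 60_000  # I/O timeout configured on the client side

small_outage = OTHER_BOOT_WORK_MS + serial_online_time_ms(5, 40, 50)
large_outage = OTHER_BOOT_WORK_MS + serial_online_time_ms(50, 40, 50)

print(small_outage, outage_exceeds_timeout(small_outage, CLIENT_IO_TIMEOUT_MS))
print(large_outage, outage_exceeds_timeout(large_outage, CLIENT_IO_TIMEOUT_MS))
```

Under these assumed figures, 200 LUNs add 10 seconds to the outage and stay within the client timeout, while 2,000 LUNs add 100 seconds and exceed it. In this model the on-lining component grows linearly with the number of LUNs.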