A storage server is a special-purpose processing device used to store and retrieve data on behalf of one or more client devices (“clients”), which may access and/or process the data. A storage server can be used, for example, to provide multiple users with access to shared data and/or to back up important data.
A storage server may provide different levels of access to data. For example, a file server is an example of a storage server that provides file-level access to data. A file server operates on behalf of one or more clients to store and manage shared files in a set of mass storage devices, such as magnetic or optical storage-based disks or tapes. The mass storage devices may be organized into one or more groupings (physical and/or logical or virtual) of a Redundant Array of Inexpensive/Independent Disks (RAID). The data may be organized, managed, and/or accessed as data files. Another example of a storage server may be a device that provides clients with block-level access to stored data, rather than file-level access. The data in such a system may be organized, managed, and/or accessed as data blocks, which may include more or less information than a file. Also, a storage server may be able to provide clients with both file-level access and block-level access.
A storage server may have access to multiple mass storage devices, or persistent/non-volatile storage devices, which may be managed based on logical or virtual organization. Data storage across these multiple mass storage devices can be organized into multiple layers of abstraction to provide fault tolerance, as individual disks can (and do) fail. The abstraction layers also allow a volume or aggregate to store larger quantities of data than can fit on a single disk.
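By way of illustration only, the fault tolerance mentioned above can be sketched with single-parity XOR, one common redundancy technique. The passage above does not specify a particular parity scheme, so the choice of XOR, and the names xor_blocks, make_parity, and reconstruct_missing, are assumptions made for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("blocks must have equal length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute a parity block over the data blocks of one stripe."""
    return xor_blocks(data_blocks)


def reconstruct_missing(surviving_blocks, parity):
    """Recover the single missing data block from the survivors and parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Hypothetical stripe spread over three data disks plus one parity disk.
stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = make_parity(stripe)
# Simulate losing the second disk and rebuilding its block.
recovered = reconstruct_missing([stripe[0], stripe[2]], parity)
```

In this sketch, any one block of the stripe can be recomputed from the remaining blocks, which is the property that allows a group of disks to tolerate the failure of an individual disk.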
For example, a storage server may represent a group of storage devices (e.g., hard disks) as a logical grouping of storage devices. In one embodiment, a highest-level logical grouping abstraction (e.g., data structure container) is an aggregate, which may be a container for other, lower-level logical groupings. The aggregates may be managed to store data in volumes contained within the aggregates. As used herein, volume refers to a logical abstraction of physical storage, combining one or more disks or parts of disks into a single logical storage object. The volumes may in turn be further logically broken down into plexes containing RAID groups. The RAID groups may have multiple disks. While particular terminology is used herein as a reference point to describe particular organizations and/or functions, the terminology shall not be construed as limiting, but rather as provided by way of example. Where particular terminology is referred to (e.g., an aggregate, a plex, etc.), these are to be understood as merely examples of data structure abstractions that may be substituted with equivalent or similar data structures that may be referred to by other terms.
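By way of illustration only, the logical hierarchy described above (aggregate, volume, plex, RAID group, disk) might be modeled as nested containers. The class names, field names, and the all_disks helper below are hypothetical and are chosen only to mirror the terminology of the preceding paragraph.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Disk:
    disk_id: str


@dataclass
class RaidGroup:
    name: str
    disks: List[Disk] = field(default_factory=list)


@dataclass
class Plex:
    name: str
    raid_groups: List[RaidGroup] = field(default_factory=list)


@dataclass
class Volume:
    name: str
    plexes: List[Plex] = field(default_factory=list)


@dataclass
class Aggregate:
    name: str
    volumes: List[Volume] = field(default_factory=list)

    def all_disks(self) -> Iterator[Disk]:
        """Walk the hierarchy top-down and yield every member disk."""
        for volume in self.volumes:
            for plex in volume.plexes:
                for group in plex.raid_groups:
                    yield from group.disks


# Hypothetical example: one aggregate, one volume, one plex, two RAID groups.
rg0 = RaidGroup("rg0", [Disk("d0"), Disk("d1"), Disk("d2")])
rg1 = RaidGroup("rg1", [Disk("d3"), Disk("d4")])
aggr = Aggregate("aggr0", [Volume("vol0", [Plex("plex0", [rg0, rg1])])])
disk_ids = [d.disk_id for d in aggr.all_disks()]
```

Other organizations with more or fewer levels would serve equally well; the nesting shown here is only one possible reading of the example terminology.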
Tracking and management of the logical organization may require the management of logical association data. A disk or other storage device may have a dedicated area that holds a RAID label and/or other metadata, which provides the ability to assign and determine which disks are part of which RAID groups, plexes, and aggregates, even as disks are added to and failed out of the aggregates. The process of determining the logical data structure to which a disk belongs may be referred to as “RAID assimilation.”
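By way of illustration only, RAID assimilation as described above can be sketched as grouping disks by the membership information read from their labels. The RaidLabel fields and the assimilate function are assumptions made for this example; an actual on-disk label format is not described here and may differ.

```python
from typing import Dict, Iterable, List, NamedTuple


class RaidLabel(NamedTuple):
    """Hypothetical label contents read from a disk's dedicated area."""
    disk_id: str
    aggregate: str
    plex: str
    raid_group: str


def assimilate(labels: Iterable[RaidLabel]) -> Dict[str, Dict[str, Dict[str, List[str]]]]:
    """Build aggregate -> plex -> RAID group -> [disk ids] from disk labels."""
    tree: Dict[str, Dict[str, Dict[str, List[str]]]] = {}
    for label in labels:
        plexes = tree.setdefault(label.aggregate, {})
        groups = plexes.setdefault(label.plex, {})
        groups.setdefault(label.raid_group, []).append(label.disk_id)
    return tree


# Hypothetical labels discovered while scanning attached disks.
labels = [
    RaidLabel("d0", "aggr0", "plex0", "rg0"),
    RaidLabel("d1", "aggr0", "plex0", "rg0"),
    RaidLabel("d2", "aggr0", "plex0", "rg1"),
    RaidLabel("d3", "aggr1", "plex0", "rg0"),
]
layout = assimilate(labels)
```

Because membership is derived from the labels rather than from a fixed configuration, a disk that is added or removed changes the resulting layout simply by changing the set of labels that are scanned.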
The logical organization of the disks and the management of the disks have traditionally assumed that the disks are online and available/viewable/accessible to a storage access interface and/or layer in the storage server. However, many events, both controlled and spontaneous, can result in a temporary service outage to a drive. A controlled event may be an event that has a planned or expected cause, for example, a firmware download and install on the disk, replacement of a storage component, topology reconfiguration of the disk storage subsystem, etc. Spontaneous events may be those that render a disk unresponsive without any expectation on the part of the system, for example, a temporary disk failure, transient loss of connectivity/access to a disk, etc., which may occur without warning or planning. The drive can become unresponsive to I/O commands during these or similar events. Traditionally, the system may have dealt with an unresponsive disk by removing the disk from the virtual system and reconstructing the data from the disk on a spare, for example. However, the events that render the disk unresponsive may often have only a short duration (e.g., on the order of minutes), which may be much shorter than the time required to rebuild the complete data on the drive. Despite being of relatively short duration as compared to data reconstruction, these durations may be long enough to cause expensive application downtime. Traditionally, such events have either been handled by removing the disk from a RAID group, resulting in a complete disk/data reconstruction, or by scheduling planned downtime, typically during off-hours or at other times inconvenient for service personnel.
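By way of illustration only, the comparison above between a short outage and a full reconstruction can be made concrete with simple arithmetic. The disk capacity, rebuild rate, and outage duration below are assumed values chosen for the example and are not measurements; the function names are likewise hypothetical.

```python
def rebuild_seconds(capacity_bytes: int, rebuild_bytes_per_second: float) -> float:
    """Estimate the time to rewrite a whole disk at a steady rebuild rate."""
    if rebuild_bytes_per_second <= 0:
        raise ValueError("rebuild rate must be positive")
    return capacity_bytes / rebuild_bytes_per_second


def outage_shorter_than_rebuild(outage_seconds: float,
                                capacity_bytes: int,
                                rebuild_bytes_per_second: float) -> bool:
    """Return True when waiting out the outage beats a full reconstruction."""
    return outage_seconds < rebuild_seconds(capacity_bytes, rebuild_bytes_per_second)


# Assumed figures: a 1 TB disk, a 50 MB/s rebuild rate, a 5 minute outage.
capacity = 10**12
rate = 50 * 10**6
estimate = rebuild_seconds(capacity, rate)  # 20000.0 seconds, roughly 5.6 hours
worth_waiting = outage_shorter_than_rebuild(5 * 60, capacity, rate)
```

Under these assumed figures, a full reconstruction would take on the order of hours, while the outage itself lasts only minutes.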