A storage server is a special-purpose processing device used to store and retrieve data on behalf of one or more client devices (“clients”), which may access and/or process the data. A storage server can be used, for example, to provide multiple users with access to shared data and/or to back up important data.
A storage server may provide different levels of access to data. For example, a file server is an example of a storage server that provides file-level access to data. A file server operates on behalf of one or more clients to store and manage shared files in a set of mass storage devices, such as magnetic or optical storage based disks or tapes. The mass storage devices may be organized into one or more groupings (physical and/or logical or virtual) of Redundant Array of Inexpensive/Independent Disks (RAID). The data may be organized, managed, and/or accessed as data files. Another example of a storage server may be a device that provides clients with block-level access to stored data, rather than file-level access. The data in such a system may be organized and managed and/or accessed as data blocks, which may include more or less information than a file. Also, a storage server may be able to provide clients with both file-level access and block-level access.
A storage server may have access to multiple mass storage devices, or persistent/non-volatile storage devices, which may be managed based on logical or virtual organization. Data storage across these multiple mass storage devices can be organized into multiple layers of abstraction to provide fault tolerance, as individual disks can (and do) fail. The abstraction layers also allow a logical disk organization, for example, a volume or aggregate, to store larger quantities of data than can fit on a single disk.
For example, a storage server may represent a group of storage devices (e.g., hard disks) as a logical aggregate/grouping of storage devices. The aggregates may be managed to store data in volumes contained within the aggregates. As used herein, volume refers to a logical abstraction of physical storage, combining one or more disks or parts of disks into a single logical storage object. The volumes may in turn be further logically broken down into plexes containing RAID groups. The RAID groups may have multiple disks. While particular terminology is used herein as a reference point to describe particular organizations and/or functions, the terminology is not to be construed as limiting, but rather as used by way of example. Where particular terminology is referred to (e.g., an aggregate, a plex, etc.), these are to be understood as merely examples of data structure abstractions that may be substituted with equivalent or similar data structures that may be referred to by other terms.
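By way of illustration only, the logical organization described above might be modeled as nested data structures. The following is a minimal sketch; the class names, field names, and the `raw_capacity_gb` helper are hypothetical and chosen only to mirror the terminology used above, and the capacity figure ignores any parity or mirroring overhead.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Disk:
    name: str
    capacity_gb: int


@dataclass
class RaidGroup:
    disks: List[Disk] = field(default_factory=list)


@dataclass
class Plex:
    raid_groups: List[RaidGroup] = field(default_factory=list)


@dataclass
class Volume:
    name: str
    plexes: List[Plex] = field(default_factory=list)

    def disks(self) -> List[Disk]:
        # Flatten volume -> plexes -> RAID groups -> disks.
        return [d for p in self.plexes for g in p.raid_groups for d in g.disks]


@dataclass
class Aggregate:
    name: str
    volumes: List[Volume] = field(default_factory=list)


def raw_capacity_gb(volume: Volume) -> int:
    # Raw sum of member disk capacities (assumption: no redundancy overhead
    # is subtracted); it can exceed the capacity of any single disk.
    return sum(d.capacity_gb for d in volume.disks())


def build_example() -> Aggregate:
    # Hypothetical layout: one aggregate, one volume, one plex, one RAID group.
    disks = [Disk(f"d{i}", 500) for i in range(4)]
    volume = Volume("vol0", [Plex([RaidGroup(disks)])])
    return Aggregate("aggr0", [volume])
```

In this sketch, a volume built from four 500 GB disks presents a raw capacity larger than any one member disk, which is the property the abstraction layers are described as providing.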
Tracking of disks in a logical organization and management of the logical organization may be performed with logical association data. The term “disk” is generally used herein as shorthand to refer to a disk drive, including its actual storage medium or media. A disk or other storage device may have a dedicated area to provide a RAID label and/or other metadata that provides the ability to assign and determine which disks are part of which RAID groups, plexes, and aggregates, even as disks are added to and failed out of the aggregates. The process of determining the logical data structure to which a disk belongs may be referred to as “RAID assimilation.”
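As a rough illustration of RAID assimilation, the sketch below groups disks by the contents of their labels. The label layout (a tuple of aggregate, plex, and RAID group names) and the function name `assimilate` are assumptions made for this example; an actual RAID label format is not specified here.

```python
from collections import defaultdict


def assimilate(labels):
    """Group disk names by the (aggregate, plex, RAID group) in each label.

    `labels` maps a disk name to a hypothetical label tuple
    (aggregate, plex, raid_group) read from the disk's dedicated area.
    """
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for disk in sorted(labels):
        aggregate, plex, raid_group = labels[disk]
        tree[aggregate][plex][raid_group].append(disk)
    return tree


# Hypothetical labels for three disks.
example_labels = {
    "d0": ("aggr0", "plex0", "rg0"),
    "d1": ("aggr0", "plex0", "rg0"),
    "d2": ("aggr0", "plex0", "rg1"),
}
```

Because membership is derived from the labels each time, a disk that is added or failed out is reflected simply by its label appearing in, or disappearing from, the input.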
A storage device can develop errors (e.g., mechanical instabilities, medium corruption, etc.) that will hinder reading and/or writing to one or more areas/blocks of the storage device. Such errors often result in slow access times, especially if an access request focuses on bad media patches. Many storage devices have error recovery routines that can enable a distressed device to eventually complete the requested I/O (input/output, e.g., read, write) operation successfully. Performing an error recovery routine may include performing a recalibration routine, an automatic retry attempt, etc., and/or a combination of these. A storage device experiencing errors as described above may be referred to herein as a “distressed” drive, or a “spasming” drive. “Spasm” may be used herein to refer to an error resulting in a delay and/or the associated delay that results in slow data access.
Error recovery is traditionally handled “in-band,” referring to the process of handling the error in the context of servicing the access request (or I/O) that spawned the error. The delay mentioned refers to the delay in servicing (e.g., processing, responding to, performing actions or operations as a result of) the request. Because all error handling is traditionally performed in-band, delays caused by errors are eventually pushed back to the requesting client. Thus, the client traditionally has to wait for the operation to be completed before the request can be acknowledged as complete or successful, if it can be so acknowledged at all. At worst, an error message is passed to the client instead of an acknowledgement of success, or the client request results in an error in the filer or client interface.
Hardware commoditization and the continuously decreasing cost of storage (price per GB) are driving a trend towards using less reliable, higher capacity drives for building storage servers and disk arrays. As deposition densities on drives increase, the effects of bad patches on the media become more apparent. Bad media patches can substantially degrade read/write performance, for example, with increased completion latency, especially if I/O happens to focus on the regions of bad media. When reading/writing blocks within a bad media patch, drives use an in-built error recovery scheme (native to the drive) that may take a relatively long time to complete.
As one example, ATA drives use an in-built recovery that can range in duration from multiple seconds to even minutes. In desktop usage, where the inability to read a block can prove fatal to system operation, such latency may be acceptable. However, in server environments, long I/O operation completion times may be unacceptable. In another example, FC drives typically use a deterministic error recovery, which can prevent the long I/O delays discussed above. However, successive media errors can still result in cumulative queuing delays that can result in application timeouts and/or system hangs if I/Os cannot be completed within an acceptable time period. Traditionally, storage systems have either resorted to failing such drives, which results in a costly drive reconstruction, or simply operated with exposure to the risk of application downtime.
The duration of recovery routines to prevent failure of I/O operations on a spasming drive can range from multiple seconds to minutes, which may be acceptable for certain environments. However, in certain server environments, many data access protocols (e.g., CIFS (Common Internet File System), FC (Fibre Channel), NFS (Network File System)) rely on lease timeouts for maintaining session state, meaning long delays in I/O access time can lead to unwanted connection terminations. Additionally, in file server/filer system configurations, many I/O accesses (e.g., a stripe write) are associated with parallel accesses to other storage devices in a redundant array (protection unit). Traditional systems delay returning the parallel results until all parallel I/O accesses are completed by the individual storage devices of the protection unit. Thus, a single spasming drive can cause a delay for data access from multiple storage devices.
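The effect of waiting on all parallel accesses can be shown with a small sketch: if results are returned only when every device in the protection unit has completed, the latency of the whole operation is the maximum of the per-device latencies. The function names, the latency figures, and the lease timeout value below are illustrative assumptions, not measurements.

```python
def stripe_latency_ms(per_disk_latency_ms):
    # The stripe operation completes only when the slowest device completes.
    return max(per_disk_latency_ms)


def exceeds_lease(per_disk_latency_ms, lease_timeout_ms):
    # True if the operation would outlast a (hypothetical) protocol lease.
    return stripe_latency_ms(per_disk_latency_ms) > lease_timeout_ms


# Hypothetical per-device latencies in milliseconds.
healthy_stripe = [8, 9, 7, 8]
one_spasming_drive = [8, 9, 30000, 8]
```

With these assumed figures, one spasming drive raises the latency of the entire stripe from 9 ms to 30,000 ms, even though the other three devices respond within milliseconds.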
File server/filer configurations typically use redundant drive arrays with mirroring and/or redundancy (e.g., RAID), enabling the systems to regenerate the data on one drive by accessing a subset of drives within a protection unit to which an unresponsive drive belongs. The latency for reconstructing data in this manner should be of the same order as a single drive latency, which is on the order of milliseconds. Assuming that the probability of multiple drives spasming at the same time is very low, reconstructing blocks to service an I/O request should be much faster than accessing a spasming drive. Thus, a successful access that incurs a long delay may have a more deleterious effect on system performance than would be caused by outright failure of the device. Additionally, a subsequent I/O request to the distressed device may trigger the same fault or a related fault, resulting in further client requests being delayed. However, the cost of failing out a device that does not have fatal errors may not be justified.
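One common way such regeneration can work is single-parity redundancy, in which the parity block is the bytewise XOR of the data blocks in a stripe. The sketch below assumes that scheme (the text above refers to RAID generally and does not specify a particular redundancy method); the function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    # Bytewise XOR of equal-length blocks.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_data_blocks, parity_block):
    # The missing block equals the XOR of the parity block with all
    # surviving data blocks, so the unresponsive drive is never read.
    return xor_blocks(list(surviving_data_blocks) + [parity_block])


# Hypothetical stripe of three data blocks plus one parity block.
stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(stripe)
```

Here the block held by the second drive can be recovered from the first and third data blocks and the parity block alone, at the cost of reads to the responsive drives only.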
Besides individual disk errors and resulting access delays experienced by clients attempting to access data stored on a spasming disk, individual disk errors can compound and result in performance problems with the filer. For example, some error recovery attempts in high workload scenarios can result in a cumulative backoff causing a filer operating system to freeze up.