A file server is a computer that provides file service relating to the organization of information on storage devices, such as disk drives (“disks”). The file server or filer includes a storage operating system that implements a file system to logically organize the information on disk as a hierarchical structure of directories and files. Each “on-disk” file may be implemented as a set of data structures, e.g., disk blocks, configured to store information. A directory, on the other hand, may be implemented as a specially formatted file in which information about other files and directories is stored.
A type of file system is a write-anywhere file system that does not overwrite data on disks. If a data block on disk is retrieved (read) from disk into memory and “dirtied” with new data, the data block is stored (written) to a new location on disk to thereby optimize write performance. A particular example of a write-anywhere file system that is configured to operate on a filer is the Write Anywhere File Layout (WAFL™) file system available from Network Appliance, Inc. of Sunnyvale, Calif. The WAFL file system is implemented within a microkernel as part of the overall protocol stack of the filer and associated disk storage. This microkernel is supplied as part of Network Appliance's Data ONTAP™ software, residing on the filer, that processes file-service requests from network-attached clients.
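By way of illustration only, the write-anywhere behavior described above can be sketched as follows. This is a toy model under stated assumptions and is not the WAFL implementation: the class `WriteAnywhereStore`, its free list and its `block_map` are hypothetical names introduced for the example, and reclaiming the old location is out of scope.

```python
class WriteAnywhereStore:
    """Toy block store that never overwrites a block in place (hypothetical)."""

    def __init__(self, num_blocks):
        self.disk = [None] * num_blocks       # physical block contents
        self.free = list(range(num_blocks))   # unused physical locations
        self.block_map = {}                   # logical block id -> physical location

    def write(self, logical_id, data):
        """Store data at a fresh location and repoint the logical block to it."""
        if not self.free:
            raise RuntimeError("no free blocks left")
        new_loc = self.free.pop(0)            # always a new location, never the old one
        self.disk[new_loc] = data
        old_loc = self.block_map.get(logical_id)
        self.block_map[logical_id] = new_loc
        return old_loc, new_loc

    def read(self, logical_id):
        """Return the current contents of a logical block."""
        return self.disk[self.block_map[logical_id]]


store = WriteAnywhereStore(num_blocks=4)
store.write("a", b"v1")
old_loc, new_loc = store.write("a", b"v2")    # the "dirtied" block goes elsewhere
```

In this sketch the second write lands at a different physical location, and the earlier contents remain untouched at the old location.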
As used herein, the term “storage operating system” generally refers to the computer-executable code operable on a storage system that manages data access and may, in the case of filers, implement file system semantics. In this sense, Data ONTAP software is an example of such a storage operating system implemented as a microkernel. The storage operating system can also be implemented as an application program operating over a general-purpose operating system, such as UNIX® or Windows NT®, or as a general-purpose operating system with configurable functionality, which is configured for storage applications as described herein.
The storage devices in a file server environment are typically disk drives organized as a disk array, wherein each disk is a self-contained rotating magnetic media storage device. A disk is typically a collection of platters, rotatable on a spindle, with each platter surface divided into concentric tracks, and each track divided into sectors. The sector is the smallest unit that can be individually accessed for input/output (I/O) operations, e.g., read or written. The term disk in this context is synonymous with a hard disk drive (HDD), a direct access storage device (DASD) or a logical unit number (lun) in a storage device. Unless the context indicates otherwise, the term “disk” as used herein is intended to embrace not only magnetic storage devices, but also optical, solid state and other types of storage devices. The term “sector” as used herein is intended to embrace the smallest unit of storage on the storage media that can be individually read or written, and may also be generally referred to by other names (e.g. block) depending on the media type. For clarity, it should be noted that storage operating systems may manage blocks as the smallest unit of storage, for example, each capable of storing 4 kilobytes of data, while the disk itself manages sectors, for example, each capable of storing 512 bytes or 520 bytes of data, depending on the type of drive. The storage operating system maintains a map of data blocks to disk sectors.
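As a minimal sketch of the block-to-sector relationship noted above, the example below uses the 4-kilobyte block and 512-byte sector sizes given in the text. The contiguous, linear layout and the function name `sectors_for_block` are assumptions made for illustration; an actual storage operating system may maintain its map differently, and the sketch only handles block sizes that are a whole multiple of the sector size.

```python
BLOCK_SIZE = 4096    # bytes per storage operating system block (example from the text)
SECTOR_SIZE = 512    # bytes of data per disk sector (example from the text)


def sectors_for_block(block_number, block_size=BLOCK_SIZE, sector_size=SECTOR_SIZE):
    """Return the range of sector numbers backing one block (assumed linear layout)."""
    if block_size % sector_size != 0:
        raise ValueError("block size must be a whole multiple of the sector size")
    per_block = block_size // sector_size     # 8 sectors per block with the defaults
    first = block_number * per_block
    return range(first, first + per_block)


sectors = list(sectors_for_block(3))          # sectors backing block 3
```

With these example sizes, each block spans eight sectors, so a single bad sector affects exactly one block.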
The storage operating system typically organizes data storage as one or more storage “volumes” that comprise physical storage disks, defining an overall logical arrangement of storage space. The disks within a volume are typically organized as one or more groups of Redundant Array of Independent (or Inexpensive) Disks (RAID). RAID implementations enhance the reliability/integrity of data storage through the redundant writing of data “stripes” across blocks of a given number of physical disks in the RAID group, and the appropriate caching of parity information with respect to the striped data. In the example of a WAFL file system, a RAID 4 implementation is advantageously employed. This implementation specifically entails the striping of data across a group of disks, and separate parity caching within a selected disk of the RAID group. As described herein, a volume typically comprises at least one data disk and one associated parity disk (or possibly data/parity partitions in a single disk) arranged according to a RAID 4, or equivalent high-reliability, implementation.
In the operation of a disk array, it is fairly common that a disk will fail. A goal of a high performance storage system is to make the mean time to data loss (MTTDL) as long as possible, preferably much longer than the expected service life of the storage system. Data can be lost when one or more storage devices fail, making it impossible to recover data from the device. Typical schemes employed by storage systems to avoid loss of data include mirroring, backup and parity protection. Mirroring is an expensive solution in terms of consumption of storage resources, such as hard disk drives. Backup does not protect recently modified data. Parity schemes as used in RAID systems are common because they provide a redundant encoding of the data that allows for data recovery, typically, despite a failure of one of the disks in the array, at an overhead cost of just one additional disk in each array of the system.
Specifically, the redundant information provided by parity protection is computed as the exclusive-OR (XOR), i.e., the sum over one-bit fields, of the data on all disks. As referenced above, the disks are typically divided into parity groups, each of which comprises one or more data disks and a parity disk. The disk space is divided into stripes, with each stripe containing one block from each disk. Typically, the blocks of a stripe are at the same location on each disk in the parity group. Within a stripe, all but one block are data blocks and one block is a parity block, computed by the XOR of all the data.
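The parity computation described above can be illustrated with a short sketch. The helper name `xor_blocks` and the two-byte sample blocks are assumptions chosen for the example; real stripes use full-size blocks.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    blocks = list(blocks)
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks in a stripe must be the same size")
    # zip(*blocks) yields, for each byte position, the bytes from every block.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# One stripe: three data blocks, one from each data disk (sample values).
data_blocks = [b"\x0f\x01", b"\xf0\x02", b"\xff\x04"]
parity_block = xor_blocks(data_blocks)
```

Because XOR is its own inverse, the XOR of all blocks in a stripe, parity included, is all zeros; this property is what makes single-block recovery possible.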
If the parity blocks are all stored on one disk, thereby providing a single disk that contains all (and only) parity information, a RAID-4 implementation is provided. If the parity blocks are contained within different disks in each stripe, usually in a rotating pattern, then the implementation is RAID-5. If one disk fails in the parity group, the contents of that disk can be reconstructed on a second “spare” disk or disks by adding all the contents of the remaining data blocks and subtracting the result from the parity block. Since two's complement addition and subtraction over one-bit fields are both equivalent to XOR operations, this reconstruction consists of the XOR of all the surviving data and parity blocks. Similarly, if the parity disk is lost, it can be recomputed in the same way from the surviving data.
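The reconstruction just described can be sketched as follows, assuming a stripe is represented as a list of equally sized blocks with the parity block stored last. The function names and sample values are hypothetical.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of a non-empty list of equally sized blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def reconstruct_missing(stripe, missing_index):
    """Rebuild the block at missing_index from all surviving blocks of the stripe."""
    survivors = [blk for i, blk in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors)


data_blocks = [b"\x12\x34", b"\xab\xcd", b"\x00\xff"]
stripe = data_blocks + [xor_blocks(data_blocks)]   # parity stored last (assumption)

rebuilt_data = reconstruct_missing(stripe, 1)      # a data disk has failed
rebuilt_parity = reconstruct_missing(stripe, 3)    # the parity disk has failed
```

The same routine serves both cases: a lost data block and a lost parity block are each the XOR of everything that survives.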
As noted above, typically RAID implementations permit data recovery through reconstruction of the data from the remaining disks of an array following the failure of a single disk in the array. In the event of a second disk failure in the array prior to reconstruction of the data from the first failure, RAID systems typically cannot recover the data. This is called a “double-disk panic.” In such an event, the system would have to recover the data from a mirror or backup, if available.
Far more likely than a second disk drive failing in a RAID group (before reconstruction has been completed for a previous disk failure) is the possibility that there may be an unknown bad sector (media error) on an otherwise intact disk. For example, a media error can result from a flaw in the surface of the magnetic disk, a condition often caused by a head crash or misalignment (e.g., due to overheating). A disk typically detects a media error when it attempts to respond to a disk access request from the storage operating system, e.g., during a read operation. In the event the read operation fails, the disk will normally attempt to recover the data from the sector involved using internal (to the disk) error recovery mechanisms. These can include retrying the read operation pursuant to a predetermined retry algorithm, repositioning the read/write head, and error detection and correction (EDC) algorithms (also referred to sometimes as error correction code (ECC)). Unfortunately, such internal error recovery mechanisms typically adversely impact disk read performance. If such internal error recovery mechanisms succeed in enabling the disk to respond successfully to the read request, a condition sometimes called “self-recovery,” the error is termed a “recovered error.” On the other hand, if such internal error recovery mechanisms fail to enable the drive to respond successfully to the read request, the error is termed a “non-recoverable error.” Non-recoverable errors are typically noted by the storage operating system, which may then resort to RAID parity in an attempt to recalculate the lost data. However, if a bad block is encountered while the RAID group is in degraded mode (after a disk failure but before reconstruction has completed), then that block's data cannot be recovered by the filer without the aid of a backup or other external error recovery mechanism, if available.
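The error terminology above can be summarized in a small sketch. It models a read as a sequence of attempts, the first being the initial read and the rest being the disk's internal recovery attempts; this representation, and the names `ReadOutcome`, `classify_read` and `data_obtainable`, are assumptions made for illustration and do not correspond to any actual drive or operating system interface.

```python
from enum import Enum


class ReadOutcome(Enum):
    OK = "ok"
    RECOVERED = "recovered error"
    NON_RECOVERABLE = "non-recoverable error"


def classify_read(attempt_results):
    """Classify a read from a list of booleans: [initial read, recovery attempts...]."""
    results = list(attempt_results)
    if results and results[0]:
        return ReadOutcome.OK                  # first attempt succeeded
    if any(results[1:]):
        return ReadOutcome.RECOVERED           # self-recovery succeeded
    return ReadOutcome.NON_RECOVERABLE         # every internal mechanism failed


def data_obtainable(outcome, group_degraded):
    """With single parity, a non-recoverable sector can be recalculated
    only while the RAID group is not in degraded mode."""
    if outcome is not ReadOutcome.NON_RECOVERABLE:
        return True
    return not group_degraded


outcome = classify_read([False, False, True])  # failed, retried, then succeeded
```

The last function captures the exposure discussed above: a non-recoverable error in a degraded group leaves no parity-based path to the data.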
To protect against this scenario, filers routinely verify all data stored in the file system using RAID “scrubbing.” The scrubbing operation may be scheduled to occur at regular intervals (for example, once per week, early on Sunday morning). However, automatic scrubbing is optional and can be suppressed. During this process, the filer self-generates I/O operations to read all data blocks from RAID groups that have no failed drives. If the disk encounters a problem in reading a sector during scrubbing operations, it will use self-recovery operations in a further attempt at obtaining the data indicated by the I/O operation. If it proves to be a non-recoverable error, the data of that sector is recomputed by the storage operating system using RAID techniques from the contents of the remaining disks of the array and then written again to disk.
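A scrub pass of the kind described above might be sketched as follows. The in-memory representation is an assumption made for the example: each stripe is a list of blocks, one per disk, and `None` stands for a sector whose read ended in a non-recoverable error. The function names are hypothetical, and the sketch covers only a group with no failed drives.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of a non-empty list of equally sized blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def scrub(stripes):
    """Read every stripe; recompute and rewrite any single unreadable block.

    Returns a list of (stripe_index, disk_index) pairs that were repaired.
    Stripes with more than one unreadable block are left alone, since single
    parity cannot recompute them.
    """
    repaired = []
    for s, stripe in enumerate(stripes):
        bad = [d for d, block in enumerate(stripe) if block is None]
        if len(bad) == 1:
            survivors = [block for block in stripe if block is not None]
            stripe[bad[0]] = xor_blocks(survivors)   # written again to disk
            repaired.append((s, bad[0]))
    return repaired


stripes = [
    [b"\x01", b"\x02", b"\x03"],   # healthy stripe; the last block is parity
    [b"\x0a", None, b"\x0f"],      # middle block hit a non-recoverable error
]
repaired = scrub(stripes)
```

Finding and repairing such sectors while the group is still fully redundant is the point of scrubbing, since the same error encountered in degraded mode could not be repaired from parity.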
Most disks provide a pool of spare sectors for use in dealing with media errors. The spares pool is a set of entries of one or more contiguous sectors in length on the disk drive. The size of the spares pool varies, but is generally proportional to the overall size of the disk and is deemed sufficient to accommodate the expected sector failures during the drive's normal life expectancy. In one example, the spares pool may be between 2,000 and 10,000 entries long.
When a faulty sector is encountered, a command is issued to reassign the sector, i.e., to change the assigned on-disk storage location for the data from the faulty sector to a new sector selected from the spares pool. As a result, the faulty sector is no longer used, its contents (data) are written to the new sector, and references to it are mapped to the new sector. Faulty sectors that have been reassigned are typically enumerated in a defect list that resides on the disk and is associated with the spares pool. Note that requests to add sectors to the defect list are typically made by the storage operating system. In general, operating systems will not reassign a slow-reading sector's data to a new sector in the spares pool as it may be considered wasteful of spares space. However, a sector with a recovered error is typically noted as such by a fully capable operating system, such as Data ONTAP. This recovered error information is, therefore, logged each time a read or write to the questionable sector is attempted. For a large-sized, high-density drive (gigabyte and terabyte-sized), or array of such drives, this constant (and ever-increasing) logging of information can reduce the efficiency of the storage operating system.
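The reassignment mechanism described above can be modeled with a short sketch. The class `SpareSectorDisk`, its fields and its single-sector spare entries are hypothetical simplifications; in an actual drive the remapping is performed by drive firmware in response to a reassign command, and the data written to the new sector is supplied by the caller (for example, after self-recovery or RAID reconstruction).

```python
class SpareSectorDisk:
    """Toy disk with a spares pool and a defect list (hypothetical model)."""

    def __init__(self, num_sectors, num_spares):
        self.media = {}                        # physical sector -> data
        # Spare sectors are placed after the normal sectors in this model.
        self.spares = list(range(num_sectors, num_sectors + num_spares))
        self.remap = {}                        # faulty sector -> spare sector
        self.defect_list = []                  # reassigned faulty sectors

    def _physical(self, sector):
        """Follow the remapping, if any, for a sector."""
        return self.remap.get(sector, sector)

    def write(self, sector, data):
        self.media[self._physical(sector)] = data

    def read(self, sector):
        return self.media.get(self._physical(sector))

    def reassign(self, sector, data):
        """Map a faulty sector to a spare and write its contents there."""
        if not self.spares:
            raise RuntimeError("spares pool exhausted")
        self.remap[sector] = self.spares.pop(0)
        if sector not in self.defect_list:
            self.defect_list.append(sector)
        self.write(sector, data)               # now lands on the spare sector


disk = SpareSectorDisk(num_sectors=100, num_spares=2)
disk.write(7, b"payload")
disk.reassign(7, b"payload")                   # sector 7 found to be faulty
```

After reassignment, reads and writes addressed to the faulty sector are served from the spare, and the pool shrinks by one entry, which is why spares are treated as a limited resource.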
As the size of disk drives grows ever larger and densities within disk media become greater, the frequency of recovered errors increases as well. Moreover, these errors often eventually become non-recoverable errors if not addressed promptly. And even if they remain recoverable, they greatly increase storage operating system overhead associated with error recovery mechanisms and logging, as described above.