A storage system typically comprises one or more storage devices into which information may be entered, and from which information may be obtained, as desired. The storage system includes a storage operating system that functionally organizes the system by, inter alia, invoking storage operations in support of a storage service implemented by the system. The storage system may be implemented in accordance with a variety of storage architectures including, but not limited to, a network-attached storage environment, a storage area network and a disk assembly directly attached to a client or host computer. The storage devices are typically disk drives organized as a disk array, wherein the term “disk” commonly describes a self-contained rotating magnetic media storage device. The term disk in this context is synonymous with hard disk drive (HDD) or direct access storage device (DASD).
Storage of information on the disk array is preferably implemented as one or more storage “volumes” that comprise a cluster of physical disks, defining an overall logical arrangement of disk space. The disks within a volume are typically organized as one or more groups, wherein each group is operated as a Redundant Array of Independent (or Inexpensive) Disks (RAID). In this context, a RAID group is defined as a number of disks and an address/block space associated with those disks. The term “RAID” and its various implementations are well known and disclosed in A Case for Redundant Arrays of Inexpensive Disks (RAID), by D. A. Patterson, G. A. Gibson and R. H. Katz, Proceedings of the International Conference on Management of Data (SIGMOD), June 1988.
The storage operating system of the storage system may implement a file system to logically organize the information as a hierarchical structure of directories, files and blocks on the disks. For example, each “on-disk” file may be implemented as a set of data structures, i.e., disk blocks, configured to store information, such as the actual data for the file. The storage operating system may also implement a RAID system that manages the storage and retrieval of the information to and from the disks in accordance with write and read operations. There is typically a one-to-one mapping between the information stored on the disks in, e.g., a disk block number space, and the information organized by the file system in, e.g., volume block number space.
A common type of file system is a “write in-place” file system, an example of which is the conventional Berkeley fast file system. In a write in-place file system, the locations of the data structures, such as data blocks, on disk are typically fixed. Changes to the data blocks are made “in-place”; if an update to a file extends the quantity of data for the file, an additional data block is allocated. Another type of file system is a write-anywhere file system that does not overwrite data on disks. If a data block on disk is retrieved (read) from disk into a memory of the storage system and “dirtied” with new data, the data block is stored (written) to a new location on disk to thereby optimize write performance. A write-anywhere file system may initially assume an optimal layout such that the data is substantially contiguously arranged on disks. The optimal disk layout results in efficient access operations, particularly for sequential read operations, directed to the disks. An example of a write-anywhere file system that is configured to operate on a storage system is the Write Anywhere File Layout (WAFL™) file system available from Network Appliance, Inc., Sunnyvale, Calif.
Most RAID implementations enhance the reliability/integrity of data storage through the redundant writing of data “stripes” across a given number of physical disks in the RAID group, and the appropriate storing of redundant information with respect to the striped data. The redundant information, e.g., parity information, enables recovery of data lost when a disk fails. A parity value may be computed by summing (usually modulo 2) data of a particular word size (usually one bit) across a number of similar disks holding different data and then storing the results on an additional similar disk. That is, parity may be computed on vectors 1-bit wide, composed of bits in corresponding positions on each of the disks. When computed on vectors 1-bit wide, the parity can be either the computed sum or its complement; these are referred to as even and odd parity respectively. Addition and subtraction on 1-bit vectors are both equivalent to exclusive-OR (XOR) logical operations. The data is then protected against the loss of any one of the disks, or of any portion of the data on any one of the disks. If the disk storing the parity is lost, the parity can be regenerated from the data. If one of the data disks is lost, the data can be regenerated by adding the contents of the surviving data disks together and then subtracting the result from the stored parity.
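The parity computation described above can be sketched in a few lines of Python. This is a minimal illustration only, not the method of any particular RAID implementation: the function name `xor_parity` and the sample byte values are assumptions introduced here, and each `bytes` object stands in for the block held by one data disk.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the even-parity block: the byte-wise XOR of equal-length blocks."""
    if not blocks:
        raise ValueError("at least one data block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) groups the bytes found at the same offset on every disk.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Three hypothetical data disks, each contributing a two-byte block.
data = [bytes([0b1010, 0xFF]), bytes([0b0110, 0x0F]), bytes([0b0001, 0xF0])]

parity = xor_parity(data)                      # even parity (the modulo-2 sum)
odd_parity = bytes(b ^ 0xFF for b in parity)   # odd parity (its complement)
```

Because XOR operates on each bit independently, computing it over whole bytes is the same as computing it over the 1-bit-wide vectors described in the text.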
Typically, the disks are divided into parity groups, each of which comprises one or more data disks and a parity disk. A parity set is a set of blocks, including several data blocks and one parity block, where the parity block is the XOR of all the data blocks. A parity group is a set of disks from which one or more parity sets are selected. The disk space is divided into stripes, with each stripe containing one block from each disk. The blocks of a stripe are usually at the same locations on each disk in the parity group. Within a stripe, all but one of the blocks contain data (“data blocks”), while the remaining block contains parity (“parity block”) computed as the XOR of all the data.
As used herein, the term “encoding” means the computation of a redundancy value over a predetermined subset of data blocks, whereas the term “decoding” means the reconstruction of a data or parity block by the same process as the redundancy computation using a subset of data blocks and redundancy values. If one disk fails in the parity group, the contents of that disk can be decoded (reconstructed) on a spare disk or disks by adding all the contents of the remaining data blocks and subtracting the result from the parity block. Since two's complement addition and subtraction over 1-bit fields are both equivalent to XOR operations, this reconstruction consists of the XOR of all the surviving data and parity blocks. Similarly, if the parity disk is lost, it can be recomputed in the same way from the surviving data.
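The symmetry between encoding and decoding can be shown with a short Python sketch. It assumes a single stripe held in memory as a list of `bytes` blocks with the parity block last; the helper names `xor_blocks` and `reconstruct` and the sample values are hypothetical and chosen only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def reconstruct(stripe, lost_index):
    """Rebuild the block at lost_index from every surviving block in the stripe.

    The same XOR handles a lost data block and a lost parity block.
    """
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


# Encoding: three data blocks plus the parity block computed over them.
data = [b"\x12\x34", b"\xab\xcd", b"\x0f\xf0"]
stripe = data + [xor_blocks(data)]

# Decoding: losing any single block, data or parity, is recoverable.
recovered = [reconstruct(stripe, i) for i in range(len(stripe))]
```

Here `recovered` equals `stripe` element for element, which reflects the statement above that decoding uses the same process as the redundancy computation.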
If the parity blocks are all stored on one disk, thereby providing a single disk that contains all (and only) parity information, a RAID-4 level implementation is provided. The RAID-4 implementation is conceptually the simplest form of advanced RAID (i.e., more than striping and mirroring) since it fixes the position of the parity information in each RAID group. In particular, a RAID-4 implementation provides protection from single disk errors with a single additional disk, while making it easy to incrementally add data disks to a RAID group.
If the parity blocks are contained within different disks in each stripe, in a rotating pattern, then the implementation is RAID-5. Most commercial implementations that use advanced RAID techniques use RAID-5 level implementations, which distribute the parity information. A motivation for choosing a RAID-5 implementation is that, for most static file systems, using a RAID-4 implementation would limit write throughput. Such static file systems tend to scatter write data across many stripes in the disk array, causing the parity disks to seek for each stripe written. However, a write-anywhere file system, such as the WAFL file system, does not have this issue since it concentrates write data on a few nearby stripes.
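The difference between the two placements can be made concrete with a small Python sketch. The RAID-4 function follows directly from the description above. The RAID-5 function uses one simple modular rotation, which is an assumption made here for illustration; actual implementations use a variety of rotation patterns. Both function names are hypothetical.

```python
def raid4_parity_disk(stripe, num_disks):
    """RAID-4: parity always lives on one dedicated disk (here, the last one)."""
    return num_disks - 1


def raid5_parity_disk(stripe, num_disks):
    """RAID-5: parity rotates across disks (assumed simple modular rotation)."""
    return (num_disks - 1 - stripe) % num_disks


num_disks = 4
raid4_layout = [raid4_parity_disk(s, num_disks) for s in range(8)]
raid5_layout = [raid5_parity_disk(s, num_disks) for s in range(8)]
# raid4_layout -> [3, 3, 3, 3, 3, 3, 3, 3]
# raid5_layout -> [3, 2, 1, 0, 3, 2, 1, 0]
```

Under RAID-4 every stripe written touches disk 3 for its parity update, whereas the rotation spreads those updates over all four disks.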
Use of a RAID-4 level implementation in a write-anywhere file system is a desirable way of allowing incremental capacity increase while retaining performance; however, there are some “hidden” downsides. First, where all the disks in a RAID group are available for servicing read traffic in a RAID-5 implementation, one of the disks (the parity disk) does not participate in such traffic in the RAID-4 implementation. Although this effect is insignificant for large RAID group sizes, those group sizes have been decreasing because of, e.g., a limited number of available disks or increasing reconstruction times of larger disks. As disks continue to increase in size, smaller RAID group configurations become more attractive. But this increases the fraction of disks unavailable to service read operations in a RAID-4 configuration. The use of a RAID-4 level implementation may therefore result in significant loss of read operations per second. Second, when a new disk is added to a full volume, the write-anywhere file system tends to direct most of the write data traffic to the new disk, which is where most of the free space is located.
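The first downside can be quantified with simple arithmetic. The sketch below assumes, purely for illustration, that every disk contributes equally to read throughput, so that a RAID-4 group of n disks has n - 1 disks serving reads while a RAID-5 group has all n. The function name and the group sizes are hypothetical.

```python
from fractions import Fraction


def raid4_read_loss(num_disks):
    """Fraction of a RAID-4 group's disks that cannot service reads.

    Assumes one dedicated parity disk and equal per-disk read throughput.
    """
    if num_disks < 2:
        raise ValueError("a RAID-4 group needs at least two disks")
    return Fraction(1, num_disks)


losses = {n: raid4_read_loss(n) for n in (16, 8, 4)}
# losses -> {16: Fraction(1, 16), 8: Fraction(1, 8), 4: Fraction(1, 4)}
```

Under these assumptions, shrinking a group from 16 disks to 4 raises the share of disks idle for reads from about 6% to 25%.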
The RAID system typically keeps track of allocated data in a RAID-5 level implementation of the disk array. To that end, the RAID system reserves parity blocks in a fixed pattern that is simple to compute and that allows efficient identification of the non-data (parity) blocks. However, adding new individual disks to a RAID group of a RAID-5 level implementation typically requires repositioning of the parity information across the old and new disks in each stripe of the array to maintain the fixed pattern. Repositioning of the parity information typically requires use of a complex (and costly) parity block redistribution scheme that “sweeps through” the old and new disks, copying both parity and data blocks to conform to the new distribution. The parity redistribution scheme further requires a mechanism to identify which blocks contain data and to ensure, per stripe, that there are not too many data blocks allocated so that there is sufficient space for the parity information. As a result of the complexity and cost of such a scheme, most RAID-5 implementations relinquish the ability to add individual disks to a RAID group and, instead, use a fixed RAID group size (usually in the 4–8 disk range). Disk capacity is then increased a full RAID group at a time. Yet, the use of small RAID groups translates to high parity overhead, whereas the use of larger RAID groups means having a high cost for incremental capacity.
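To see why a fixed pattern forces redistribution, consider the following Python sketch. It assumes a simple modular rotation for parity placement, which is one hypothetical pattern among many, and counts how many stripes would need their parity block moved when a fifth disk is added to a four-disk group. The function names are illustrative only.

```python
def parity_disk(stripe, num_disks):
    """Parity location under an assumed simple modular rotation."""
    return (num_disks - 1 - stripe) % num_disks


def stripes_needing_move(old_disks, new_disks, num_stripes):
    """Count stripes whose parity location differs between the two group sizes."""
    return sum(
        parity_disk(s, old_disks) != parity_disk(s, new_disks)
        for s in range(num_stripes)
    )


# Growing a 4-disk group to 5 disks, examined over one full period of 20 stripes.
moved = stripes_needing_move(4, 5, 20)
# moved -> 16
```

Under this assumed rotation, 16 of every 20 stripes need their parity relocated after a single disk is added, which suggests the scale of the sweep described above.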
Therefore, it is desirable to provide a distribution system that enables a storage system to distribute parity evenly, or nearly evenly, among disks of the system, while retaining the capability of incremental disk addition.
In addition, it is desirable to provide a distribution system that enables a write-anywhere file system of a storage system to run with better performance in smaller (RAID group) configurations.