A storage system typically comprises one or more storage devices into which information may be entered, and from which information may be obtained, as desired. The storage system includes a storage operating system that functionally organizes the system by, inter alia, invoking storage operations in support of a storage service implemented by the system. The storage system may be implemented in accordance with a variety of storage architectures including, but not limited to, a network-attached storage environment, a storage area network and a disk assembly directly attached to a client or host computer. The storage devices are typically disk drives organized as a disk array, wherein the term “disk” commonly describes a self-contained rotating magnetic media storage device. The term disk in this context is synonymous with hard disk drive (HDD) or direct access storage device (DASD).
Storage of information on the disk array is preferably implemented as one or more storage “volumes” that comprise a cluster of physical disks, defining an overall logical arrangement of disk space. The disks within a volume are typically organized as one or more groups, wherein each group is operated as a Redundant Array of Independent (or Inexpensive) Disks (RAID). In this context, a RAID group is defined as a number of disks and an address/block space associated with those disks. The term “RAID” and its various implementations are well-known and disclosed in A Case for Redundant Arrays of Inexpensive Disks (RAID), by D. A. Patterson, G. A. Gibson and R. H. Katz, Proceedings of the International Conference on Management of Data (SIGMOD), June 1988.
The storage operating system of the storage system may implement a file system to logically organize the information as a hierarchical structure of directories, files and blocks on the disks. For example, each “on-disk” file may be implemented as a set of data structures, i.e., disk blocks, configured to store information, such as the actual data for the file. The storage operating system may also implement a storage module, such as a RAID system, that manages the storage and retrieval of the information to and from the disks in accordance with write and read operations. It should be noted that the RAID system may also be embodied as a RAID controller of a RAID array; accordingly, the term “RAID system” as used herein denotes a hardware, software, firmware (or combination thereof) implementation. There is typically a one-to-one mapping between the information stored on the disks in, e.g., a disk block number space, and the information organized by the file system in, e.g., volume block number space.
A common type of file system is a “write in-place” file system, an example of which is the conventional Berkeley fast file system. In a write in-place file system, the locations of the data structures, such as data blocks, on disk are typically fixed. Changes to the data blocks are made “in-place”; if an update to a file extends the quantity of data for the file, an additional data block is allocated. Another type of file system is a write-anywhere file system that does not overwrite data on disks. If a data block on disk is retrieved (read) from disk into a memory of the storage system and “dirtied” with new data, the data block is stored (written) to a new location on disk to thereby optimize write performance. A write-anywhere file system may initially assume an optimal layout such that the data is substantially contiguously arranged on disks. The optimal disk layout results in efficient access operations, particularly for sequential read operations, directed to the disks. An example of a write-anywhere file system that is configured to operate on a storage system is the Write Anywhere File Layout (WAFL™) file system available from Network Appliance, Inc., Sunnyvale, Calif.
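The difference between the two write policies can be illustrated with a toy model. The following Python sketch is illustrative only: the class names, the dictionary-based "disk," and the simple next-free-block allocator are assumptions made for this example and do not describe the Berkeley fast file system or the WAFL file system.

```python
class WriteInPlaceStore:
    """Toy model: a file block keeps its disk location across updates."""

    def __init__(self):
        self.disk = {}        # disk block number -> data
        self.file_map = {}    # file block index -> disk block number
        self.next_free = 0    # hypothetical allocator: next unused disk block

    def write(self, file_block, data):
        # Allocate a disk block only the first time this file block is written.
        if file_block not in self.file_map:
            self.file_map[file_block] = self.next_free
            self.next_free += 1
        self.disk[self.file_map[file_block]] = data


class WriteAnywhereStore(WriteInPlaceStore):
    """Toy model: every update goes to a new disk location."""

    def write(self, file_block, data):
        # Never overwrite: place the dirtied block at the next free location
        # and repoint the file map at it.
        self.disk[self.next_free] = data
        self.file_map[file_block] = self.next_free
        self.next_free += 1


in_place = WriteInPlaceStore()
in_place.write(0, "old")
in_place.write(0, "new")

anywhere = WriteAnywhereStore()
anywhere.write(0, "old")
anywhere.write(0, "new")
```

In this sketch the in-place store ends with the file block still at disk block 0, whereas the write-anywhere store has moved it to disk block 1 and left the old contents untouched at disk block 0.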
Most RAID implementations enhance the reliability/integrity of data storage through the redundant writing of data “stripes” across a given number of physical disks in the RAID group, and the appropriate storing of redundant information with respect to the striped data. The redundant information, e.g., parity information, enables recovery of data lost when a disk fails. A parity value may be computed by summing (usually modulo 2) data of a particular word size (usually one bit) across a number of similar disks holding different data and then storing the results on an additional similar disk. That is, parity may be computed on vectors 1-bit wide, composed of bits in corresponding positions on each of the disks. When computed on vectors 1-bit wide, the parity can be either the computed sum or its complement; these are referred to as even and odd parity respectively. Addition and subtraction on 1-bit vectors are both equivalent to exclusive-OR (XOR) logical operations. The data is then protected against the loss of any one of the disks, or of any portion of the data on any one of the disks. If the disk storing the parity is lost, the parity can be regenerated from the data. If one of the data disks is lost, the data can be regenerated by adding the contents of the surviving data disks together and then subtracting the result from the stored parity.
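By way of illustration only, the parity arithmetic described above may be sketched in Python as follows. The function names and the sample two-byte blocks are hypothetical; the bytewise XOR is simply the 1-bit-wide modulo-2 sum applied to eight bit positions at a time.

```python
from functools import reduce


def parity_bit(bits, odd=False):
    """Sum the bits modulo 2; the complement of the sum gives odd parity."""
    total = reduce(lambda a, b: a ^ b, bits, 0)
    return total ^ 1 if odd else total


def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks (even parity per bit position)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# Three hypothetical data disks, each holding one two-byte block.
data = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]
parity = xor_blocks(data)

# Simulate the loss of data disk 1: XOR the survivors with the stored parity.
recovered = xor_blocks([data[0], data[2], parity])
```

Because addition and subtraction coincide with XOR on 1-bit vectors, the same `xor_blocks` routine serves both to compute the parity and to regenerate the lost block.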
Typically, the disks are divided into parity groups, each of which comprises one or more data disks and a parity disk. A parity set is a set of blocks, including several data blocks and one parity block, where the parity block is the XOR of all the data blocks. A parity group is a set of disks from which one or more parity sets are selected. The disk space is divided into stripes, with each stripe containing one block from each disk. The blocks of a stripe are usually at the same locations on each disk in the parity group. Within a stripe, all but one of the blocks contain data (“data blocks”), while the remaining block contains parity (“parity block”) computed by the XOR of all the data.
If the parity blocks are all stored on one disk, thereby providing a single disk that contains all (and only) parity information, a RAID-4 level implementation is provided. The RAID-4 implementation is conceptually the simplest form of advanced RAID (i.e., more than striping and mirroring) since it fixes the position of the parity information in each RAID group. In particular, a RAID-4 implementation provides protection from single disk errors with a single additional disk, while making it easy to incrementally add data disks to a RAID group.
If the parity blocks are contained within different disks in each stripe, in a rotating pattern, then the implementation is RAID-5. Most commercial implementations that use advanced RAID techniques use RAID-5 level implementations, which distribute the parity information. A motivation for choosing a RAID-5 implementation is that, for most read-optimizing file systems, using a RAID-4 implementation would limit write throughput. Such read-optimizing file systems tend to scatter write data across many stripes in the disk array, causing the parity disks to seek for each stripe written. However, a write-anywhere file system, such as the WAFL file system, does not have this issue since it concentrates write data on a few nearby stripes.
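The distinction between fixed and rotating parity placement can be sketched as follows. This is a minimal illustration; the function names are hypothetical, and the particular rotation direction shown for RAID-5 is an assumption, since the text above requires only that the parity position rotate from stripe to stripe.

```python
def raid4_parity_disk(stripe, num_disks):
    """RAID-4: the parity block of every stripe is on the same disk.

    Placing it on the last disk is an assumption made for this example.
    """
    return num_disks - 1


def raid5_parity_disk(stripe, num_disks):
    """RAID-5: the parity block moves to a different disk in each stripe.

    The rotation direction used here is one assumed choice among several.
    """
    return (num_disks - 1 - stripe) % num_disks


raid4_positions = [raid4_parity_disk(s, 4) for s in range(5)]
raid5_positions = [raid5_parity_disk(s, 4) for s in range(5)]
```

For a four-disk group, the RAID-4 positions are constant while the RAID-5 positions cycle through every disk, which is what spreads parity writes across the group.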
As used herein, the term “encoding” means the computation of a redundancy value over a predetermined subset of data blocks, whereas the term “decoding” means the reconstruction of a data or parity block using a subset of data blocks and redundancy values. In RAID-4 and RAID-5, if one disk fails in the parity group, the contents of that disk can be decoded (reconstructed) on a spare disk or disks by adding all the contents of the remaining data blocks and subtracting the result from the parity block. Since two's complement addition and subtraction over 1-bit fields are both equivalent to XOR operations, this reconstruction consists of the XOR of all the surviving data and parity blocks. Similarly, if the parity disk is lost, it can be recomputed in the same way from the surviving data.
Parity schemes generally provide protection against a single disk failure within a parity group. These schemes can also protect against multiple disk failures as long as each failure occurs within a different parity group. However, if two disks fail concurrently within a parity group, then an unrecoverable loss of data is suffered. Failure of two disks concurrently within a parity group is a fairly common occurrence, particularly because disks wear out and because of environmental factors with respect to the operation of the disks. In this context, the failure of two disks concurrently within a parity group is referred to as a “double failure”. A double failure typically arises as a result of a failure of one disk and a subsequent failure of another disk while attempting to recover from the first failure. For example, a common source of double failure is a single failed disk combined with a single media failure (i.e., a single unreadable block in a row).
Symmetry is herein defined to mean rotational symmetry among disks of an array with respect to both the parity construction and disk reconstruction algorithms. More precisely, in an n-disk array where the disks are numbered from 0 to m−1, for m>=n, re-numbering the disks by rotating them by some arbitrary amount k, such that disk j becomes disk (j+k) modulo m, does not change the parity calculations or the results of those calculations. In addition, uniformity is herein defined to mean that the same algorithm is used to compute the missing contents of a stripe, regardless of which disks are missing, and regardless of the positions of those disks in the array. If an array is uniform, it means that the algorithm used to construct redundant information or parity is the same no matter which disks in the array hold parity. It also means that the same algorithm can be used to reconstruct failed disks, regardless of what disks failed or whether they held redundant data or file system data.
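The renumbering in the definition of symmetry can be made concrete with a short sketch. The example below is illustrative only: it uses single row parity, for which symmetry holds trivially because XOR is commutative, and it takes m equal to n. The function names and sample block values are hypothetical.

```python
def rotate_disks(blocks, k):
    """Renumber the disks so that disk j becomes disk (j + k) modulo m."""
    m = len(blocks)
    rotated = [None] * m
    for j, block in enumerate(blocks):
        rotated[(j + k) % m] = block
    return rotated


def stripe_parity(blocks):
    """Modulo-2 sum of one block (here, a small integer) from each disk."""
    total = 0
    for block in blocks:
        total ^= block
    return total


def reconstruct(blocks, missing):
    """XOR of every block except the one on the missing disk."""
    return stripe_parity(b for i, b in enumerate(blocks) if i != missing)


# A four-disk stripe that already satisfies even parity.
blocks = [0b1010, 0b0110, 0b0001, 0b1101]
k = 3
rotated = rotate_disks(blocks, k)
```

Reconstructing disk j in the original numbering and disk (j + k) modulo m in the rotated numbering yields the same block, which is the property the definition requires of a symmetric scheme.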
A scheme incorporating a uniform algorithm allows use of all disks in the array to store data, with the parity blocks rotated or otherwise distributed among the disks to different locations in different stripes. All disks can then be used during read operations, while still achieving high-performance full and partial stripe write operations. Furthermore, data structures, such as meta-data mapping files, may be configured to specify which disks contain parity and data in any particular stripe; these “maps” may be managed by the file system, separately from the RAID system, since the maps are not needed to perform reconstruction.
A typical single block row parity scheme can be used to implement a single failure-correcting scheme. All blocks in a stripe contribute equally to the invariant that the total parity of each bit position summed across all the blocks is even. It should be noted that the parity may be even or odd, as long as the parity value is known (predetermined); the following description herein is directed to the use of even parity. Therefore, during reconstruction, it is not necessary to know which blocks hold data and which hold parity. The lost block is reconstructed by summing, modulo-2, the bits in corresponding bit positions across each block. The sum is the missing block, since adding this block value to the sum of the others will produce zeros, indicating even parity for the stripe. This is true whether the missing block is a parity or data block.
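The position-independence of this reconstruction can be sketched as follows. The sketch assumes even parity, as in the description above; the function name and the sample blocks are hypothetical.

```python
def reconstruct_block(stripe, missing):
    """Rebuild the block at index `missing` by XORing all surviving blocks.

    No distinction is made between data and parity blocks: under even
    parity the XOR of the survivors is the lost block.
    """
    survivors = [b for i, b in enumerate(stripe) if i != missing]
    out = bytearray(len(survivors[0]))
    for block in survivors:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# Three data blocks plus their XOR, which is placed at an arbitrary
# position (index 1) to show that the parity location does not matter.
d0, d1, d2 = b"\x01\x02", b"\x04\x08", b"\x10\x20"
parity = bytes(a ^ b ^ c for a, b, c in zip(d0, d1, d2))
stripe = [d0, parity, d1, d2]

rebuilt = [reconstruct_block(stripe, i) for i in range(len(stripe))]
```

Each entry of `rebuilt` equals the corresponding entry of `stripe`, whether the lost block held data or parity.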
To establish even parity in each stripe, one degree of freedom is needed. This is the content of the parity block, which is determined by the contents of all the data blocks. The RAID system is not free to change data block contents in a simple parity encoding, so the parity of the stripe can only be brought to the neutral even parity condition by setting the content of the single parity block.
A row-diagonal (RD) parity technique provides double failure parity correcting recovery using row and diagonal parity in a disk array. The RD parity technique may be used in an array comprising a number n of storage devices, such as disks, including a row parity disk and a diagonal parity disk, wherein n=p+1 and p is a prime number. The disks are divided into blocks and the blocks are organized into stripes, wherein each stripe comprises (n−2) rows. The diagonal parity disk stores parity information computed along diagonal parity sets (“diagonals”) of the array. The diagonals are defined such that every diagonal covers all but one of the disks in the stripe.
The blocks in the stripe are organized into (n−1) diagonals, each of which contains (n−2) blocks from the data and row parity disks, and all but one of which stores its parity in a block on the diagonal parity disk. Within a stripe, a diagonal parity block does not participate in the row parity set. However, implementation of the RD parity technique requires knowledge of which parity diagonal is the missing diagonal, i.e., the diagonal for which a parity block is not computed or stored. The RD parity technique is described in U.S. patent application Ser. No. 10/035,607 titled Row-Diagonal Parity Technique for Enabling Efficient Recovery from Double Failures in a Storage Array, by Peter F. Corbett et al., filed on Dec. 28, 2001, which application is hereby incorporated by reference as though fully set forth herein.
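The organization of rows and diagonals may be sketched as follows. This is a minimal sketch under stated assumptions rather than the method of the incorporated application: the block in row i of disk j is assumed to belong to diagonal (i + j) modulo p, the row parity disk is assumed to be disk p−1, the diagonal parity disk is assumed to be disk p, and diagonal p−1 is assumed to be the missing diagonal. Blocks are modeled as small integers, and the function names are hypothetical.

```python
def rd_layout(p):
    """Diagonal number of each block in a stripe of (p - 1) rows.

    Columns 0..p-1 are the data and row parity disks; the last column is
    the diagonal parity disk, whose block in row i holds diagonal i.
    """
    rows = p - 1
    return [[(i + j) % p for j in range(p)] + [i] for i in range(rows)]


def rd_encode(data, p):
    """Append row parity and diagonal parity to (p - 1) rows of data.

    `data` holds (p - 1) rows of (p - 1) integers, one per data disk.
    """
    rows = p - 1
    stripe = []
    for i in range(rows):
        row_parity = 0
        for value in data[i]:
            row_parity ^= value
        stripe.append(list(data[i]) + [row_parity])

    # Accumulate all p diagonals over the data and row parity disks.
    diagonal = [0] * p
    for i in range(rows):
        for j in range(p):
            diagonal[(i + j) % p] ^= stripe[i][j]

    # Store diagonals 0..p-2; the parity of diagonal p-1 is not stored.
    for i in range(rows):
        stripe[i].append(diagonal[i])
    return stripe


layout = rd_layout(5)
stripe = rd_encode([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 0]], 5)
```

With p=5 the layout has four rows and six disks, each of the five diagonals contains four blocks from the data and row parity disks, and only four of the five diagonal parities are written to the diagonal parity disk.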
A typical RAID-4 array or a stripe of a RAID-5 array, each having a collection of data disks and a designated row parity disk, may be configured to implement the RD parity technique. To allow implementation of the RD parity technique and, thus, reconstruction from any two disk failures in the array, the RAID-4 array or stripe of the RAID-5 array is extended through the addition of a diagonal parity disk. The number of disks in the parity set is (p+1), and the number of rows of blocks needed to complete a double disk failure tolerant group of stripes is (p−1). Of the (p+1) disks, at least two must contain parity information, and exactly one of those disks must contain the diagonal parity for the array. One or more other redundant disks contain row parity information. The remaining (p−1) or fewer disks contain data. Any number of these disks can be left out of the array. Disks that are not present are assumed to contain all zeros for the purposes of parity calculations. This allows the use of arrays having different sizes and the ability to add data disks to an existing array without recalculating parity.
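The effect of treating absent disks as all zeros can be illustrated with row parity alone. The sketch below is illustrative only; the names and sample values are hypothetical, and it assumes that a newly added data disk is zero-filled at the time it joins the array.

```python
def row_parity(blocks):
    """Modulo-2 sum of one block (here, a small integer) from each disk."""
    parity = 0
    for block in blocks:
        parity ^= block
    return parity


MAX_DATA_DISKS = 4  # (p - 1) data disks for p = 5

# Only two data disks are physically present; the rest count as zeros.
present = [0x3C, 0xA5]
padded = present + [0] * (MAX_DATA_DISKS - len(present))
parity_before = row_parity(padded)

# Adding a zero-filled disk fills a slot that was already counted as zero,
# so the stored parity remains valid without being recalculated.
after_add = present + [0x00] + [0] * (MAX_DATA_DISKS - len(present) - 1)
parity_after = row_parity(after_add)
```

Since XOR with zero leaves a value unchanged, the parity computed over the padded array equals the parity over the disks actually present, both before and after the addition.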
FIG. 1 is a schematic block diagram of a disk array 100 that is configured according to a row-diagonal (RD) parity arrangement, wherein p=5. The numbers in each position correspond to the diagonal parity set to which the block (or sub-block) belongs. The diagonal parity blocks are the modulo-2 sum of the blocks in the corresponding diagonal. The row parity blocks are the modulo-2 sum of all the data blocks in the corresponding row. In accordance with the RD parity technique, row parity is computed across all disks of a stripe, except the diagonal parity disk. The diagonal parity disk has a unique function that is not uniform or symmetric with respect to all other disks of the array, i.e., the diagonal parity disk only stores diagonal parity and does not participate in row parity calculations. The RD parity double failure-correcting technique is therefore not symmetric and the present invention is directed to providing a double failure-correcting technique that is symmetric.
An advantage of varying the locations of data and parity blocks in the array from stripe to stripe is that it may improve read performance. If all disks contain data in a uniform proportion, then the read workload will likely be balanced across all the disks. An advantage of uniformity is that neither performance nor the algorithm depends on the positions of the redundant blocks in the stripe. Therefore, the decision of which blocks to use to store parity can be made without any bias due to the impact the choice might have on the performance of redundant data construction. This allows full flexibility in choosing which disks hold redundant data in each stripe.