A storage system typically comprises one or more storage devices into which data may be entered, and from which data may be obtained, as desired. The storage system may be implemented in accordance with a variety of storage architectures including, but not limited to, a network-attached storage environment, a storage area network and a disk assembly directly attached to a client or host computer. The storage devices are typically disk drives, wherein the term “disk” commonly describes a self-contained rotating magnetic media storage device. The term “disk” in this context is synonymous with hard disk drive (HDD) or direct access storage device (DASD).
The disks within a storage system are typically organized as one or more groups, wherein each group is operated as a Redundant Array of Independent (or Inexpensive) Disks (RAID). Most RAID implementations enhance the reliability/integrity of data storage through the redundant writing of data “stripes” across a given number of physical disks in the RAID group, and the appropriate storing of redundant information with respect to the striped data. The redundant information enables recovery of data lost when a storage device fails.
In the operation of a disk array, it is anticipated that a disk can fail. A goal of a high performance storage system is to make the mean time to data loss (MTTDL) as long as possible, preferably much longer than the expected service life of the system. Data can be lost when one or more disks fail, making it impossible to recover data from the device. Typical schemes to avoid loss of data include mirroring, backup and parity protection. Mirroring is an expensive solution in terms of consumption of storage resources, such as disks. Backup does not protect data modified since the backup was created. Parity schemes are common because they provide a redundant encoding of the data that allows for a single erasure (loss of one disk) with the addition of just one disk drive to the system.
Parity protection is used in computer systems to protect against loss of data on a storage device, such as a disk. A parity value may be computed by summing (usually modulo 2) data of a particular word size (usually one bit) across a number of similar disks holding different data and then storing the results on an additional similar disk. That is, parity may be computed on vectors 1-bit wide, composed of bits in corresponding positions on each of the disks. When computed on vectors 1-bit wide, the parity can be either the computed sum or its complement; these are referred to as even and odd parity respectively. Addition and subtraction on 1-bit vectors are both equivalent to exclusive-OR (XOR) logical operations. The data is then protected against the loss of any one of the disks, or of any portion of the data on any one of the disks. If the disk storing the parity is lost, the parity can be regenerated from the data. If one of the data disks is lost, the data can be regenerated by adding the contents of the surviving data disks together and then subtracting the result from the stored parity.
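By way of illustration only, the parity computation described above may be sketched as follows. The function names and sample bit values are hypothetical and are chosen purely for the example; the sketch assumes one bit per disk at a given position and, for the bytewise form, equally sized blocks.

```python
from functools import reduce

def even_parity(bits):
    """Modulo-2 sum of the bits in corresponding positions on each disk."""
    return sum(bits) % 2

def odd_parity(bits):
    """Complement of the modulo-2 sum."""
    return 1 - even_parity(bits)

# Hypothetical sample: one bit position across four data disks.
data_bits = [1, 0, 1, 1]
p_even = even_parity(data_bits)
p_odd = odd_parity(data_bits)

# On 1-bit values, addition and subtraction modulo 2 both agree with XOR.
for a in (0, 1):
    for b in (0, 1):
        assert (a + b) % 2 == a ^ b
        assert (a - b) % 2 == a ^ b

def xor_parity(blocks):
    """Even parity over whole blocks: bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))
```

Applying XOR to whole bytes computes the same 1-bit parity for eight bit positions at once, which is why block-level implementations generally operate on bytes or machine words rather than on individual bits.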
Typically, the disks are divided into parity groups, each of which comprises one or more data disks and a parity disk. A parity set is a set of blocks, including several data blocks and one parity block, where the parity block is the XOR of all the data blocks. A parity group is a set of disks from which one or more parity sets are selected. The disk space is divided into stripes, with each stripe containing one block from each disk. The blocks of a stripe are usually at the same locations on each disk in the parity group. Within a stripe, all but one block are blocks containing data (“data blocks”) and one block is a block containing parity (“parity block”) computed by the XOR of all the data. If the parity blocks are all stored on one disk, thereby providing a single disk that contains all (and only) parity information, a RAID-4 implementation is provided. If the parity blocks are contained within different disks in each stripe, usually in a rotating pattern, then the implementation is RAID-5. The term “RAID” and its various implementations are well-known and disclosed in A Case for Redundant Arrays of Inexpensive Disks (RAID), by D. A. Patterson, G. A. Gibson and R. H. Katz, Proceedings of the International Conference on Management of Data (SIGMOD), June 1988.
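As a minimal sketch of the distinction drawn above, the following maps each stripe to the disk holding its parity block. The placement of the dedicated RAID-4 parity disk at the last index and the particular rotation used for RAID-5 are assumptions made for the example; actual implementations may use other rotating patterns.

```python
def parity_disk(stripe, num_disks, level):
    """Return the index of the disk holding the parity block of a stripe."""
    if level == 4:
        # Assumption: the dedicated parity disk is the last disk in the group.
        return num_disks - 1
    if level == 5:
        # Assumption: one possible rotating pattern, moving one disk per stripe.
        return (num_disks - 1 - stripe) % num_disks
    raise ValueError("only RAID-4 and RAID-5 are sketched here")

def layout(num_stripes, num_disks, level):
    """Rows are stripes, columns are disks; 'P' is parity, 'D' is data."""
    rows = []
    for stripe in range(num_stripes):
        p = parity_disk(stripe, num_disks, level)
        rows.append(["P" if disk == p else "D" for disk in range(num_disks)])
    return rows

if __name__ == "__main__":
    for level in (4, 5):
        print(f"RAID-{level}")
        for row in layout(4, 4, level):
            print(" ".join(row))
```

In the RAID-4 layout every stripe places its parity block in the same column, whereas in the RAID-5 layout the parity column shifts from stripe to stripe so that no single disk holds all of the parity.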
As used herein, the term “encoding” means the computation of a redundancy value over a predetermined subset of data blocks, whereas the term “decoding” means the reconstruction of a data or parity block by using a subset of data blocks and redundancy values. If one disk fails in the parity group, the contents of that disk can be decoded (re-constructed) on a spare disk or disks by adding all the contents of the remaining data blocks and subtracting the result from the parity block. Since two's complement addition and subtraction over 1-bit fields are both equivalent to XOR operations, this reconstruction consists of the XOR of all the surviving data and parity blocks. Similarly, if the parity disk is lost, it can be recomputed in the same way from the surviving data.
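The encoding and decoding operations defined above may be illustrated, under the assumption of a single parity block per stripe, by the following sketch. The names `encode`, `decode` and `xor_blocks`, and the sample block contents, are hypothetical; the stripe is represented as the data blocks followed by the parity block.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def encode(data_blocks):
    """Compute the redundancy value (parity block) over the data blocks."""
    return xor_blocks(data_blocks)

def decode(stripe, missing):
    """Rebuild the block at index `missing` from the surviving blocks.

    The same XOR rebuilds a lost data block or a lost parity block.
    """
    survivors = [block for i, block in enumerate(stripe) if i != missing]
    return xor_blocks(survivors)

# Hypothetical stripe: three data blocks plus their parity block.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
stripe = data + [encode(data)]

rebuilt_data = decode(stripe, 1)     # simulate loss of the second data disk
rebuilt_parity = decode(stripe, 3)   # simulate loss of the parity disk
```

Because the decoder does not distinguish between data and parity blocks, a single routine suffices for either kind of single-disk loss in this scheme.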
Parity schemes generally provide protection against a single disk failure within a parity group. These schemes can also protect against multiple disk failures as long as each failure occurs within a different parity group. However, if two disks fail concurrently within a parity group, then an unrecoverable loss of data is suffered. Failure of two disks concurrently within a parity group is a fairly common occurrence, particularly because disks “wear out” and because of environmental factors with respect to the operation of the disks. In this context, the failure of two disks concurrently within a parity group is referred to as a “double failure”.
A double failure typically arises as a result of a failure of one disk and a subsequent failure of another disk while attempting to recover from the first failure. The recovery or reconstruction time is dependent upon the level of activity of the storage system. That is, during reconstruction of a failed disk, it is possible that the storage system remains “online” and continues to serve requests (from clients or users) to access (i.e., read and/or write) data. If the storage system is busy serving requests, the elapsed time for reconstruction increases. The reconstruction time also increases as the size and number of disks in the storage system increase, as all of the surviving disks must be read to reconstruct the lost data. Moreover, the double disk failure rate is proportional to the square of the number of disks in a parity group. Reducing the size of the parity groups lowers that rate but is expensive, as each parity group requires an entire disk devoted to redundant data.
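The dependence of the double failure rate on group size and reconstruction time may be illustrated with a simplified reliability model. The formula below is an assumption drawn from the general RAID literature rather than from the present description: it presumes independent disk failures at a constant rate, and the function name and the sample MTTF and repair-time figures are hypothetical.

```python
def mttdl_single_parity(mttf_hours, mttr_hours, disks_in_group):
    """Approximate mean time to data loss of one single-parity group.

    Assumed model: MTTDL ~ MTTF^2 / (N * (N - 1) * MTTR), where N is the
    number of disks in the group and MTTR is the reconstruction time.
    """
    n = disks_in_group
    return mttf_hours ** 2 / (n * (n - 1) * mttr_hours)

# Hypothetical figures chosen only for the example.
MTTF = 500_000.0   # hours per disk
MTTR = 24.0        # hours to reconstruct a failed disk

small_group = mttdl_single_parity(MTTF, MTTR, 8)
large_group = mttdl_single_parity(MTTF, MTTR, 16)
slow_rebuild = mttdl_single_parity(MTTF, 2 * MTTR, 8)

# Doubling the group size from 8 to 16 disks divides the MTTDL by
# (16 * 15) / (8 * 7); doubling the reconstruction time halves it.
ratio = small_group / large_group
```

Under this model the N(N - 1) term grows roughly as the square of the group size, which is consistent with the proportionality noted above, and a longer reconstruction time reduces the MTTDL in direct proportion.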
Another failure mode of disks is media read errors, wherein a single block or sector of a disk cannot be read. The unreadable data can be reconstructed if parity is maintained in the storage array. However, if one disk has already failed, then a media read error on another disk in the array will result in lost data. This is a second form of double failure. It can easily be shown that the minimum amount of redundant information required to correct a double failure is two units. Therefore, at least two parity disks must be added to the data disks. This is true whether the parity is distributed across the disks or concentrated on the two additional disks.
A known double failure correcting parity scheme is an EVENODD XOR-based technique that allows a serial reconstruction of lost (failed) disks. The EVENODD technique is disclosed in an article in IEEE Transactions on Computers, Vol. 44, No. 2, titled EVENODD: An Efficient Scheme for Tolerating Double Disk Failures in RAID Architectures, by Blaum et al., February 1995. A variant of EVENODD is disclosed in U.S. Pat. No. 5,579,475, titled METHOD AND MEANS FOR ENCODING AND REBUILDING THE DATA CONTENTS OF UP TO TWO UNAVAILABLE DASDS IN A DASD ARRAY USING SIMPLE NON-RECURSIVE DIAGONAL AND ROW PARITY, by Blaum et al., issued on Nov. 26, 1996. The above-mentioned article and patent are hereby incorporated by reference as though fully set forth herein.
In certain storage environments, it is common to utilize a significant number of lower quality disk drives, e.g., in near line storage systems used as short term storage before data is backed up to tape or other long-term archival systems. However, as the number of disks in an array increases, the probability that multiple failures will occur also increases. The probability is exacerbated by a lower mean time to failure (MTTF) of less expensive storage devices. Thus, it is possible to have storage systems experiencing triple failures, that is, the concurrent failures of three devices in the storage array. Furthermore, numerous storage protocols, such as Serial Attached SCSI (SAS) and Fibre Channel (FC), have led to increasingly complex architectures for disk shelves, with a concomitant increase in the number of failures experienced by the disk shelves. Each such failure results in loss of access to every disk connected to the failed disk shelf.
One technique for correcting triple failures is an extension of the EVENODD technique termed the STAR technique, which is described in Efficient and Effective Schemes for Streaming Media Delivery, by Cheng Wang, dated August 2005, the contents of which are hereby incorporated by reference.
A noted disadvantage of such EVENODD and/or STAR techniques is that they utilize asymmetric parity algorithms that require different computational steps when encoding and/or decoding parity. Furthermore, asymmetric algorithms imply that each disk is not treated identically. As a result, configuration management tasks must know and identify whether a disk is of a particular type, e.g., whether a disk is a parity disk and/or a data disk. For example, a reconstruction technique may involve a plurality of differing algorithms depending on the number of failed data and/or parity disks as well as the type of failed parity disks, e.g., row parity, diagonal parity, etc. The asymmetric nature of these algorithms imposes additional computational complexity when implementing parity-based systems. This additional complexity may be especially noticeable when utilizing embedded systems to implement parity-based computations.
A further noted disadvantage of asymmetric parity algorithms is that they make floating parity infeasible, where floating parity denotes parity stored on any of the storage devices within a parity group instead of on one or more dedicated parity storage devices. This is because floating parity relies on a scheme where some blocks on the newly added disk(s) are re-assigned as parity, converting their old locations within the parity set to data. However, because of the special properties of some of the parity disks, e.g., diagonal/anti-diagonal, asymmetric algorithms cannot move/relocate parity blocks on these disks to newly added disks.