This invention relates in general to computer mass storage systems and, more particularly, to prioritizing data rebuild in the event of a disk failure in a hierarchical system utilizing a Redundant Array of Independent Disks (RAID).
Conventional disk array data storage systems have multiple disk storage devices that are arranged and coordinated to form a single mass storage system. A Redundant Array of Independent Disks (RAID) system is an organization of data in an array of mass data storage devices, such as hard disk drives, to achieve varying levels of data availability and system performance. Data availability refers to the ability of the RAID system to read and write data in the array of data storage devices even in the event of a failure of one or more of the individual data storage devices or components in the array. A measurement of system performance is the rate at which data can be sent to or received from the RAID system.
Data availability is often provided through the use of redundancy schemes where data, or relationships among data, are stored in multiple locations on the storage system. In the event of a disk or component failure, redundant data is retrieved from the operable portion of the system and used to regenerate the original data that is lost due to the failure. There are two common methods for storing redundant data: mirror and parity. In mirror redundancy, data is duplicated and stored in two or more separate areas of the storage system. In parity redundancy, redundant data is stored in one or more areas of the storage system, but the size of the redundant storage area is less than the storage space used to store the original data.
RAID systems typically designate part of the physical storage capacity in the array to store redundant data, either mirror or parity. The redundant information enables regeneration of user data in the event that one or more of the array's member disks, components, or the access paths to the disk(s) fail. Typically, the disks are divided into equally sized address areas referred to as “blocks.” A set of blocks that has the same unit address ranges from each disk is referred to as a “stripe” or “stripe set.” A set (or subset) of disks in the array over which a stripe or stripe set spans is referred to as a redundancy group. Traditionally, RAID arrays employ one or more redundancy groups and a single redundancy scheme for each redundancy group, although the schemes may vary among the redundancy groups. However, as will be discussed subsequently herein, hierarchical RAID arrays employ one or more redundancy schemes (i.e., RAID levels) for each redundancy group in an array.
From a data management and data redundancy perspective, RAID levels are typically characterized as one of six architectures, or redundancy schemes, enumerated as RAID levels 1-6. Although other RAID levels exist, levels 1-6 are the most commonly used and will be discussed herein with respect to the present invention. However, it should be noted that the present invention is applicable to any RAID level or data redundancy scheme.
The use of disk mirroring is referred to as RAID Level 1, where original data is stored on one set of disks and a duplicate copy of the data is kept on separate disks. The use of parity checking is referred to as RAID Levels 2, 3, 4, 5, and 6. In general, although RAID 1 provides higher data reliability and may provide the best small-write input/output (I/O) performance over RAID Levels 2, 3, 4 and 5, it uses the most storage space because all data is duplicated. In contrast, RAID Levels 2-5 provide a lesser amount of data reliability (relative to RAID 1) and, typically, reduced small-write performance. However, they don't consume as much disk space as a RAID 1 technique because data is not duplicated but rather interleaved and parity checked across the disk array in a stripe set. A parity stripe set interleaves data and redundant (parity) data on multiple member disks. The parity stripe set presents a single virtual disk whose user data capacity is approximately the sum of the capacities of its members, less the storage used for holding the parity (redundant) data of the user data. For RAID levels 3-5, parity is commonly calculated using a bit by bit Exclusive OR function of corresponding data chunks in a stripe set from all of the data disks. This corresponds to a one equation, one unknown, sum of products calculation. The mirror set in a RAID 1 architecture presents a single virtual disk whose user data capacity is the sum of the capacity of one-half of its members, the other half holding the mirrored (redundant) data of the user data.
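As a minimal illustrative sketch of the Exclusive OR parity calculation described above (not part of the original disclosure), the following Python fragment computes a parity chunk for one stripe and regenerates a single lost data chunk from the survivors. The function names xor_blocks and regenerate, and the sample byte values, are hypothetical and chosen only for illustration.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Return the bit-by-bit XOR of equally sized byte chunks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def regenerate(surviving_chunks, parity):
    """Rebuild the single missing data chunk of a stripe.

    XOR-ing the parity with every surviving data chunk cancels the
    survivors and leaves the missing chunk (one equation, one unknown).
    """
    return xor_blocks(list(surviving_chunks) + [parity])


# Hypothetical stripe with three data chunks on three data disks.
data = [b"\x0f\xa0", b"\x33\x11", b"\xc4\x5e"]
parity = xor_blocks(data)

# Simulate losing the second data disk and rebuilding its chunk.
rebuilt = regenerate([data[0], data[2]], parity)
```

Because the same XOR operation serves for both generating parity and regenerating data, a single parity chunk can recover exactly one missing chunk per stripe.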
In addition to RAID mirror level 1, RAID parity levels 4, 5 and 6 are of particular interest for the present discussion. Specifically, for example, a RAID 4 uses a stripe set or redundancy group and a single dedicated parity disk to store redundant information about the data existing on the other data disks in the stripe set or redundancy group. Segments of data from each virtual disk sector are distributed across corresponding sectors of all but one of the stripe set members (i.e., the parity disk), and the parity of the distributed segments is written in the corresponding sector of the parity disk.
Because a RAID 4 system stores all parity blocks on a single unit in the stripe set, the single unit containing the parity blocks is accessed disproportionately relative to the other data storage devices in the stripe set or redundancy group. To eliminate the resulting constriction of data flow in a RAID 4 system, a RAID 5 architecture distributes the parity blocks across all of the data storage devices in the stripe set or redundancy group. Typically in a RAID 5 system, a set of N+1 data storage devices forms the stripe set or redundancy group. Each stripe has N blocks of data and one block of parity data. The block of parity data is stored in one of the N+1 data storage devices. The parity blocks corresponding to the remaining stripes of the stripe set or redundancy group are stored across the data storage devices within the stripe set or redundancy group. For example, in a RAID 5 system using five data storage devices in a given stripe set or redundancy group, the parity block for the first stripe of blocks may be written to the fifth device; the parity block for the second stripe of blocks may be written to the fourth device; the parity block for the third stripe of blocks may be written to the third device; etc. Typically, the location of the parity block for succeeding blocks shifts to the succeeding logical device in the stripe set or redundancy group, although other patterns may be used.
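The five-device example above can be sketched as follows. This is only one possible rotation, matching the example in which parity moves from the fifth device to the fourth and then the third; the function names and the 0-based device numbering are assumptions made for illustration, and, as noted, other patterns may be used.

```python
def raid5_parity_device(stripe, num_devices):
    """Return the 0-based index of the device holding parity for a stripe.

    Stripe 0 places parity on the last device; each later stripe moves
    parity one device toward the first, wrapping around.
    """
    return (num_devices - 1 - stripe) % num_devices


def raid5_stripe_layout(stripe, num_devices):
    """Return a list marking each device as data ("D") or parity ("P")."""
    parity = raid5_parity_device(stripe, num_devices)
    return ["P" if device == parity else "D" for device in range(num_devices)]


# Layouts of the first three stripes on five devices (N = 4 data blocks).
layouts = [raid5_stripe_layout(stripe, 5) for stripe in range(3)]
```

With five devices, stripes 0, 1 and 2 place parity on the fifth, fourth and third devices respectively, so that no single device carries all of the parity traffic.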
A RAID 6 architecture is similar to RAID 4 and 5 in that data is striped, but is dissimilar in that it utilizes two independent and distinct parity values for the original data, referred to here as P and Q. The P parity is commonly calculated using a bit by bit Exclusive OR function of corresponding data chunks in a stripe from all of the data disks. This corresponds to a one equation, one unknown, sum of products calculation. On the other hand, the Q parity is calculated so as to be linearly independent of P, using a different algorithm for the sum of products calculation. As a result, each parity value is calculated using an independent algorithm and each is stored on a separate disk in the stripe set or redundancy group. Consequently, a RAID 6 system can rebuild data (assuming rebuild space is available) even in the event of a failure of two separate disks within a stripe set or redundancy group, whereas a RAID 5 system can rebuild data only in the event of no more than a single disk failure within a stripe set or redundancy group.
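The text above does not specify the algorithm used for Q. Purely as an illustration, the following sketch swaps in one common choice, a Reed-Solomon style code over the finite field GF(2^8) with reduction polynomial 0x11D and generator 2, to show how two linearly independent parity values allow two lost data chunks to be regenerated. All function names and sample values are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) using reduction polynomial 0x11D."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def gf_pow(a, n):
    """Raise a to the n-th power in GF(2^8) by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8): a^254, since a^255 == 1."""
    return gf_pow(a, 254)


def compute_pq(data):
    """Return (P, Q): P is the plain XOR, Q weights chunk i by 2^i."""
    p = bytearray(len(data[0]))
    q = bytearray(len(data[0]))
    for i, chunk in enumerate(data):
        coeff = gf_pow(2, i)
        for j, byte in enumerate(chunk):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)


def recover_two(data, x, y, p, q):
    """Regenerate data chunks x and y (given as None in data) from P and Q."""
    # Treat the missing chunks as zeros to get the survivors' contribution.
    survivors = [c if c is not None else bytes(len(p)) for c in data]
    p_part, q_part = compute_pq(survivors)
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    denom_inv = gf_inv(gx ^ gy)
    dx, dy = bytearray(len(p)), bytearray(len(p))
    for j in range(len(p)):
        p_xy = p[j] ^ p_part[j]  # equals Dx ^ Dy
        q_xy = q[j] ^ q_part[j]  # equals gx*Dx ^ gy*Dy
        dx[j] = gf_mul(q_xy ^ gf_mul(gy, p_xy), denom_inv)
        dy[j] = p_xy ^ dx[j]
    return bytes(dx), bytes(dy)


# Hypothetical stripe of three data chunks; lose the first and third.
data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
p, q = compute_pq(data)
recovered = recover_two([None, data[1], None], 0, 2, p, q)
```

The two parity values give two equations in two unknowns, which is why this scheme tolerates two simultaneous failures where a single XOR parity tolerates only one.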
Similar to RAID 5, a RAID 6 architecture distributes the two parity blocks across all of the data storage devices in the stripe set or redundancy group. Thus, in a stripe set or redundancy group of N+2 data storage devices, each stripe has N blocks of data and two blocks of independent parity data. One of the blocks of parity data is stored in one of the N+2 data storage devices, and the other of the blocks of parity data is stored in another of the N+2 data storage devices. The parity blocks corresponding to the remaining stripes of the stripe set or redundancy group are stored across the data storage devices within the stripe set or redundancy group. For example, in a RAID 6 system using five data storage devices in a given stripe set or redundancy group, the parity blocks for the first stripe of blocks may be written to the fourth and fifth devices; the parity blocks for the second stripe of blocks may be written to the third and fourth devices; the parity blocks for the third stripe of blocks may be written to the second and third devices; etc. Typically, again, the location of the parity blocks for succeeding blocks shifts to the succeeding logical device in the stripe set or redundancy group, although other patterns may be used.
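The placement in the five-device RAID 6 example above can be sketched in the same manner. The function name and 0-based device numbering are assumptions for illustration only, and the text does not state which of the two positions holds P and which holds Q, so the sketch simply returns both positions.

```python
def raid6_parity_devices(stripe, num_devices):
    """Return the 0-based indices of the two parity devices for a stripe.

    Stripe 0 uses the last two devices; each later stripe shifts both
    parity positions one device toward the first, wrapping around.
    """
    second = (num_devices - 1 - stripe) % num_devices
    first = (second - 1) % num_devices
    return first, second


# Parity positions for the first three stripes on five devices (N = 3).
positions = [raid6_parity_devices(stripe, 5) for stripe in range(3)]
```

For five devices this yields the fourth and fifth devices for the first stripe, the third and fourth for the second, and the second and third for the third.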
More information detailing the architecture and performance of RAID systems can be found in The RAID Book: A Source Book for RAID Technology, by the RAID Advisory Board, published Jun. 9, 1993, the disclosure of which is incorporated in full herein by reference. Additionally, a background discussion of RAID systems, and various ways to logically partition RAID systems, is found in U.S. Pat. No. 5,519,844 to David C. Stallmo, entitled “Logical Partitioning of a Redundant Array Storage System,” incorporated in full herein by reference.
A hierarchical data storage system permits data to be stored according to one or more different techniques, such as different redundancy schemes, RAID levels, redundancy groups, or any combination of these. For example, in a hierarchical RAID system, data can be stored in one or more redundancy groups, and for each redundancy group, according to one or more RAID architectures (levels) to afford tradeoffs between the advantages and disadvantages of the redundancy techniques. For purposes of this disclosure, a hierarchical system includes: (1) one that automatically or dynamically migrates data between redundancy schemes (i.e., different RAID levels) and/or redundancy groups for optimum system tuning and performance, and (2) one that requires user input to configure redundancy schemes or groups, such as one that requires a user to designate a given RAID level with an identified set or subset of disks in a system array.
U.S. Pat. No. 5,392,244 to Jacobson et al., entitled “Memory Systems with Data Storage Redundancy Management”, incorporated in full herein by reference, describes a hierarchical RAID system that enables data to be migrated from one RAID type to another RAID type as data storage conditions and space demands change. This patent describes a multi-level RAID architecture in which physical storage space is mapped into a RAID-level virtual storage space having mirror and parity RAID areas (e.g., RAID 1 and RAID 5). The RAID-level virtual storage space is then mapped into an application-level virtual storage space, which presents the storage space to the user as one large contiguously addressable space. During operation, as user storage demands change at the application-level virtual space, data can be migrated between the mirror and parity RAID areas at the RAID-level virtual space to accommodate the changes. For instance, data once stored according to mirror redundancy may be shifted and stored using parity redundancy, or vice versa.
With data migration, the administrator is afforded tremendous flexibility in defining operating conditions and establishing logical storage units (or LUNs). As one example, this type of RAID system can initially store user data according to the optimum performing RAID 1 configuration. As the user data approaches and exceeds 50% of a stripe set or redundancy group array capacity, the system can then begin storing data according to both RAID 1 and RAID 5, and dynamically migrating data between RAID 1 and RAID 5 in a continuous manner as storage demands change. At any one time during operation, the data might be stored as RAID 1 or RAID 5 or both on all of the disks. The mix of RAID 1 and RAID 5 storage changes dynamically with the data I/O and storage capacity. This allows the system to dynamically optimize performance and available capacity versus an increasing amount of user data.
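The 50% threshold mentioned above follows from simple capacity arithmetic, which the following sketch illustrates under a deliberately simplified model that is an assumption of this example and not a description of any particular product: RAID 1 consumes two units of raw capacity per unit of user data, RAID 5 stripes span all N disks and consume N/(N-1) units per unit of user data, and no space is reserved for metadata or rebuild. The function name is hypothetical.

```python
from fractions import Fraction


def max_mirrored_data(raw_capacity, user_data, num_disks):
    """Return the largest amount of user data that can remain in RAID 1.

    Assumed model: RAID 1 costs 2 raw units per user unit and RAID 5
    costs N/(N-1) raw units per user unit, with no other overhead.
    """
    if num_disks < 3:
        raise ValueError("the RAID 5 portion needs at least three disks")
    raw = Fraction(raw_capacity)
    user = Fraction(user_data)
    parity_cost = Fraction(num_disks, num_disks - 1)
    if user * parity_cost > raw:
        raise ValueError("user data does not fit even entirely as RAID 5")
    if 2 * user <= raw:
        return user  # everything still fits as mirrored data
    # Solve 2*m + (user - m)*parity_cost == raw for the mirrored amount m.
    return (raw - user * parity_cost) / (2 - parity_cost)


# Five disks with 100 units of raw capacity in total.
below_half = max_mirrored_data(100, 40, 5)
above_half = max_mirrored_data(100, 60, 5)
```

Under this model, 40 units of user data can be held entirely as RAID 1, whereas at 60 units only one third of the raw capacity's worth (100/3 units) can remain mirrored and the remainder must be held as RAID 5.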
Clearly, each RAID level has characteristic cost-performance and cost-capacity ratios. Importantly, however, RAID systems maintain and manage redundant data to enable a recovery of the data in the event of a storage disk or component failure. In this regard, each RAID level also has a characteristic availability that determines the mean time to data loss as a function of the number of disks employed, i.e., different RAID levels provide different degrees of data protection. In the event of a disk or component failure, redundant data is retrieved from the operable portion of the system and used to regenerate or rebuild the original data that is lost due to the component or disk failure. Specifically, when a disk in a RAID redundancy group fails, the array attempts to rebuild data on the surviving disks of the redundancy group (assuming space is available) in such a way that after the rebuild is finished, the redundancy group can once again withstand a disk failure without data loss. Depending upon system design, the rebuild may be automated or may require user input. Design factors that affect rebuild include, for example, whether a spare disk is specified in the array, or whether the failed disk must be manually replaced by a user.
After detecting a disk or component failure and during a rebuild of data, regardless of rebuild design, the system remains subject to yet further disk or component failures in the same stripe set or redundancy group before the rebuild is complete. In any RAID system, this is significant because the vulnerability of data loss is dependent upon the RAID architecture (redundancy scheme) employed for that data.
For example, consider a hierarchical RAID array that uses ten disks in a single redundancy group with RAID 1 and RAID 5 storage schemes, where each RAID level may be employed separately or jointly on any one or more of the disks. The RAID 5 storage includes a stripe of data of a single block size on each disk. Each of nine disks holds actual user data in its respective block of the stripe. The tenth disk holds a block of data in the stripe containing redundant (parity) information. If a disk fails, the data on the failed disk can be reconstructed from the data on the remaining nine disks. The original and reconstructed data can then be re-written across the nine remaining good disks in a new stripe (with one of those disks being designated as the parity disk for the new stripe). However, if any of the nine remaining disks fail before all of the RAID 5 data is rebuilt and re-written, then data will be lost.
For the RAID 1 storage of this example, data stored on a first of the 10 disks is mirrored only on a second disk in the array. If the first disk fails, then during rebuild from the second mirror disk, data will only be lost if the second mirror disk fails. Other disks could fail and the array would not lose the RAID 1 data, but RAID 5 data would be lost.
Therefore, in this example of RAID 1 and RAID 5 hierarchical storage, in the event of a disk failure and during a rebuild, the RAID 5 storage is more vulnerable, i.e., has a greater probability of data loss, than the RAID 1 storage. However, conventional hierarchical RAID systems rebuild data irrespective of the vulnerability of the RAID levels employed.
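The difference in vulnerability in the ten-disk example can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that each surviving disk fails independently during the rebuild window with the same probability; the function name and the value 0.01 are hypothetical. After one disk fails, the RAID 5 data is lost if any of the nine surviving disks fails, whereas the RAID 1 data is lost only if the one mirror disk fails.

```python
def loss_probability(critical_disks, p_disk_failure):
    """Probability that at least one critical disk fails during rebuild.

    critical_disks is the number of surviving disks whose failure would
    cause data loss; failures are assumed to be independent.
    """
    return 1.0 - (1.0 - p_disk_failure) ** critical_disks


p = 0.01  # assumed per-disk failure probability during the rebuild window
raid5_risk = loss_probability(9, p)  # any of nine survivors loses data
raid1_risk = loss_probability(1, p)  # only the mirror disk loses data
```

For small per-disk probabilities the RAID 5 risk in this example is close to nine times the RAID 1 risk, which is the asymmetry that a rebuild ordering can take into account.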
Accordingly, to minimize the probability of data loss during a rebuild in a hierarchical RAID system, there is a need to manage data recovery and rebuild that accounts for data availability characteristics of the hierarchical RAID levels employed.
According to principles of the present invention in a preferred embodiment, in a hierarchical data storage system employing data redundancy schemes, such as a RAID system, a method of managing data in response to a disk failure in the storage system includes prioritizing a data rebuild based on a most vulnerable data redundancy scheme identified in the storage system. Prioritizing the data rebuild includes enabling a rebuild of the most vulnerable data redundancy scheme prior to enabling a rebuild of any other data redundancy scheme in the system. The most vulnerable data redundancy scheme is determined by comparing a probability of losing data that can be prevented by a rebuild for each data redundancy scheme in the system with respect to the potential for one or more next disk failures in the data storage system. The probability of losing data for each data redundancy scheme is determined by considering characteristics associated with the storage system and storage devices in the array, such as number of storage devices, number of device failures, mean time between failure, mean time or calculated time to rebuild, and failure dependencies.
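One way the prioritization described above might be sketched is shown below. The sketch is a simplified assumption and not the claimed method: it models disk failures as independent with a constant rate of 1/MTBF (an exponential model), uses only the number of critical disks, the mean time between failure and the time to rebuild, and ignores failure dependencies. The class SchemeState, its fields, and the sample figures are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class SchemeState:
    name: str             # label for the redundancy scheme, e.g. "RAID 5"
    critical_disks: int   # surviving disks whose failure would lose data
    rebuild_hours: float  # estimated time to rebuild this scheme's data


def loss_probability(state, mtbf_hours):
    """Probability that a critical disk fails before the rebuild finishes."""
    failure_rate = state.critical_disks / mtbf_hours
    return 1.0 - math.exp(-failure_rate * state.rebuild_hours)


def rebuild_order(states, mtbf_hours):
    """Return the schemes sorted so that the most vulnerable is rebuilt first."""
    return sorted(
        states,
        key=lambda state: loss_probability(state, mtbf_hours),
        reverse=True,
    )


# Hypothetical state after one disk fails in a ten-disk redundancy group.
schemes = [
    SchemeState("RAID 1", critical_disks=1, rebuild_hours=2.0),
    SchemeState("RAID 5", critical_disks=9, rebuild_hours=6.0),
]
order = [state.name for state in rebuild_order(schemes, mtbf_hours=500_000.0)]
```

With these assumed figures the RAID 5 data is scheduled for rebuild ahead of the RAID 1 data. A fuller model could add the other listed characteristics, such as the number of failures already sustained and failure dependencies.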
The present invention further includes a data storage system and apparatus embodying the rebuild prioritization method described.
Other objects, advantages, and capabilities of the present invention will become more apparent as the description proceeds.