This invention relates generally to disk array architectures, and more particularly to systems, methods, and computer program products for providing enhanced tolerance of data loss in a disk array system.
Computer systems often require a considerable amount of nonvolatile disk storage to preserve software, programs and other data that cannot fit in smaller, more costly random access memory (RAM) and that otherwise would be lost when the system is powered off. Storage systems may include a large number of hard disk drives (HDDs). HDDs are typically constructed using one or more disk shaped platters coated with a magnetic material. The disk platters spin at fixed speeds and a movable arm with a read/write head is directed to specific locations on the disk to write or read data. The head glides just above the surface of the platter. During a data write operation, a magnetic field generated by the head is applied to a specific location on the disk, creating a substantially permanent magnetic field in a specific direction associated with a binary value of “0” or “1”. The head is designed to read stored data by sensing a small current induced in the head by the magnetic field when the head passes over the magnetized location on the platter. When the HDD is powered off, data is preserved as magnetic signatures for bits of information at specific locations on the disk.
HDD platters are partitioned into concentric circles called tracks that are coincident with areas over which the head glides when the arm assembly remains motionless. Each track is further partitioned into sectors. Each sector contains a larger fixed length area for user data, as well as header and trailer information used by the HDD electronics during the data storing and retrieval process. Data read and write times, called latency, are not as fixed and predictable on an HDD as compared to RAM. HDD latency, to a large extent, is a function of the seek time, i.e., the time it takes the arm to reposition the head over the track where the data is to be stored or retrieved. The seek time is variable and a function of the last position of the arm.
HDDs are typically designed as self-contained assemblies that can be plugged into a standard slot in a computer chassis or in a separate storage chassis. In an enterprise environment, a storage chassis has storage drawers that typically hold anywhere from a half dozen to as many as fifty or more individual HDDs. A storage chassis can be either a stand-alone assembly or a rack mountable unit to allow multiple storage drawers to be placed into a single rack, creating a relatively large array of HDDs in a small physical footprint. Drive density per unit area floor space is a competitive metric used in the industry to help potential customers compare offerings from different vendors.
HDDs are complex electromechanical subassemblies and as such are subject to a wide variety of failure mechanisms. Microscopic defects in the magnetic coating materials used on the platter, contamination of the platter with dust, dirt or magnetic particles and aging can all cause data loss. As with all electronics, random failures can occur from a wide variety of underlying physical processes or small defects associated with manufacturing processes. Moving parts are subject to friction and wear out over time, which can also cause HDD assemblies to fail.
HDD technologies have continued to evolve toward higher density and faster devices, with new and different disk designs being created at an accelerating rate. As HDD rotational speed continues to increase and as HDDs continue to be designed to hold increasing amounts of data, the physical area on a disk that holds the magnetic signature for each bit continues to become smaller, resulting in a greater engineering challenge to ensure reliable write and read operations. To reduce cost, there is now wider use of less expensive and, in some applications, less reliable advanced technology attachment (ATA) drives and serial ATA (SATA) drives.
Techniques used to detect and correct bit errors have evolved into an elaborate science over the past several decades. Perhaps the most basic detection technique is the generation of odd or even parity, where the bits in a data word are exclusive OR-ed (XOR-ed) together to produce a parity bit. For example, a data word with an even number of ones will have a parity bit of zero, and a data word with an odd number of ones will have a parity bit of one. A single error in the data word can be detected by comparing the calculated parity to the originally generated parity for the data word.
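The parity check described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `parity_bit` and `parity_matches` are hypothetical, and the data word is assumed to be a non-negative integer whose bits are XOR-ed together to form an even-parity bit.

```python
def parity_bit(word: int) -> int:
    """XOR all bits of a non-negative integer data word (even parity)."""
    bit = 0
    while word:
        bit ^= word & 1   # fold the lowest bit into the running XOR
        word >>= 1
    return bit


def parity_matches(word: int, stored_parity: int) -> bool:
    """Compare recalculated parity with the parity generated at write time."""
    return parity_bit(word) == stored_parity


word = 0b10110010                  # four ones, so the parity bit is 0
stored = parity_bit(word)

one_bit_error = word ^ 0b00001000  # a single flipped bit changes the parity
two_bit_error = word ^ 0b00000011  # two flipped bits cancel out

print(parity_matches(word, stored))           # data intact
print(parity_matches(one_bit_error, stored))  # single error is detected
print(parity_matches(two_bit_error, stored))  # double error goes unnoticed
```

The last case shows the limit of a single parity bit: any even number of bit errors leaves the parity unchanged, which is one motivation for the longer check fields discussed next.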
It has been recognized that the parity technique of error detection could be extended to not only detect errors, but also correct them, by appending an error correcting code (ECC) field to each data word. The ECC field may be a combination of different bits in a data word XOR-ed together so that errors (small changes to the data word) can be easily detected, pinpointed, and corrected. The number of errors that can be detected and corrected is directly related to the length of the ECC field appended to the data word. For ECC to function, a minimum separation distance between valid data words and code word combinations must be enforced. The greater the number of errors to be detected and corrected, the longer the code word, resulting in a greater distance between valid code words. The distance between valid code words is also known as the “Hamming distance”.
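As a rough illustration of how the distance between valid code words governs detection and correction, the sketch below computes the minimum Hamming distance of a small code and decodes a received word to its nearest valid code word. The 3-bit repetition code and the helper names are assumptions chosen for brevity; they are not taken from any particular ECC implementation. The relations used (a code with minimum distance d detects up to d - 1 errors and corrects up to (d - 1) // 2) are the standard ones from coding theory.

```python
from itertools import combinations


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two words differ."""
    return bin(a ^ b).count("1")


def minimum_distance(code_words) -> int:
    """Smallest pairwise distance between any two valid code words."""
    return min(hamming_distance(a, b) for a, b in combinations(code_words, 2))


def decode_nearest(received: int, code_words) -> int:
    """Correct a received word by picking the closest valid code word."""
    return min(code_words, key=lambda c: hamming_distance(received, c))


repetition_code = [0b000, 0b111]       # each data bit is stored three times
d = minimum_distance(repetition_code)  # distance between 000 and 111

detectable = d - 1                     # errors that can always be detected
correctable = (d - 1) // 2             # errors that can always be corrected

received = 0b010                       # 000 with one bit flipped
corrected = decode_nearest(received, repetition_code)
print(d, detectable, correctable, format(corrected, "03b"))
```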
Error detection and correction techniques are commonly used to restore data in storage media where there is a finite probability of data errors due to the physical characteristics of the storage media. Circuits used to store data as voltage levels representing a one or a zero in RAM are subject to both device failure and state changes due to high-energy cosmic rays and alpha particles. HDDs that store ones and zeros as magnetic signatures on a magnetic surface are also subject to imperfections in the magnetic media and other mechanisms that can cause changes in the data pattern from what was originally stored.
Memory ECC may use a combination of parity codes in various bit positions of a data word to allow detection and correction of errors. Every time a data word is written into memory, a new codeword is generated and stored with the data to support detection and correction.
Many error detection and correction techniques have been extended over the years to help ensure HDD failures do not cause data loss or data integrity issues. Embedded checking mechanisms, such as ECC, are often used on HDDs to detect bad sectors. Cyclic redundancy checks (CRCs) and longitudinal redundancy checks (LRCs) may be used by HDD electronics or a disk adapter to check for errors. Alternatively, higher levels of code and applications may use CRCs and LRCs to detect HDD errors. CRC and LRC values are written coincident with data to help detect data errors. CRCs and LRCs are hashing functions used to produce a small substantially unique bit pattern generated from the data. When the data is read from the HDD, the associated check value is regenerated and compared to the value stored on the platter. The signatures must match exactly to ensure that the data retrieved from the disk is the same as was originally written to the disk.
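The write-then-verify flow for check values can be sketched as follows. This is a minimal illustration under stated assumptions: the "sector" is modeled as a Python dictionary, the CRC is the 32-bit CRC provided by the standard `zlib.crc32` function, and the LRC is taken to be a byte-wise XOR of the data, which is one common form among several. The helper names `lrc`, `write_sector`, and `read_sector` are hypothetical.

```python
import zlib
from functools import reduce


def lrc(data: bytes) -> int:
    """Byte-wise XOR of the data (one simple form of LRC, assumed here)."""
    return reduce(lambda acc, byte: acc ^ byte, data, 0)


def write_sector(data: bytes) -> dict:
    """Store the check values alongside the data, as done at write time."""
    return {"data": data, "crc": zlib.crc32(data), "lrc": lrc(data)}


def read_sector(sector: dict) -> bytes:
    """Regenerate the check values on read and require an exact match."""
    data = sector["data"]
    if zlib.crc32(data) != sector["crc"] or lrc(data) != sector["lrc"]:
        raise IOError("check value mismatch: sector data is corrupt")
    return data


sector = write_sector(b"payload")
print(read_sector(sector))                 # intact data passes the check

damaged = dict(sector, data=b"paylaod")    # two bytes transposed on the media
try:
    read_sector(damaged)
except IOError as err:
    print(err)
```

The transposed bytes leave the XOR-based LRC unchanged but alter the CRC, which illustrates why stronger check functions are often layered on top of simple ones.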
Redundant array of independent disks (RAID) systems have been developed to improve performance and increase availability of disk storage systems. RAID distributes data across several independent HDDs. Many different RAID schemes have been developed with different associated characteristics. Performance, availability, and utilization/efficiency (the percentage of the disks that actually hold user data) are perhaps the most important characteristics to consider in comparing RAID schemes. The tradeoffs associated with various schemes have to be carefully considered, because an improvement in one characteristic can often result in a reduction in another.
RAID 5 is used widely today, achieving a balance between performance, availability and utilization. RAID 5 uses a single parity field that is calculated by XORing data elements across multiple HDDs in a stripe. A “stripe” refers to a complete and connected set of data and parity elements that are dependently related to the parity computation relations. In coding theory, the stripe is a code word or code instance. In the event of a single HDD failure, data on the remaining disks in the stripe are XOR-ed together to recreate the data from the failed disk. As with many other RAID schemes, RAID 5 has a performance advantage in that the data from all HDDs in a stripe do not have to be read to recalculate a new parity value for the stripe every time a write occurs. When writing small amounts of data, such as updating single data elements, a technique known as read-modify-write (RMW) is used whereby old data from a single HDD is read along with old parity from another HDD. The old data is XOR-ed with new data and the old parity to produce a new parity value, which is then written back to disk along with the new data. RMW can be a considerable performance improvement, especially with wide-width RAID 5 arrays. RAID 5 typically uses a distributed parity scheme whereby parity fields are substantially uniformly distributed across all the HDDs in the array to help balance read/write access to each HDD, ensuring more consistent performance.
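The parity calculation and the RMW shortcut described above can be sketched in Python. The sketch assumes each data element is a short `bytes` object of equal length and that a stripe is a simple list with one element per data HDD; the names `xor_bytes`, `full_stripe_parity`, and `rmw_parity` are hypothetical. It shows that the RMW result, which touches only the old data and old parity, equals the parity recomputed from the whole stripe.

```python
from functools import reduce


def xor_bytes(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def full_stripe_parity(data_elements) -> bytes:
    """Parity computed by reading every data element in the stripe."""
    return xor_bytes(*data_elements)


def rmw_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """Read-modify-write: new parity from old data, new data, and old parity."""
    return xor_bytes(old_data, new_data, old_parity)


stripe = [b"\x11\x22", b"\x33\x44", b"\x55\x66", b"\x77\x88"]
parity = full_stripe_parity(stripe)

# Update a single element using only two reads (old data, old parity).
new_element = b"\xaa\xbb"
new_parity = rmw_parity(stripe[2], new_element, parity)
stripe[2] = new_element

# The shortcut agrees with recomputing parity across the full stripe.
print(new_parity == full_stripe_parity(stripe))
```

With a wide stripe, the RMW path reads two elements regardless of stripe width, whereas a full recomputation reads every data element in the stripe.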
A RAID 5 array can continue to operate after a single HDD has failed in the array. Data from the failed disk can be regenerated by XOR-ing data from the remaining disks in the data stripe with the parity field. When the failed HDD is replaced or if there is a spare HDD in a RAID 5 array, the data from the failed HDD can be completely recreated and rewritten to the new disk using the same XOR process. Systems are often designed such that failed HDDs can be replaced concurrently with normal system operation. Data on a replacement HDD is rebuilt in a process that can take several hours to complete. RAID 5 can only tolerate a single HDD failure, as there is no way to reconstruct the data when two HDDs fail in the same data stripe. If a second HDD in the RAID 5 stripe fails before the first failed HDD is replaced and rebuilt, all the data associated with the RAID 5 stripe will be lost. The probability of encountering a second HDD failure is directly related to how quickly the failed HDD is replaced or spared out and the data reconstructed and written to the replacement/spare HDD.
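The single-failure recovery described above, and its limit, can be illustrated with the following sketch. It assumes a stripe is a list of equal-length `bytes` elements with the parity element stored last, and that an unreadable element is represented by `None`; the function names are hypothetical.

```python
from functools import reduce


def xor_bytes(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_stripe(stripe):
    """Regenerate at most one missing element (None) from the survivors."""
    missing = [i for i, element in enumerate(stripe) if element is None]
    if len(missing) > 1:
        raise ValueError("RAID 5 cannot recover two lost elements in a stripe")
    rebuilt = list(stripe)
    if missing:
        survivors = [element for element in stripe if element is not None]
        rebuilt[missing[0]] = xor_bytes(*survivors)
    return rebuilt


data = [b"\x01\x02", b"\x0f\xf0", b"\xaa\x55"]
stripe = data + [xor_bytes(*data)]        # parity element stored last

one_failure = list(stripe)
one_failure[1] = None                     # a single HDD fails
print(rebuild_stripe(one_failure) == stripe)

two_failures = list(one_failure)
two_failures[3] = None                    # a second HDD fails before rebuild
try:
    rebuild_stripe(two_failures)
except ValueError as err:
    print(err)
```

The same XOR is used whether the lost element held data or parity, which is why a spare HDD can be rebuilt with the normal RAID 5 algorithm.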
RAID 6 is an extension to RAID 5 where a second independent checksum field is introduced so that two HDD failures can be tolerated. RAID 6 is commonly implemented with dual checksum fields for each stripe or row of data. In RAID 6, the second independent checksum field is typically created using Reed-Solomon codes, which is a more complex operation than the simple RAID 5 XOR of the data elements and thus may be more difficult to implement, requiring additional computational resources.
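To give a sense of the extra computation involved, the sketch below shows one widely used Reed-Solomon style construction for dual checksums, often called P+Q. Everything specific here is an assumption made for illustration rather than something stated above: the field GF(2^8) with reduction polynomial 0x11D, the generator 2, the per-disk coefficients 2^i, and all function names. P is the ordinary XOR parity; Q weights each data disk by a distinct field coefficient so that two lost data elements can be solved for.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11D)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def gf_pow(a: int, n: int) -> int:
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def gf_inv(a: int) -> int:
    """Inverse of a nonzero element, using a^255 == 1 in GF(2^8)."""
    return gf_pow(a, 254)


def pq_checksums(data):
    """Return (P, Q) for a list of equal-length blocks, one per data disk."""
    p = bytearray(len(data[0]))
    q = bytearray(len(data[0]))
    for i, block in enumerate(data):
        coeff = gf_pow(2, i)              # distinct coefficient per disk
        for k, byte in enumerate(block):
            p[k] ^= byte                  # plain XOR parity, as in RAID 5
            q[k] ^= gf_mul(coeff, byte)   # weighted second checksum
    return bytes(p), bytes(q)


def recover_two_data(data, x, y, p, q):
    """Solve for lost data blocks x and y (x != y) from survivors, P, and Q."""
    zero = bytes(len(p))
    survivors = [zero if i in (x, y) else b for i, b in enumerate(data)]
    p_part, q_part = pq_checksums(survivors)
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    denom_inv = gf_inv(gx ^ gy)
    dx, dy = bytearray(len(p)), bytearray(len(p))
    for k in range(len(p)):
        a = p[k] ^ p_part[k]              # equals Dx ^ Dy
        b = q[k] ^ q_part[k]              # equals gx*Dx ^ gy*Dy
        dx[k] = gf_mul(gf_mul(gy, a) ^ b, denom_inv)
        dy[k] = a ^ dx[k]
    return bytes(dx), bytes(dy)


data = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xff\x00\x7f", b"\xab\xcd\xef"]
p, q = pq_checksums(data)
damaged = [None, data[1], None, data[3]]  # two data HDDs lost
print(recover_two_data(damaged, 0, 2, p, q) == (data[0], data[2]))
```

Compared with the single XOR pass of RAID 5, every byte now requires a finite-field multiplication for Q, which is the additional computational cost the paragraph above refers to. Hardware and optimized software implementations typically replace the bit-by-bit multiply shown here with lookup tables or vector instructions.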
An “array” typically refers to a collection of HDDs on which one or more instances of a RAID error correction code is implemented. Reed-Solomon codes can correct for erasures when the sources of the failures can be isolated through some independent means. This is often referred to as data erasure correction. Reed-Solomon codes also have the ability to pinpoint and correct a failure; however, the effectiveness of correction is cut in half when the failure cannot be pinpointed by some independent means. For example, RAID 6 can be used to correct up to two erasures when the failures are isolated through some independent means, or the RAID 6 code in and of itself can be used to pinpoint and correct a single failure. An “element” typically refers to a fundamental unit of data or parity, the building block of the error correction codes. In coding theory, an element or “symbol” may be composed of a fixed number of bits, bytes or blocks often stored as contiguous sequential sectors on an HDD. A “strip” typically refers to a collection of contiguous elements on a single HDD. A set of strips in a codeword forms a stripe. A strip may contain data elements, parity elements or both from the same disk and stripe. In coding theory, a strip is associated with a code word and is sometimes called a stripe unit. It is common for strips to contain the same number of elements. In some cases, stripes may be grouped together to form a higher level construct known as a “stride”.
The availability of a RAID array is often characterized by its Hamming distance. For example, RAID 5 has a Hamming distance of two. RAID 5 can tolerate a single HDD failure, but cannot tolerate two or more HDD failures. RAID 6 has a Hamming distance of three since it can tolerate up to two HDD failures and still continue to operate. Often, improvements in one performance attribute result in degradation of other attributes. For example, with all else being equal, RAID 6 may have lower performance than RAID 5, because the second checksum field may be updated on every write. RAID 6 may also be less efficient than RAID 5 due to the additional overhead of the second checksum field. RAID 5 adds the equivalent of one HDD to the array to hold the checksum field. In other words, for RAID 5 to store the equivalent of N data disks, N+1 physical disks are required. RAID 6 adds the equivalent of two HDDs to the array to hold two checksum fields. RAID 6 requires N+2 physical disks to hold the equivalent of N data disks.
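The tradeoff between availability and efficiency can be made concrete with a little arithmetic. The sketch below assumes an array built from eight physical HDDs purely as an example; the function names are hypothetical. It uses the relations stated above: an array with Hamming distance d tolerates d - 1 HDD failures, and utilization is N divided by the number of physical disks.

```python
def tolerated_failures(hamming_distance: int) -> int:
    """HDD failures an array survives, given its Hamming distance."""
    return hamming_distance - 1


def utilization(data_disks: int, checksum_disks: int) -> float:
    """Fraction of physical capacity that holds user data."""
    return data_disks / (data_disks + checksum_disks)


PHYSICAL_DISKS = 8  # example array size, chosen only for illustration

schemes = {
    "RAID 5": {"hamming_distance": 2, "checksum_disks": 1},
    "RAID 6": {"hamming_distance": 3, "checksum_disks": 2},
}

for name, scheme in schemes.items():
    n = PHYSICAL_DISKS - scheme["checksum_disks"]   # equivalent data disks N
    print(
        name,
        "tolerates", tolerated_failures(scheme["hamming_distance"]),
        "failure(s), utilization",
        f"{utilization(n, scheme['checksum_disks']):.1%}",
    )
```

For the same eight physical disks, the second checksum field buys one more tolerated failure at the cost of one disk's worth of user capacity.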
A problem that can occur on disks is known as a “strip kill”, where a strip of data on the disks can no longer be read. A strip kill causes data loss to a small portion of data on the disks. With RAID 5, the data lost in a strip kill may be corrected by using the normal RAID XOR algorithm. Strip kills, although rare, can occur during a rebuild operation of a failed HDD. A strip kill may occur during a rebuild operation, because all the data on all the disks in the array must be read and XOR-ed together to reconstruct the data on the failed disk. If a strip kill is encountered during a RAID 5 rebuild, the rebuild cannot complete, and the data on the RAID 5 array is lost. A similar problem arises in RAID 6 if a rebuild of two HDDs is in process and a strip kill is encountered. However, if a rebuild of a single failed HDD is in process on a RAID 6 array and a strip kill is encountered, it is possible to recover in a manner similar to a two-HDD recovery for RAID 6.
Systems have been contemplated where parity is calculated horizontally and diagonally across strips of data in a single row of disks. Other systems have been contemplated that use horizontal and vertical parity, but are limited to square N×N implementations, where horizontal parity is calculated across a row of N disks and vertical parity is calculated across N strips of the row of N disks. Furthermore, such systems do not distribute parity and data elements across independent physical disks, limiting the failure recovery capability to a maximum of two HDDs. Previously contemplated RAID systems often included circular dependencies or other interdependencies that prevented data reconstruction of certain blocks after two HDD failures. Systems using diagonal parity also suffer from sizing constraints in that the number of columns cannot be greater than the number of rows when the diagonal parity is stored with each row.
While RAID 6 provides improved availability over RAID 5, both approaches break down when failures occur in multiple HDDs or in data elements aligned within an HDD row. For RAID 5, failures in a row alignment of two HDDs or two data element combinations in a stripe result in a system failure. For RAID 6, failures in a row alignment of three HDDs or three data element combinations in the stripe result in a system failure. HDD failures are often modeled as independent random events; however, disk systems have been known to exhibit a cluster failure where a common problem source can take out multiple HDDs in a row. Both RAID 5 and RAID 6 are susceptible to cluster failures. Additionally, the higher availability of RAID 6 over RAID 5 typically requires more complex and costly hardware to implement Reed-Solomon coding in the second checksum. Accordingly, there is a need in the art for providing enhanced tolerance of data loss in a disk array system.