1. Field of the Invention
The present invention relates to storage systems. In particular, the present invention relates to a system and a method that utilize a modified parity check matrix for increasing the number of storage-unit failures that an array of storage units can tolerate without loss of data stored on the array.
2. Description of the Related Art
Owing to their increased storage capacity, Hard Disk Drives (HDDs) and HDD-based storage systems are being used for storing large quantities of data, such as reference data and backup data, in a rare-write, infrequent-read-access (near-line storage) configuration. Another exemplary application storing a large amount of data is the Picture Archive and Communication System (PACS) with which about 6600 hospitals in the U.S. yearly generate nearly 1 PB of medical imaging data. Yet another exemplary application storing a large amount of data is an e-mail system, such as the Microsoft Hotmail e-mail system, which is purportedly approaching a PB in size. Accordingly, the increased storage capacity places stringent failure tolerance requirements on such HDD-based storage systems.
A known system uses low-cost HDDs that are configured in a RAID 5 blade as the basic building block. Multiple blades are then configured as a further array, such as in a RAID 1 or RAID 5 configuration, for enhancing the failure tolerance. Such an arrangement has the appearance of being a product of two parity codes, yet implementation as a nested array significantly reduces failure tolerance.
For example, FIG. 1 shows an exemplary array 100 of fifteen HDDs configured as three blades. Each blade contains five HDDs in a 4+P RAID 5 configuration. The blades are further configured as a 2+P RAID 5. Accordingly, blade 101 includes HDDs D11, D12, D13, D14 and P15, in which HDDs D11, D12, D13 and D14 store data and HDD P15 stores parity information for blade 101. Blade 102 includes HDDs D21, D22, D23, D24 and P25, in which HDDs D21, D22, D23, D24 store data and HDD P25 stores parity information for blade 102. Blade 103 includes HDDs P31, P32, P33, P34 and P35, in which HDDs P31, P32, P33, and P34 respectively store parity information for columns 111–114 and HDD P35 stores parity information for blade 103 (and for column 115). As indicated in FIGS. 1–5, 6 and 8, the first digit of an HDD designator represents the blade or row number of the HDD and the second digit represents the column number of the HDD.
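The parity layout of FIG. 1 can be illustrated with a minimal sketch, assuming that each HDD holds a single integer block and that parity is a bytewise XOR. The function name `encode` and the sample block values are hypothetical and are used only for illustration.

```python
from functools import reduce
from operator import xor

def encode(data):
    """Build the 3 x 5 array of FIG. 1 from a 2 x 4 grid of data blocks.

    data[0] holds D11..D14 and data[1] holds D21..D24 (illustrative integers).
    """
    # Append the row parity (P15, P25) to each data blade.
    grid = [row[:] + [reduce(xor, row)] for row in data]
    # Blade 103 holds the column parities P31..P35.
    grid.append([reduce(xor, col) for col in zip(*grid)])
    return grid

grid = encode([[0x11, 0x12, 0x13, 0x14],
               [0x21, 0x22, 0x23, 0x24]])

# Every blade (row) and every column XORs to zero, so P35 serves both
# as the parity of blade 103 and as the parity of column 115.
rows_ok = all(reduce(xor, row) == 0 for row in grid)
cols_ok = all(reduce(xor, col) == 0 for col in zip(*grid))
```

In this sketch, P35 is computed as a column parity, and the row check on blade 103 confirms that the same value is also the row parity of P31 through P34.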
The general arrangement of FIG. 1 is commonly referred to as a product code because it is the product of two parity codes. The minimum distance of a product code is the product of the individual distances. Each single-parity code has distance 2, so the distance in this case is 2 × 2 = 4. In a product code, many reconstructions must be performed iteratively.
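The distance of 4 can be checked by brute force, as a sketch, by treating each HDD as a single bit and enumerating all nonzero data words of the 3 x 5 array. The function name `encode_bits` is hypothetical; the minimum distance of a linear code equals the minimum weight of its nonzero codewords.

```python
from itertools import product

def encode_bits(bits):
    """Encode 8 data bits (D11..D14, D21..D24) into the 3 x 5 product code."""
    data = [list(bits[:4]), list(bits[4:])]
    # Row parity for the two data blades.
    grid = [row + [sum(row) % 2] for row in data]
    # Column parity forms the third blade.
    grid.append([sum(col) % 2 for col in zip(*grid)])
    return grid

# Weight (number of 1 bits) of every nonzero codeword.
weights = [
    sum(map(sum, encode_bits(msg)))
    for msg in product((0, 1), repeat=8)
    if any(msg)
]
min_distance = min(weights)  # 4
```

A single nonzero data bit sets that bit, its row parity, its column parity and P35, which gives the minimum weight of 4.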
FIG. 2 depicts exemplary array 100 of FIG. 1 having four HDD failures that are correctable as the product of two parity codes. In FIG. 2, HDDs D11, D12, D22 and D23 have failed, as indicated by an X through each failed HDD. While a nested array configured as a RAID5(RAID5) is also distance 4, in general, such a configuration cannot recover from the particular failure arrangement shown in FIG. 2. In the case of the nested array, blades 101 and 102 are both unable to correct the two failures with the inner RAID, and the outer RAID cannot recover from two blade failures. In contrast, a product code can recover from the failure arrangement shown in FIG. 2 because the HDDs are not viewed as virtual HDDs. HDD D11 is recovered by the column 111 parity, and HDD D23 is recovered by the column 113 parity. Following these two operations, blades 101 and 102 each have only a single failure, and can be recovered by the respective row parity. While stronger than a nested array, the product code is still only distance 4.
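The difference between the two decoding strategies for the FIG. 2 failure arrangement can be sketched as follows. The function names are hypothetical, each HDD is identified by its (blade, column) designator digits, and only recoverability is modeled, not the actual data reconstruction.

```python
from collections import Counter

def nested_recoverable(failed):
    """RAID5(RAID5): a blade with two or more failed HDDs is a failed
    virtual HDD, and the outer RAID 5 tolerates only one failed blade."""
    per_blade = Counter(blade for blade, _ in failed)
    dead_blades = [b for b, n in per_blade.items() if n >= 2]
    return len(dead_blades) <= 1

def product_recoverable(failed):
    """Product code: iteratively rebuild any failed HDD that is the only
    failure in its row or in its column, until no progress is made."""
    failed = set(failed)
    progress = True
    while failed and progress:
        progress = False
        for blade, col in sorted(failed):
            in_row = sum(1 for b, _ in failed if b == blade)
            in_col = sum(1 for _, c in failed if c == col)
            if in_row == 1 or in_col == 1:
                failed.discard((blade, col))
                progress = True
    return not failed

# FIG. 2: HDDs D11, D12, D22 and D23 have failed.
fig2 = {(1, 1), (1, 2), (2, 2), (2, 3)}
nested_ok = nested_recoverable(fig2)    # False
product_ok = product_recoverable(fig2)  # True
```

The sketch may rebuild the HDDs in a different order than the one described above, but the outcome is the same because each rebuild only reduces the set of remaining failures.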
In general, product codes of this type are called products of parity stripes, and are known. There are, however, many failure combinations that product codes cannot correct. For example, FIG. 3 depicts exemplary array 100 of FIG. 1 in which HDDs D12, D13, D22 and D23 have failed, as indicated by an X through each failed HDD. This particular set of disk failures is not correctable because the product of two parity codes does not provide a linearly independent set of parity equations for the four failed HDDs.
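The linear dependence can be made concrete with a sketch that restricts the three row-parity equations and five column-parity equations to the failed HDDs and computes the rank over GF(2). The function names are hypothetical. A failure set is correctable only if the rank equals the number of failed HDDs.

```python
def gf2_rank(vectors):
    """Rank over GF(2) of equations given as integer bitmasks."""
    vectors = list(vectors)
    rank = 0
    while vectors:
        pivot = vectors.pop()
        if pivot:
            rank += 1
            low = pivot & -pivot  # lowest set bit of the pivot
            vectors = [v ^ pivot if v & low else v for v in vectors]
    return rank

def erasure_rank(failed, blades=3, cols=5):
    """Rank of the parity equations restricted to the failed HDDs.

    Each HDD is a (blade, column) pair; bit i of an equation's mask is set
    when failed HDD i takes part in that row or column parity equation.
    """
    failed = sorted(failed)
    checks = []
    for b in range(1, blades + 1):
        checks.append(sum(1 << i for i, (fb, _) in enumerate(failed) if fb == b))
    for c in range(1, cols + 1):
        checks.append(sum(1 << i for i, (_, fc) in enumerate(failed) if fc == c))
    return gf2_rank(checks)

fig2 = {(1, 1), (1, 2), (2, 2), (2, 3)}  # D11, D12, D22, D23
fig3 = {(1, 2), (1, 3), (2, 2), (2, 3)}  # D12, D13, D22, D23
rank_fig2 = erasure_rank(fig2)  # 4: four unknowns, four independent equations
rank_fig3 = erasure_rank(fig3)  # 3: the four relevant equations sum to zero
```

For FIG. 3, only the equations for blades 101 and 102 and columns 112 and 113 involve failed HDDs, and each failed HDD appears in exactly two of them, so the four equations add up to zero and leave one unknown undetermined.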
What is needed is a technique for improving the fault tolerance of an array of HDDs beyond the capability of conventional product code techniques.