1. Field
The subject matter disclosed herein relates to storage systems. In particular, the subject matter disclosed herein relates to a method for configuring a storage system comprising a plurality of arrays of storage units and thereby increasing the number of storage-unit failures that the storage system can tolerate without loss of data stored in the system.
2. Description of the Related Art
The following definitions are used herein and are offered for purposes of illustration and not limitation:
An “element” is a block of data on a storage unit.
A “base array” is a set of elements that comprise an array unit for an ECC.
An “array” is a set of storage units that holds one or more base arrays.
A “stripe” is a base array within an array.
n is the number of data units in the base array.
r is the number of redundant units in the base array.
m is the number of storage units in the array.
d is the minimum Hamming distance of the array.
D is the minimum Hamming distance of the storage system.
Large storage systems typically comprise multiple separate arrays of storage units. Each array is conventionally protected against a certain number of storage-unit failures (also called erasures) by an Erasure (or Error) Correcting Code (ECC) in, for example, a mirroring configuration or a RAID 5 (Redundant Array of Independent Disks Level 5) configuration. ECCs provide redundant storage units that are local to each array, and increase reliability for a storage system by handling unit failures that are localized to a subset of the arrays.
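By way of illustration only, the single-parity protection used in a RAID 5 stripe can be sketched as follows. The sketch assumes a toy stripe of n = 3 data elements and r = 1 parity element, each two bytes long; the helper name xor_blocks and the sample byte values are hypothetical and are not taken from any cited reference.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the bytewise XOR of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Illustrative stripe: n = 3 data elements and r = 1 parity element.
data = [b"\x01\x02", b"\x10\x20", b"\xaa\x55"]
parity = xor_blocks(data)

# Simulate the erasure of one data element and rebuild it from the
# surviving data elements plus the parity element.
lost = 1
survivors = [blk for i, blk in enumerate(data) if i != lost] + [parity]
rebuilt = xor_blocks(survivors)
```

Because the stripe holds a single redundant element, any one erased element can be rebuilt in this way, whereas two simultaneous erasures in the same stripe cannot.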
Storage capacity of Hard Disk Drive (HDD)-based storage systems is increasing faster than component reliability is improving. Consequently, minimum Hamming distance d=2 schemes, such as RAID 5 and mirroring techniques, no longer provide adequate reliability at the system level. Alternative designs, such as RAID 6 (dual parity) at distance d=3, double mirroring at distance d=3, and RAID 51 at distance d=4, have been proposed to address deficiencies in system reliability. It is common practice in storage systems to provide spare units to decrease the system repair time and increase the maintenance interval. Adding spares, however, increases the cost of the system and decreases the storage efficiency.
Other approaches for improving system reliability include use of higher order parity codes. For example, J. S. Plank, “A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems,” Software—Practice & Experience, 27(9), September 1997, pp. 995-1012, discloses an example of a Reed-Solomon code.
Additionally, E. J. Schwabe et al., “Evaluating Approximately Balanced Parity-Declustering Layouts in Disk Arrays,” ACM 0-89791-813-4/96/05 1996, disclose data layouts for efficient positioning of redundant information for improved performance.
P. Chen et al., “RAID: High-Performance, Reliable Secondary Storage,” ACM Computing Surveys, Vol. 26, June 1994, pp. 145-185, provide an overview of RAID. M. Holland et al., “Parity Declustering for Continuous Operation In Redundant Disk Arrays,” Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-V), pp. 23-25, October 1992, disclose declustered parity for RAID systems. G. A. Alvarez et al., “Tolerating Multiple Failures in RAID Architectures,” ACM 0-89791-901-7/97/0006 1997, describe the properties and construction of a general multiple parity array using 8-bit finite fields.
U.S. Pat. No. 5,579,475 to M. M. Blaum et al., entitled “Method and Means for Encoding and Rebuilding the Data Contents of Up to Two Unavailable DASDs in a DASD Array Using Simple Non-Recursive Diagonal and Row Parity,” discloses the operation of a distance d=3 array. N. K. Ouchi, “Two-Level DASD Failure Recover Method,” IBM Technical Disclosure Bulletin, Vol. 36:03, March 1993, discloses operations required for reconstructing data from a distance d=3 array having failures.
Nevertheless, some array designs, such as product codes (including RAID 51), have vulnerabilities to certain patterns of storage unit failures. These arrays behave somewhat as if they possess local redundancy.
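The pattern sensitivity of a product code can be illustrated with a small enumeration. The sketch below assumes a simplified RAID 51 model: two mirrored RAID 5 arrays, A and B, of M = 4 units each, in which data is lost only when at least two unit positions have failed in both mirrors. The value of M, the loss rule, and all names are assumptions made for illustration.

```python
from itertools import combinations

M = 4  # units per RAID 5 array (illustrative assumption)

# Each unit is identified by its mirror side and its position in the array.
units = [(side, col) for side in "AB" for col in range(M)]

def loses_data(failed):
    """Assumed loss rule: a position is dead when both mirror copies have
    failed; RAID 5 parity can rebuild at most one dead position."""
    failed = set(failed)
    dead = [c for c in range(M) if ("A", c) in failed and ("B", c) in failed]
    return len(dead) >= 2

def count_loss_patterns(k):
    """Return (patterns that lose data, total patterns) for k failed units."""
    patterns = list(combinations(units, k))
    return sum(loses_data(p) for p in patterns), len(patterns)

three = count_loss_patterns(3)  # (0, 56): every 3-failure pattern survives
four = count_loss_patterns(4)   # (6, 70): only some 4-failure patterns lose data
```

Under these assumptions, no pattern of three failures loses data, which is consistent with distance d=4, while only 6 of the 70 possible patterns of four failures lose data. Whether a fourth failure is fatal therefore depends on where the failures fall, not merely on how many there are.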
What is needed is a technique to improve the reliability of a storage system by making the local redundancy in an array globally available throughout a system of arrays. Additionally, what is needed is a technique to improve the reliability of a storage system that is sensitive to patterns of storage-unit failures. Further still, what is needed is a technique that allows maintenance of the storage system to be deferred for considerably longer than is possible with a conventional storage system.