1. Technical Field
The present invention relates to providing protection from data loss in an array of storage devices. More particularly, the invention concerns reducing the probability of data loss due to clustered storage device failures in an array of storage devices.
2. Description of Related Art
Important data is often stored in storage devices in computing systems. Because storage devices can fail and data in failed storage devices can be lost, techniques have been developed for preventing data loss and for restoring data when one or more storage devices fail.
One technique for preventing data loss comprises storing parity information on a storage device (such as a disk drive), which is a member of a storage array, and storing data on one or more of the other storage devices in the array. (Herein a disk drive may be referred to as a “disk”.) With this technique, if a storage device fails, parity information can be used to reconstruct the data that was on the failed storage device. Moreover, if sufficient parity information is added to another storage device, the additional parity information may be used to reconstruct data stored on more than one failed storage device.
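The parity technique described above can be illustrated with a minimal sketch. It assumes the parity is a simple byte-wise XOR across equal-sized blocks, which is one common choice; the passage above does not name a specific parity code. The helper name `xor_blocks` and the sample data are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data devices and one parity device (illustrative contents only).
data = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data)

# Simulate the loss of device 1, then rebuild it from the survivors plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])
```

With a single XOR parity block, only one lost block per group can be rebuilt this way; recovering from more than one failure requires additional, independent parity information, as noted above.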
Another technique for preventing data loss, called data mirroring, comprises making a duplicate copy of data on a separate storage device. With this technique, if a storage device fails, data can be restored from the copy of the data. Individual storage devices, or entire arrays of storage devices, may be mirrored to protect data.
Data mirroring and parity information storage, or a combination of the two, may be implemented on a Redundant Array of Inexpensive (or Independent) Disks (RAID), which may be used to provide a data storage system that has increased performance and capacity. Also, a technique called striping may be utilized with RAID arrays, wherein data records and parity information are divided into strips such that the number of strips equals the number of disks in the array. Each strip is written, or “striped”, to a different one of the disks in the RAID array, to balance the load across the disks and to improve performance. A group of strips comprising one pass across all of the drives in a RAID is called a stride. Several RAID protocols have been devised, wherein different mirroring, parity, and striping arrangements are employed. As an example, in a RAID 5 array consisting of six disks, each stride comprises five data strips and one parity strip striped across the six disks, with the position of the parity strip rotated across the disks from stride to stride. The rotation of the parity across the disks ensures that parity updates to the array are shared across the disks. RAID 5 provides a redundancy of one (corresponding to a Hamming distance of two), which means that all data can be recovered if any one and only one of the disks in the array fails.
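The six-disk RAID 5 example above can be sketched as follows. The rotation order used here (parity starting on the last disk and moving one disk to the left with each stride) is an assumption made for illustration; actual RAID 5 implementations use several different rotation conventions. The function names `parity_disk` and `stride_layout` are hypothetical.

```python
def parity_disk(stride, num_disks):
    """Index of the disk holding the parity strip for a given stride.

    Assumed rotation: parity starts on the last disk and moves one disk
    to the left for each successive stride.
    """
    return (num_disks - 1 - stride) % num_disks


def stride_layout(stride, num_disks):
    """Return the strip labels for one stride, one entry per disk."""
    parity_index = parity_disk(stride, num_disks)
    layout = []
    data_strip = 0
    for disk in range(num_disks):
        if disk == parity_index:
            layout.append("P")
        else:
            layout.append(f"D{data_strip}")
            data_strip += 1
    return layout


# Six strides on a six-disk array: each disk holds parity exactly once.
for stride in range(6):
    print(stride, stride_layout(stride, 6))
```

Over any six consecutive strides, each disk holds the parity strip exactly once, which is how the parity update load is shared across the disks.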
Drive failures in general, and clustered failures in particular, are intrinsic characteristics of specific drive products, and are a function of design characteristics as well as a number of other factors, such as the quality of manufacture and the drive's sensitivity and reliability as a function of environment and workload. Some designs are robust and have no clustering phenomena, while others exhibit problematic clustered failure characteristics. For example, some designs may be subject to simultaneous failures within a range of power-on hours. Others may exhibit clustering with entirely different time scales and triggering mechanisms. For example, some designs may operate without problems but then become susceptible to clustered failures if power to the drives is cycled.
RAID schemes that provide higher data redundancy, such as RAID 6, RAID 51, Symmetric RAID (n+n), and double or higher mirroring, are becoming increasingly necessary to reduce the probability of data loss as a consequence of normal drive failure rates. These higher-redundancy schemes generally require an increase in the number of disk drives, or alternatively are achieved at a significant loss in effective capacity. For example, a user may opt to go from a 5-disk RAID 5 array to a 10-disk RAID 51 array, wherein the RAID 5 array is mirrored. As another example, RAID 6 has an arrangement similar to RAID 5, but requires two parity strips in each stride, to provide a redundancy of two. Consequently, for the same data storage capacity, the storage efficiency of a RAID 6 array is lower than that of a RAID 5 array, because the RAID 6 array requires an additional disk. Although these RAID schemes provide increased protection from data loss, they often do not provide sufficient redundancy to permit recovering from a clustering of failures for a particular drive product, wherein a number of drives fail simultaneously or during a short period of time. For example, although some of these RAID schemes provide a Hamming distance of up to 4, these schemes are not capable of addressing clustered failures when more than 3 drives fail in a short period of time. Consequently, known techniques are inadequate for preventing data loss when clustered storage device failures occur.
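The trade-off between redundancy and effective capacity can be made concrete with a small sketch. The disk counts below follow the examples above (a 5-disk RAID 5 array, a RAID 6 array with one additional disk for the same data capacity, and a 10-disk RAID 51 array). The sketch assumes that the Hamming distance is one more than the number of arbitrary disk failures a scheme is guaranteed to tolerate, which matches the statement that a distance of 4 does not cover more than 3 failures. The class and attribute names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    name: str
    total_disks: int
    data_disks: int          # disks' worth of user data
    tolerated_failures: int  # any this many disks may fail without data loss

    @property
    def efficiency(self):
        """Fraction of raw capacity available for user data."""
        return self.data_disks / self.total_disks

    @property
    def hamming_distance(self):
        # Assumed convention: distance = guaranteed tolerated failures + 1.
        return self.tolerated_failures + 1

    def survives_any(self, failed_disks):
        """True if data survives every possible set of this many failed disks."""
        return failed_disks <= self.tolerated_failures


SCHEMES = [
    Scheme("RAID 5 (5 disks)", total_disks=5, data_disks=4, tolerated_failures=1),
    Scheme("RAID 6 (6 disks)", total_disks=6, data_disks=4, tolerated_failures=2),
    Scheme("RAID 51 (10 disks)", total_disks=10, data_disks=4, tolerated_failures=3),
]

cluster_size = 4  # illustrative clustered failure: four drives in a short period
for scheme in SCHEMES:
    print(
        f"{scheme.name}: efficiency={scheme.efficiency:.2f}, "
        f"distance={scheme.hamming_distance}, "
        f"survives any {cluster_size} failures: {scheme.survives_any(cluster_size)}"
    )
```

Under these assumptions, none of the three schemes is guaranteed to survive a cluster of four failures, even though the RAID 51 array already gives up more than half of its raw capacity to redundancy.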