1. Field of the Invention
The present invention relates to storage systems. More particularly, the present invention relates to a system, a method and a storage format that provides protection against uncorrectable media errors.
2. Description of the Related Art
FIG. 1 shows an exemplary high-RPM Hard Disk Drive (HDD) 100 having a two-stage servo system for positioning a magnetic read/write head (or recording slider) 101 over a selected track on a magnetic disk 102. The two-stage servo system includes a voice-coil motor (VCM) 103 for coarse positioning of a read/write head suspension 104 and a microactuator, or micropositioner, for fine positioning of read/write head 101 over the selected track in a well-known manner. Binary data is stored on magnetic disk 102 by selectively orienting magnetization in user data fields in the magnetic media of disk 102.
The two primary sources of data loss from HDDs, such as the exemplary HDD shown in FIG. 1, are disk drive failure and uncorrectable media error. Data loss has been conventionally prevented by configuring storage systems having an array of multiple HDDs in a RAID configuration in which data is striped across multiple HDDs. Redundancy is built into the striping so that should any HDD fail, the data belonging to the failed HDD can be reconstructed from the remaining drives of the storage system.
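The reconstruction property described above can be illustrated with a minimal sketch, assuming single-parity (XOR) redundancy of the kind used in RAID 5. The strip contents, the four-drive stripe layout, and the helper names `xor_blocks` and `reconstruct` are hypothetical and chosen only for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical stripe: three data strips plus one parity strip, one per drive.
data_strips = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_strips)
stripe = data_strips + [parity]

def reconstruct(stripe, failed_index):
    """Rebuild the strip of the failed drive from all surviving strips."""
    survivors = [s for i, s in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)

# Losing any single drive (data or parity) is recoverable.
rebuilt = reconstruct(stripe, 1)
```

Because the XOR of all strips in a stripe is zero, any one missing strip equals the XOR of the others. Every surviving strip must be readable for this to work, which is why a media error during a rebuild is fatal for the affected stripe.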
HDD storage capacities have been increasing at a rate of between 60 and 100 percent per year. The probability of uncorrectable read errors, however, has been relatively constant at about 1 uncorrectable read error in 10^14 bits. Accordingly, as HDD storage capacities have increased, the probability of data loss due to an uncorrectable media error has become a significant factor.
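The scaling argument can be made concrete with a rough sketch. It assumes that uncorrectable errors occur independently at the stated rate of 1 in 10^14 bits (a Poisson approximation that is not taken from the source) and that capacities are in decimal gigabytes. The function name `p_unrecoverable_read` is hypothetical.

```python
import math

# One uncorrectable read error per 10^14 bits read (rate quoted in the text).
ERRORS_PER_BIT = 1e-14

def p_unrecoverable_read(capacity_bytes, drives_read=1):
    """Probability of at least one uncorrectable error when reading
    `drives_read` full drives, assuming independent errors (Poisson model)."""
    bits_read = capacity_bytes * 8 * drives_read
    expected_errors = bits_read * ERRORS_PER_BIT
    return -math.expm1(-expected_errors)  # equals 1 - exp(-expected_errors)

GB = 10**9  # assumption: decimal gigabytes

# Reading one 300 GB drive end to end.
p_one_drive = p_unrecoverable_read(300 * GB)

# Reading seven 300 GB drives, as a rebuild of an 8-drive single-parity array would.
p_seven_drives = p_unrecoverable_read(300 * GB, drives_read=7)
```

Under these assumptions the probability grows almost linearly with the number of bits read, so each doubling of drive capacity roughly doubles the chance of hitting an uncorrectable error during a full-drive read.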
Multiple-HDD storage systems configured as RAID level 5 systems are commonly deployed in the industry and can tolerate the loss of a single HDD. While a failed HDD is being rebuilt, however, a second HDD failure or an uncorrectable media error on any of the remaining HDDs will result in data loss. Data loss caused by a second HDD failure is referred to as an “array loss,” while data loss caused by an uncorrectable media error is referred to as a “strip kill.” It is estimated that there will be 1.48 array losses and 2570 strip kills in a one-year period for an installed base of one million 300 GB HDDs that are configured in 8-drive RAID 5 array systems, with each HDD having an MTBF of 500,000 hours. It should be noted that over 90% of media errors affect single sectors. About 5% of media errors affect two to four sectors. Very few media errors affect multiple (seven or more) sectors.
Techniques have been proposed for reducing the probability of data loss. In particular, RAID-type protection techniques have been developed for protecting against drive failure by increasing the redundancy of the array using levels (such as RAID 51, RAID 6, RAID (3+3) and so on). When a RAID level is chosen for a storage system, factors that are considered include storage efficiency, reliability and performance. Optimizing any one of these three factors causes at least one of the other factors to become less than optimal.
Table 1 is a comparison of the different conventional RAID techniques.
TABLE 1

                            RAID 5   RAID 51        RAID (3+3)      RAID 6         RAID N+3
Drives/array                8        16             6               16             16
Storage efficiency          87.5%    43.75%         50%             87.5%          81.25%
Annual strip kill events    2570     6.17 × 10^-7   8.55 × 10^-7    1.61           7.53 × 10^-4
Annual array loss events    1.48     5.01 × 10^-8   3.56 × 10^-10   2.41 × 10^-3   1.51 × 10^-6
Performance (IOs/write)     4        6              6               6              8
The parameters on which the reliability calculations in Table 1 are based are an installed base of one million 300 GB disk drives, each having an MTBF of 500,000 hours and a hard error rate of 1 error in 10^14 bits.
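The storage-efficiency row of Table 1 follows from simple arithmetic: the fraction of drives in an array that hold user data. The sketch below reproduces that row. The redundant-drive counts are assumptions inferred from the usual meaning of each RAID level (for example, RAID 51 is taken to be a mirrored pair of 8-drive RAID 5 arrays); they are not stated in the text.

```python
# (drives per array, drives' worth of redundancy); the redundancy counts are
# assumptions inferred from the conventional definition of each RAID level.
configs = {
    "RAID 5": (8, 1),       # one parity drive
    "RAID 51": (16, 9),     # mirrored 8-drive RAID 5: 7 of 16 drives hold unique data
    "RAID (3+3)": (6, 3),   # three data drives, three redundant drives
    "RAID 6": (16, 2),      # two parity drives
    "RAID N+3": (16, 3),    # three redundant drives
}

def storage_efficiency(total_drives, redundant_drives):
    """Fraction of raw capacity available for user data."""
    return (total_drives - redundant_drives) / total_drives

efficiency = {name: storage_efficiency(*cfg) for name, cfg in configs.items()}

for name, value in efficiency.items():
    print(f"{name}: {value:.2%}")
```

Under these assumptions the computed values match the storage efficiencies listed in Table 1.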
As can be seen from Table 1, a RAID 6 system configuration provides adequate protection against array loss, exhibiting only 2.41 × 10^-3 array loss events per year. The number of strip kills (i.e., 1.61 strip kill events per year), however, is too high to meet the requirements of high-end storage systems. Adding another level of protection comes at a price, such as reduced storage efficiency (i.e., a RAID (3+3) or a RAID 51 system configuration) or reduced performance (i.e., a RAID N+3 system configuration).
While RAID-type protection techniques have been developed for protecting against drive failure, RAID techniques do not protect well against uncorrectable media errors, and they offer only coarse granularity and sub-optimal tradeoffs among storage efficiency, reliability and performance. Consequently, what is needed is a technique that provides protection against uncorrectable media errors.