Data storage utilization is continually increasing, causing the proliferation of storage systems in data centers. Hard disk drives are the primary storage media in enterprise environments. Despite the central role of hard disks in storing precious data, they are among the most vulnerable hardware components in a computer system. Storage systems have relied on redundancy mechanisms such as RAID to tolerate disk failures. However, RAID's protection is weakened by the fault model that modern disk drives present. For example, in production systems many disks fail at a similar age, which means RAID systems face a high risk of multiple whole-disk failures. Likewise, the increasing frequency of sector errors in working disks means RAID systems face a high risk of reconstruction failure. In short, RAID's passive protection is not robust enough in the face of these new challenges.
Much previous work on RAID has focused on improving redundancy schemes to tolerate more simultaneous disk failures. However, some data analyses reveal that the likelihood of simultaneous whole-disk failures increases considerably at certain disk ages. Further, the accumulation of sector errors contributes to whole-disk failure, causing disk reliability to deteriorate continuously. Hence, ensuring data reliability in the worst case requires adding considerable extra redundancy, which makes the traditional passive approach to RAID protection unattractive from a cost perspective.