1. Field of the Invention
The present invention generally relates to a system and method for recovering data from lost sectors in a storage system (e.g., storage networks, storage nodes, disk array controllers, etc.), and more particularly to a system and method for identifying lost sectors, determining which lost sectors have recoverable data, and generating formulas for recovering the data from those sectors.
2. Description of the Related Art
Generally, erasure codes (e.g., RAID schemes) are fundamental tools for providing data reliability in storage systems in the presence of unreliable disks. Conventionally, RAID4 and RAID5 systems protect against the loss of one disk or against unaligned sector loss (i.e., not more than one lost sector per horizontal slice). Erasure codes that tolerate two disk failures have begun to be deployed. However, better fault tolerance will be needed as more systems move to Advanced Technology Attachment (ATA) drives (e.g., non-Small Computer System Interface (non-SCSI) drives).
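By way of illustration only, the single-loss protection described above may be sketched as follows. The sketch assumes the parity sector of a horizontal slice is the bytewise XOR of the data sectors in that slice, as is customary for RAID4 and RAID5; the function names `xor_blocks` and `recover_lost_sector` and the sample sector contents are hypothetical and are not taken from any particular system.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of equal-length byte strings."""
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_lost_sector(stripe, lost_index):
    """Rebuild one lost sector of a slice from all surviving sectors.

    `stripe` holds the data sectors followed by the parity sector
    (hypothetical layout). Because the XOR of every sector in the slice
    is zero, the lost sector equals the XOR of the survivors.
    """
    survivors = [s for i, s in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


# Hypothetical slice: three 2-byte data sectors plus one parity sector.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)
stripe = data + [parity]

# Lose the second data sector and rebuild it from the other three sectors.
rebuilt = recover_lost_sector(stripe, 1)
```

If two sectors of the same slice are lost, the XOR of the survivors yields only the XOR of the two missing sectors, which is why a single level of redundancy tolerates only one loss per slice.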
Erasure codes such as RAID4 and RAID5 rely on a single level of redundancy (e.g., see P. Massiglia, The RAID Book, St. Peter, Minn.: The RAID Advisory Board, Inc., 1997, which is incorporated herein by reference in its entirety) and so can protect against only a single disk failure. Other published algorithms employed by conventional systems and methods are implemented only for the “two disk” loss failure scenario. That is, each specific 2-fault-tolerant erasure code generally is published with a specific algorithm for recovery in the “two disk lost” case. More generally, erasure codes that tolerate T failed disks are published with descriptions of how to recover the entire data on any T lost disks. In particular, the Reed-Solomon scheme generally is employed, which uses linear algebra over finite fields to solve the “T disk lost” case. However, this scheme is very complicated and typically requires either additional special-purpose hardware or complicated and expensive software.
Though conventional systems recover data from entire lost disks, there is a higher probability that disks fail only partially. For example, a medium error or hard error on a disk implies loss of access only to the data stored on the failing sector or sectors. A sector loss occurs when the disk containing that sector fails or when the disk returns an error when reading from or writing to that sector. In many conventional systems, such sector losses are treated as disk losses so that the known and published recovery algorithms can be applied. If the sector losses are scattered across the disks, in particular over more disks than the erasure code can tolerate, the published recovery algorithms do not apply. In general, such systems will declare a “data loss event”, indicating that the data on the lost sectors cannot be recovered from the available data in the system. In some cases, for example RAID4 and RAID5, it is easy to determine whether scattered lost sectors have recoverable data: if any two lost sectors are at the same horizontal offset on their respective disks, then their data cannot be recovered (a data loss event); otherwise, the data can be recovered. For all other conventional systems, such a determination is neither obvious nor available in the published literature. In general, a data loss event declaration may be made by the system even though certain lost data may in fact be recoverable by a method that goes beyond the published algorithms.
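By way of illustration only, the RAID4/RAID5 determination described above may be sketched as follows. The sketch assumes each lost sector is reported as a (disk, offset) pair; this representation and the function name `classify_lost_sectors` are hypothetical. The sketch applies only to codes with a single level of redundancy and does not address the general case.

```python
from collections import Counter


def classify_lost_sectors(lost_sectors):
    """Apply the single-redundancy rule to scattered lost sectors.

    `lost_sectors` is an iterable of (disk, offset) pairs (hypothetical
    representation). Returns a pair (recoverable, bad_offsets), where
    bad_offsets lists every horizontal offset at which two or more
    sectors are lost, i.e. the offsets that cause a data loss event.
    """
    # Deduplicate reports, then count lost sectors per horizontal offset.
    per_offset = Counter(offset for _disk, offset in set(lost_sectors))
    bad_offsets = sorted(o for o, n in per_offset.items() if n >= 2)
    return (not bad_offsets), bad_offsets


# Losses on three different disks, all at different offsets: recoverable.
scattered = classify_lost_sectors([(0, 5), (1, 7), (2, 9)])

# Two losses share offset 5 on different disks: a data loss event.
aligned = classify_lost_sectors([(0, 5), (3, 5), (1, 7)])
```

In the first case the lost sectors span three disks, more than a single-redundancy code tolerates as whole-disk failures, yet every sector remains recoverable because no horizontal slice has lost more than one sector.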