Stored data may be protected against storage media failures or other loss by storing extra copies, by storing additional redundant information, or in other ways. One type of redundancy-based protection involves using erasure coding. Erasure coding uses additional redundant data to produce erasure codes (EC) that protect against ‘erasures’. An erasure may be an error with a location that is known a priori. The erasure codes allow data portions that are lost to be reconstructed from the surviving data. The application of erasure codes to data storage may typically have been for the purpose of recovering data in the face of failures of hardware elements storing the data. Tape cartridges using Dual Reed-Solomon erasure coding can achieve a bit error rate (BER) significantly lower than hard disk drives (HDD). For example, using random error distribution, linear tape open (LTO) 6 tapes may achieve a BER of 1 in 10^17 or even 1 in 10^19 bits.
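The idea that lost portions at known locations can be rebuilt from surviving data may be illustrated with a minimal sketch. The sketch uses a single XOR parity block, which is the simplest possible erasure code and is chosen here only for illustration; it is not the Dual Reed-Solomon coding used by tape cartridges. The function names xor_blocks, encode, and reconstruct are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Return the data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def reconstruct(stripe, erased_index):
    """Rebuild the single block at a known erased position.

    Because the location of the erasure is known a priori, XOR-ing
    all surviving blocks (data and parity) yields the missing block.
    """
    survivors = [b for i, b in enumerate(stripe) if i != erased_index]
    return xor_blocks(survivors)


# Illustrative use: erase one block, then rebuild it from the rest.
stripe = encode([b"tape", b"data", b"demo"])
original = stripe[1]
stripe[1] = None  # the erasure: contents lost, position known
assert reconstruct(stripe, 1) == original
```

A single parity block tolerates only one erasure per stripe. Codes such as Reed-Solomon generalize the same principle so that several erasures can be tolerated.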
However, like HDDs, tapes exhibit non-Gaussian error modes that dominate the mean time between failures (MTBF). Tape drives often encounter errors during reading, including off-track errors, media data errors, damaged tape, deteriorated tape, host drive speed mismatches, and other hardware and firmware problems. Conventional tape drives retry a read when an error is encountered. Retries result in repetitive repositioning, which, combined with the high speeds of tape drives, leads to further deterioration and damage to the tape. The damage may include tape surface damage and air entrainment problems, which in turn lead to even more errors. Conventional tape formats do not have useful approaches to deal with hard read errors, other than retries with repositioning. Thus, if the data in the damaged section of tape cannot be read, conventional tape systems give up, even though the rest of the data on the tape is fine. Conventional systems therefore rely on tape backup copies to recover original data, at the cost of overhead. However, the backup copies are also subject to the same errors, which may result in multiple unusable tape cartridges within a data storage system.
Erasure codes are often used to increase data storage durability, but come with the cost of overhead. However, the conventional deployment of erasure codes does not protect data from localized damage to tapes that is beyond the power of the tape system's internal error correction to repair. Conventional tape systems thus make multiple copies of cartridges, also known as replication, to achieve required levels of durability. For example, to achieve enterprise levels of durability, a conventional tape data storage system, even assuming errors were random, would require multiple copies of data. However, critical tape errors are not uniformly random.
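The relationship between replication and durability under the random-error assumption may be sketched as follows. The sketch assumes that each copy is lost independently with the same probability and that data is lost only when every copy is lost. The per-copy loss probability used in the example (0.01) is a hypothetical value chosen only for illustration and is not taken from any measurement; the function names are likewise hypothetical. Because critical tape errors are not uniformly random, a real system may fare worse than this model suggests.

```python
import math


def nines_of_durability(p_loss, copies):
    """Nines of durability with `copies` independent replicas.

    Assumes data is lost only if all copies are lost, so the loss
    probability is p_loss ** copies and the number of nines is
    -log10 of that value.
    """
    return -copies * math.log10(p_loss)


def copies_needed(p_loss, target_nines):
    """Smallest number of independent copies reaching target_nines."""
    return math.ceil(target_nines / -math.log10(p_loss))


# Hypothetical per-copy loss probability, for illustration only.
P_LOSS = 0.01
print(copies_needed(P_LOSS, 11))
print(nines_of_durability(P_LOSS, copies_needed(P_LOSS, 11)))
```

The model makes the cost of replication visible: every additional copy adds the same fixed number of nines while consuming a full extra cartridge's worth of storage.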
LTO's internal error correction coding (ECC) system as used in conventional systems cannot efficiently deal with many types of hard errors, including lost cartridges, cut tapes, lost pins, environmental issues, loss of magnetic coating, shock and vibration, edge damage, debris and particles, magnetic coating wear, and staggered wraps. For example, if a conventional system loses a cartridge because a robot dropped the cartridge or someone stole it, the data is gone, regardless of the BER or the ECC system employed. To handle these kinds of hard errors and achieve eleven nines or more of durability, conventional systems need at least six copies, potentially residing at different sites, which is costly and poses a significant tape management challenge. For example, if a file is distributed over 4 tapes to increase transfer rates but still needs to be replicated 6 times to achieve the desired durability, the system would need 24 tapes, which is not an optimal solution. Availability issues for a tape cartridge may occur at the tape level (e.g., lost tape, damaged tape) or at a system level (e.g., tape library robot down, unavailable).
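The tape count in the example above follows from simple arithmetic, which may be sketched as below. The function names tapes_for_replication and storage_overhead are hypothetical and are introduced only to make the calculation explicit.

```python
def tapes_for_replication(stripe_width, copies):
    """Total tapes when a file spread over `stripe_width` tapes
    is replicated `copies` times."""
    return stripe_width * copies


def storage_overhead(stripe_width, copies):
    """Extra tapes beyond one copy, as a fraction of one copy."""
    total = tapes_for_replication(stripe_width, copies)
    return (total - stripe_width) / stripe_width


# The example from the text: 4 tapes per file, 6 copies.
print(tapes_for_replication(4, 6))  # 24 tapes in total
print(storage_overhead(4, 6))       # 5.0, i.e. 500% overhead
```

Under replication the overhead depends only on the number of copies, so distributing a file over more tapes for transfer rate multiplies the cartridge count without improving durability.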