The present invention relates generally to storage systems, and more specifically, to nested multiple erasure correcting codes for storage arrays.
Computer systems utilize data redundancy schemes such as parity computation to protect against loss of data on a storage device. In redundant arrays of independent disks (RAID) systems, data values and related parity values are striped across disk drives. RAID systems are typically used to protect information stored in hard disk drive (HDD) arrays from catastrophic disk failures. Two popular RAID schemes are RAID 5, which protects against a single catastrophic disk failure, and RAID 6, which protects against a double catastrophic disk failure.
Flash devices are a type of non-volatile storage device that can be electrically erased and reprogrammed in large blocks. Like HDDs, flash devices divide the medium into sectors that are typically 512 bytes. Flash devices further collect sectors into pages, typically with eight sectors per page, so that each page contains 4,096 bytes, or 4 kilobytes (KB). Each sector is protected by an error correcting code (ECC) that corrects a number of errors (typically single-bit errors, although other possibilities, such as byte errors, are also feasible). A popular choice is a Bose-Chaudhuri-Hocquenghem (BCH) code, such as an eight-bit-correcting or fifteen-bit-correcting BCH code, although many variations are possible. As in HDDs, pages in flash devices may suffer hard errors (HEs). This occurs, for example, when the error correcting capability of the BCH code in a sector of the page is exceeded. Compared to HDDs, exceeding the capability of the BCH code is more likely in flash devices, both as a page nears the end of its write endurance lifetime and as a page nears the end of its data retention lifetime. Thus, the number of HEs in flash devices may be expected to grow over time, leaving latent HEs on a device.
An array made up of flash devices may encounter a mix of catastrophic device failures combined with possibly more prevalent HEs. For example, use of RAID 5 for protecting information stored in flash devices may result in data loss when a device fails while latent HEs are present. Specifically, if one device in a RAID 5 system experiences a catastrophic device failure and some other device has an HE in a page, the information in the row where both events occur cannot be retrieved. RAID 6 tolerates this combination, but it requires dedicating an entire second device to parity, which is expensive when the predominant failures are HEs.
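The failure mode described above can be illustrated with a minimal sketch of single-parity (RAID 5 style) recovery on one row. The sketch assumes a plain bytewise XOR parity page per row; the function names `xor_pages` and `recover_row`, the four-device layout, and the two-byte page contents are hypothetical and chosen only for illustration.

```python
from functools import reduce
from typing import List, Optional


def xor_pages(pages: List[bytes]) -> bytes:
    """Bytewise XOR of equally sized pages."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pages))


def recover_row(row: List[Optional[bytes]]) -> Optional[List[bytes]]:
    """Rebuild a row that holds data pages plus one XOR parity page.

    None marks an unreadable page (failed device or hard error).
    Returns the repaired row, or None when more than one page is missing.
    """
    missing = [i for i, page in enumerate(row) if page is None]
    if len(missing) > 1:
        return None  # a single parity page cannot resolve two erasures
    repaired = list(row)
    if missing:
        survivors = [page for page in row if page is not None]
        # The XOR of all surviving pages equals the one missing page.
        repaired[missing[0]] = xor_pages(survivors)
    return repaired


# Hypothetical 4-device row: three data pages and one parity page.
data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
row = data + [xor_pages(data)]

# Device 1 fails: the row is still recoverable.
one_lost = recover_row([row[0], None, row[2], row[3]])

# Device 1 fails and device 2 has a hard error in this row: unrecoverable.
two_lost = recover_row([row[0], None, None, row[3]])
```

In this sketch, `one_lost` equals the original row, while `two_lost` is `None`, since one parity page provides only one equation per row and two unknown pages cannot be solved from it.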