The reduction in size of semiconductor process geometries and the use of multi-level cell techniques in NAND flash memory technology have significantly increased the bit density and reduced the cost per bit, resulting in the widespread adoption of flash memory-based storage for diverse applications from personal mobile devices to enterprise storage systems. The increase in bit density brings a reduction in reliability and robustness, with an increased propensity for errors and reduced longevity. Flash memory error rates are known to increase with: (i) the number of program/erase cycles the memory has been subjected to; (ii) the length of time the data has been stored; and (iii) the use of smaller cell geometries and multi-level cell techniques. As a result, stronger Error Correcting Codes (ECCs) to detect and correct bit errors are required to compensate for the increased error rates and reduced longevity.
NAND flash memories generally employ a systematic ECC, which is formed by adding redundant bits (often called parity bits) to the data bits according to a deterministic encoding algorithm. The original data bits, along with the extra parity bits, are then stored in the memory. Upon reading, the data bits and parity bits are passed through a decoding algorithm, which provides either data bits that have been corrected for errors or an indication that an uncorrectable error has been detected (when the number of errors exceeds the error correcting capability of the code). In the latter case, it is generally not possible to say which data bits are correct and which are not, and a read failure occurs. Such codes are also called Forward Error Correcting (FEC) codes, as the parity bits are added in advance of the data actually being stored.
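The encode, store, and decode flow described above can be illustrated with a very small systematic code. The following is a minimal sketch only, assuming an extended Hamming (8,4) code chosen for brevity; it is far weaker than the codes used in actual NAND controllers, and the function names `encode` and `decode` and the bit layout are illustrative assumptions rather than any particular device's implementation.

```python
from typing import List, Optional

# Hypothetical layout: [d1, d2, d3, d4, p1, p2, p3, p0].
# The data bits appear unchanged at the front, which is what makes the code systematic.
# Syndrome (s1, s2, s3) -> index of the single flipped bit among the first seven bits.
_SYNDROME_TO_INDEX = {
    (1, 1, 0): 0, (1, 0, 1): 1, (0, 1, 1): 2, (1, 1, 1): 3,  # data bits d1..d4
    (1, 0, 0): 4, (0, 1, 0): 5, (0, 0, 1): 6,                # parity bits p1..p3
}


def encode(data: List[int]) -> List[int]:
    """Append three Hamming parity bits and one overall parity bit to 4 data bits."""
    d1, d2, d3, d4 = data
    word = [d1, d2, d3, d4, d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d2 ^ d3 ^ d4]
    overall = 0
    for bit in word:
        overall ^= bit
    return word + [overall]


def decode(stored: List[int]) -> Optional[List[int]]:
    """Return the corrected data bits, or None for a detected uncorrectable error."""
    word = list(stored)
    d1, d2, d3, d4, p1, p2, p3, _ = word
    syndrome = (p1 ^ d1 ^ d2 ^ d4, p2 ^ d1 ^ d3 ^ d4, p3 ^ d2 ^ d3 ^ d4)
    overall = 0
    for bit in word:
        overall ^= bit  # 0 for an even number of bit flips, 1 for an odd number
    if syndrome == (0, 0, 0):
        # No error, or only the overall parity bit itself was flipped.
        return word[:4]
    if overall == 1:
        # Single-bit error: the syndrome identifies the position, so flip it back.
        word[_SYNDROME_TO_INDEX[syndrome]] ^= 1
        return word[:4]
    # Non-zero syndrome with even overall parity: two bits flipped, beyond the
    # correcting capability of this code, so report a read failure.
    return None
```

In this sketch a single flipped bit is corrected transparently, while two flipped bits produce `None`, the counterpart of the uncorrectable-error indication described above. Three or more flipped bits may be silently mis-decoded.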
As the bit density of data stored in NAND flash memory has increased, so has the complexity and sophistication of the ECC algorithms employed, from Hamming codes for SLC (Single-Level Cell) memories, to Bose-Chaudhuri-Hocquenghem (BCH) codes for MLC (Multi-Level Cell) memories, to Low Density Parity Check (LDPC) codes for TLC (Triple-Level Cell) and sub-20 nanometer cell geometry memories.
It is also possible for the number of errors to exceed the error correcting capability of the ECC to such an extent that the decoding process mis-decodes the data to a completely different set of data bits than those originally encoded. In some rare cases, it may even be possible with some ECC schemes (notably LDPC codes) for the number of errors to be quite small, yet for the decoding process to mis-decode the data in a similar way. To detect these decoding errors, checksum bits are generally added to the original data using, for example, Cyclic Redundancy Check (CRC) code bits, or by adding an outer layer consisting of a secondary ECC scheme (particularly when LDPC is used). After the data is decoded using the ECC, a final check of the correctness of the data is made using the CRC decoder or the secondary ECC scheme. Even if the ECC decoder apparently decodes correctly, if the CRC or secondary ECC decoding fails, an unrecoverable error is returned. Hence, despite the increased error detection and correction capability of the more sophisticated ECC algorithms employed in combination with CRC checksum bits or secondary ECC schemes, it still remains possible for unrecoverable errors to occur.
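The final correctness check described above can be sketched as follows. This is an illustration under stated assumptions: it uses the CRC-32 from Python's standard `zlib` module, whereas the CRC polynomial and width used in a real controller may differ, and the names `add_crc`, `check_crc`, and `UnrecoverableReadError` are hypothetical.

```python
import zlib


class UnrecoverableReadError(Exception):
    """Raised when the ECC decoder output fails the final CRC check."""


def add_crc(data: bytes) -> bytes:
    """Append a 4-byte CRC-32 to the data before it is passed to the ECC encoder."""
    return data + zlib.crc32(data).to_bytes(4, "big")


def check_crc(decoded: bytes) -> bytes:
    """Verify the CRC on the ECC decoder's output and return the payload."""
    payload, stored_crc = decoded[:-4], decoded[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != stored_crc:
        # The ECC decoder reported success, but the data does not match its
        # checksum: treat this as a mis-decode and fail the read.
        raise UnrecoverableReadError("CRC mismatch after ECC decoding")
    return payload
```

Here `check_crc` would be called on whatever the ECC decoder returns, so a mis-decode that the ECC itself cannot see is still very likely to surface as a read failure rather than as silently corrupted data.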
An open-channel SSD is a type of SSD that can leave certain aspects of the management of the physical solid-state storage, such as the Flash Translation Layer (FTL), to the host device to which the open-channel SSD is connected. The ECC may either be implemented at the device level or left to the host to handle. The Linux™ 4.4 kernel is an example of an operating system kernel that supports open-channel SSDs which follow the NVM Express™ specification, by providing an abstraction layer called LightNVM.
Moving from a single SSD to a storage appliance comprising an array of SSDs, the biggest threat at the array level is disk failure, which may occur when a read operation fails (typically when an uncorrectable error occurs) or when an entire SSD malfunctions for some reason. As indicated, in the event of an unrecoverable error or disk failure, it becomes impossible to retrieve any meaningful data from the failing SSD, and so the problem is commonly addressed by storing data across a group of disks within the whole array of SSDs such that data may be recovered from a subset of the disks in the group containing the failing disk. An erasure code (EC) is an FEC code for the binary erasure channel, which transforms a message of k symbols into a longer message (code word) with n symbols, such that the original message can be recovered from a subset of the n symbols. Popular erasure codes for SSD-based storage appliances and other devices utilizing arrays of SSDs are Redundant Array of Independent Disks (RAID)-5 and RAID-6 configurations.
RAID-5 consists of block-level striping with distributed parity across the array. Upon failure of a single disk, subsequent reads can be calculated from the distributed parity such that no data is lost. RAID-6 extends RAID-5 by adding another parity block; thus it uses block-level striping with two parity blocks distributed across all member disks. EC utilization is expressed here as the ratio of parity disks to the total number of disks in the group, e.g., 4+1 RAID-5 has a 20% overhead (one parity disk in five). Moreover, traditional RAID configurations do not take advantage of the global FTL that is possible with an array of open-channel SSDs.
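The single-parity recovery described above can be sketched for one stripe of a 4+1 group. This is a simplified illustration, not a RAID implementation: the helper names are hypothetical, the parity block is always placed last in the stripe (a real RAID-5 array rotates the parity position from stripe to stripe), and a failed disk is represented by `None`.

```python
from functools import reduce
from typing import List, Optional, Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks: Sequence[bytes]) -> List[bytes]:
    """Return the k data blocks followed by one XOR parity block (n = k + 1)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe: Sequence[Optional[bytes]]) -> List[bytes]:
    """Recover a stripe in which at most one block (data or parity) is missing."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity cannot recover more than one lost block")
    repaired = list(stripe)
    if missing:
        # The XOR of all surviving blocks equals the lost block.
        survivors = [block for block in stripe if block is not None]
        repaired[missing[0]] = xor_blocks(survivors)
    return repaired


def parity_overhead(data_disks: int, parity_disks: int) -> float:
    """Fraction of the disks in the group that hold parity."""
    return parity_disks / (data_disks + parity_disks)
```

For example, `rebuild` applied to a stripe with its second block replaced by `None` returns the original stripe, and `parity_overhead(4, 1)` gives 0.2, the 20% figure quoted for 4+1 RAID-5. Losing two blocks of the same stripe raises an error in this sketch, which is the case the second parity block of RAID-6 is intended to cover.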
There is, therefore, an unmet demand for a pool-level or global ECC mechanism for an array of SSDs within a storage appliance with reduced parity overhead, and for global ECC to be integrated into the global FTL for an array of open-channel SSDs within a storage appliance to increase operational efficiency.