The disclosed embodiments are directed toward digital decoding and, specifically, to cooperative decoding in hyperscale data clusters.
A hyperscale data center employs storage nodes in the form of storage clusters. A storage cluster may include one or more storage devices organized into storage pools. These storage devices are used to support storage requirements of, for example, network applications.
To support high-performance applications, a level of redundancy is needed to ensure that the failure of a single drive does not negatively impact downstream applications. One approach is to replicate data across drives. Thus, a single item of data is stored on multiple separate, non-overlapping storage devices. The deficiency of this approach is that the amount of storage increases linearly with the amount of data stored. Since the amount of data used by network applications grows exponentially, the number of storage devices needed to support this scheme also grows exponentially and is thus impractical in terms of energy consumption, cost of storage devices, and scalability.
Another approach is to organize storage devices into erasure-coded pools. Erasure-coded pools have the advantage of storing data only once. This reduced storage comes at the cost of additional computational complexity; however, the tradeoff is generally preferable to replicated systems. In an erasure-coded pool, data is segmented into individual symbols. These symbols are then distributed to different storage devices. For example, a data word (ABCD) may be split into separate symbols (A, B, C, D) and stored on four separate drives.
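The segmentation described above can be illustrated with a minimal sketch. The function names `stripe` and `reassemble` and the round-robin placement are assumptions chosen for illustration only; they are not taken from the disclosed embodiments or from any particular storage system.

```python
def stripe(word, drive_count):
    """Split a data word into symbols and place them on drives round-robin.

    Hypothetical helper: each "drive" is modeled as a plain Python list.
    """
    drives = [[] for _ in range(drive_count)]
    for index, symbol in enumerate(word):
        drives[index % drive_count].append(symbol)
    return drives


def reassemble(drives):
    """Read the symbols back in their original order and join them."""
    drive_count = len(drives)
    total = sum(len(drive) for drive in drives)
    return "".join(drives[i % drive_count][i // drive_count] for i in range(total))


# The data word ABCD lands on four separate drives, one symbol per drive.
drives = stripe("ABCD", 4)
print(drives)
print(reassemble(drives))
```

In this sketch no redundancy is present, so the loss of any one drive loses the corresponding symbol.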
To handle drive failures, a level of redundancy is needed to ensure that if one symbol is lost, it can be recovered. To accomplish this, many systems utilize Reed-Solomon (RS) encoding to add parity symbols to a given item of data (e.g., ABCD12, where 1 and 2 are parity symbols). The number of parity symbols dictates how many symbols can be recovered. For example, with two parity symbols, a system can detect two symbol errors and correct one. The data and parity symbols are then distributed to different storage devices, where oftentimes dedicated storage devices are used to store the parity symbols. Current systems generally use hard-decision RS decoding in order to detect and, if possible, correct erasures.
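The recovery of lost symbols from parity can be sketched as follows. This is a simplified Reed-Solomon-style erasure code built on polynomial interpolation over the prime field GF(257); practical systems typically operate over GF(2^8) with optimized decoders, and the prime field is substituted here only to keep the arithmetic short. The names `interpolate`, `encode`, and `recover` are hypothetical. When the positions of the lost symbols are known (erasures, as with a failed drive), as many symbols as there are parity symbols can be rebuilt.

```python
P = 257  # prime modulus; an assumption standing in for GF(2^8)


def interpolate(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    passing through the given (x, y) points, with arithmetic mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, parity_count):
    """Systematic encoding: data symbols sit at x = 0..k-1 and
    parity symbols are evaluations at x = k..k+parity_count-1."""
    points = list(enumerate(data))
    k = len(data)
    parity = [interpolate(points, x) for x in range(k, k + parity_count)]
    return list(data) + parity


def recover(codeword, k):
    """Rebuild the k data symbols; None marks an erased symbol."""
    survivors = [(x, y) for x, y in enumerate(codeword) if y is not None]
    if len(survivors) < k:
        raise ValueError("too many erasures to recover")
    return [interpolate(survivors[:k], x) for x in range(k)]


data = [ord(c) for c in "ABCD"]
codeword = encode(data, 2)          # four data symbols plus two parity symbols
damaged = list(codeword)
damaged[1] = None                   # drive holding B has failed
damaged[3] = None                   # drive holding D has failed
restored = recover(damaged, 4)
print("".join(chr(v) for v in restored))
```

Any four surviving symbols out of the six suffice to rebuild the data word in this sketch; a third erasure causes `recover` to raise an error.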
Current systems additionally utilize further encoding to protect against errors at various levels of the storage hierarchy. One current technique is to use low-density parity-check (LDPC) algorithms to perform error correction on the data at the drive level. In general, these algorithms are not concerned with the form of the data being encoded and decoded. Rather, LDPC codes are used simply to correct drive-level errors or channel errors. The data is then returned to the RS decoder, which performs a hard decoding of the returned data. In some systems, RS decoding is skipped if all drives produce error-free data. Thus, the RS decoder is frequently unused. This results in hardware that consumes power and clock cycles while performing no useful work. Additionally, in current systems, the LDPC decoding employed is a hard-decision decoding. As is known, hard-decision decoding is time-consuming and results in slower responses to read requests.