Data deduplication removes redundancy while erasure coding adds redundancy. Data deduplication represents an original set of symbols in a smaller set of code symbols while erasure coding represents an original set of symbols in a larger set of code symbols. Thus, conventionally there has been no reason to use deduplication and erasure coding together.
Data that is stored or transmitted may be protected against storage media failures or other loss by storing extra copies or by storing additional redundant information. One type of redundancy-based protection involves using erasure coding. Erasure coding generates redundant code symbols that protect against ‘erasures’, so that data portions that are lost can be reconstructed from the surviving data. Adding redundancy introduces overhead that consumes more storage capacity or transmission bandwidth, which in turn adds cost. The overhead added by erasure code processing tends to increase as the protection level provided increases.
While erasure codes increase data storage requirements by introducing additional redundancy, data deduplication seeks to reduce data storage requirements by removing redundancy. Data deduplication removes redundancy within a data set by representing an original set of symbols in a smaller set of code symbols. Representing data with a reduced number of code symbols uses less data storage space and communication capacity, which may in turn reduce cost.
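As an illustration of representing an original set of symbols in a smaller set of code symbols, the following is a minimal sketch only. The fixed chunk size, the SHA-256 fingerprint, and the names `deduplicate`, `reconstruct`, `store`, and `recipe` are assumptions made for the example and are not taken from any particular deduplication system.

```python
import hashlib

def deduplicate(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and keep each distinct chunk once."""
    store = {}    # fingerprint -> chunk bytes (the unique data)
    recipe = []   # ordered fingerprints needed to rebuild the input
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # stored only on first sighting
        recipe.append(fingerprint)
    return store, recipe

def reconstruct(store, recipe) -> bytes:
    """Rebuild the original data from the unique chunks and the recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)

data = b"ABCDABCDABCDWXYZ"          # four chunks, only two of them distinct
store, recipe = deduplicate(data)
unique_bytes = sum(len(chunk) for chunk in store.values())
```

In this example the sixteen input bytes are held as eight bytes of unique chunk data plus a recipe of four fingerprints. Each unique chunk now exists exactly once, so losing it affects every position in the recipe that refers to it.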
The lack of redundancy in deduplicated data causes some unique data identified during deduplication to be less protected than others with respect to storage media failure or other loss. Over time, some unique data may become more or less valuable than other unique data. For example, one piece of unique data may be used to recreate hundreds of documents while another piece of unique data may only be used to recreate a single document. While loss of the unique data that is used for one document would be bad, the loss of the unique data that is used in the hundreds of documents may be worse. In some cases, the loss of the unique data used to recreate even a single document may be catastrophic when the data concerns, for example, user authentication or system security.
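The unequal value of unique data can be made concrete by counting how many documents depend on each unique chunk. This is a hedged sketch: the function `chunk_reference_counts`, the four-byte chunk size, and the sample documents are hypothetical and chosen only to mirror the hundreds-of-documents versus single-document example above.

```python
import hashlib
from collections import Counter

def chunk_reference_counts(documents: dict, chunk_size: int = 4) -> Counter:
    """Count how many distinct documents reference each unique chunk."""
    counts = Counter()
    for data in documents.values():
        # A set, so a chunk repeated inside one document is counted once.
        fingerprints = {
            hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)
        }
        counts.update(fingerprints)
    return counts

# 200 hypothetical reports share the chunk b"HEAD"; one note holds b"SOLO".
documents = {f"report_{i}": b"HEAD" + b"%04d" % i for i in range(200)}
documents["notes"] = b"SOLO"

counts = chunk_reference_counts(documents)
shared = counts[hashlib.sha256(b"HEAD").hexdigest()]    # 200 documents
single = counts[hashlib.sha256(b"SOLO").hexdigest()]    # 1 document
```

A reference count alone does not capture the case where a chunk used by a single document is nonetheless critical, such as authentication data, so a count of this kind would be only one possible input to a protection decision.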
To enhance data protection, different approaches for storing redundant copies of items have been employed. Erasure codes are one such approach. An erasure code is a forward error correction (FEC) code for erasure channels. The FEC facilitates transforming a message of k symbols into a longer message with n symbols so that the original message can be recovered from a subset of the n symbols, k and n being integers, n>k. The symbols may be individual items (e.g., characters, bytes) or groups of items. The original message may be, for example, a file. The fraction r=k/n is called the code rate, and the fraction k′/k, where k′ denotes the number of symbols required for recovery, is called the reception efficiency or coding overhead. Optimal erasure codes have the property that any k out of the n code word symbols are sufficient to recover the original message (i.e., a coding overhead of unity). Optimal codes may require extensive memory usage, CPU time, or other resources when n is large. Erasure coding approaches may seek to create the greatest level of protection with the least amount of overhead via optimal or near optimal coding. Different types of erasure codes have different efficiencies and tradeoffs in terms of complexity, resources, and performance.
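The simplest optimal erasure code is a single XOR parity symbol, which gives n = k + 1. The sketch below is illustrative only and is not presented as the coding scheme of any particular system; the function names `encode` and `decode` and the sample message are assumptions made for the example.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length symbols."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(source):
    """Turn k source symbols into n = k + 1 code symbols (source + parity)."""
    return list(source) + [reduce(xor_bytes, source)]

def decode(received, k):
    """Recover the k source symbols; `received` marks an erasure as None."""
    missing = [i for i, symbol in enumerate(received) if symbol is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one erasure")
    symbols = list(received)
    if missing:
        survivors = [s for s in received if s is not None]
        # XOR of the k surviving symbols equals the erased symbol.
        symbols[missing[0]] = reduce(xor_bytes, survivors)
    return symbols[:k]

source = [b"era", b"sur", b"e!!"]    # k = 3 symbols of three bytes each
code = encode(source)                # n = 4 code symbols
k, n = len(source), len(code)
code_rate = k / n                    # r = k/n = 0.75

damaged = list(code)
damaged[1] = None                    # simulate the erasure of one symbol
recovered = decode(damaged, k)
```

Because any k = 3 of the n = 4 symbols suffice, k′ = k and the coding overhead k′/k is unity. Tolerating more than one erasure requires a stronger code, such as a Reed-Solomon code, at a higher computational cost.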
Erasure codes are described in coding theory. Coding theory is the study of the properties of codes and their fitness for a certain purpose (e.g., backing up files). Codes may be used for applications including, for example, data compression, cryptography, error-correction, and network coding. Coding theory involves data compression, which may also be referred to as source coding, and error correction, which may also be referred to as channel coding. Fountain codes are one type of erasure code.
Fountain codes have the property that a potentially limitless sequence of code symbols may be generated from a given set of source symbols in a manner that supports ideally recovering the original source symbols from any subset of the code symbols having a size equal to or larger than the number of source symbols. A fountain code may be optimal if the original k source symbols can be recovered from any k encoding symbols, k being an integer. Fountain codes may have efficient encoding and decoding algorithms that support recovering the original k source symbols from any k′ of the encoding symbols with high probability, where k′ is just slightly larger than k (i.e., an overhead close to unity). Because the number of code symbols is not fixed in advance, fountain codes are rateless erasure codes. A rateless erasure code is distinguished from an erasure code that exhibits a fixed code rate.
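The rateless property can be sketched with a random linear fountain over GF(2), in which each code symbol is the XOR of a random subset of the source symbols. This is a teaching example under stated assumptions and not an LT or Raptor code: decoding uses plain Gaussian elimination, which is far less efficient than practical fountain decoders, and the names `fountain` and `decode`, the seed, and the one-byte symbols are all hypothetical.

```python
import random
from itertools import islice

def fountain(source, seed=0):
    """Yield an endless stream of (mask, value) code symbols.

    Bit i of `mask` is set when source[i] was XORed into `value`.
    """
    rng = random.Random(seed)
    k = len(source)
    while True:
        mask = rng.randrange(1, 1 << k)      # random non-empty subset
        value = 0
        for i in range(k):
            if (mask >> i) & 1:
                value ^= source[i]
        yield mask, value

def decode(symbols, k):
    """Gaussian elimination over GF(2); returns (source or None, symbols read)."""
    pivots = {}                              # highest set bit -> (mask, value)
    used = 0
    for mask, value in symbols:
        used += 1
        while mask:                          # reduce against known pivots
            p = mask.bit_length() - 1
            if p not in pivots:
                pivots[p] = (mask, value)    # linearly independent symbol
                break
            pivot_mask, pivot_value = pivots[p]
            mask ^= pivot_mask
            value ^= pivot_value
        if len(pivots) == k:
            break                            # full rank: stop reading
    if len(pivots) < k:
        return None, used
    source = [0] * k
    for p in range(k):                       # back-substitute, low bits first
        mask, value = pivots[p]
        for j in range(p):
            if (mask >> j) & 1:
                value ^= source[j]
        source[p] = value
    return source, used

source = list(b"fountain")                   # k = 8 one-byte source symbols
k = len(source)
# Simulate erasures by dropping every third symbol of the stream.
surviving = (s for i, s in enumerate(fountain(source, seed=1)) if i % 3 != 0)
decoded, used = decode(surviving, k)
partial, seen = decode(islice(fountain(source, seed=1), 2), k)  # too few symbols
```

The decoder does not care which symbols were lost, only that enough linearly independent symbols arrive. The number of symbols read is at least k and, for a random linear code of this kind, is typically only a few more than k.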
Storage systems may employ rateless erasure code technology (e.g., fountain codes) to provide a flexible level of data redundancy. The appropriate or even optimal level of data redundancy produced using a rateless erasure code system may depend, for example, on the number and type of devices available to the storage system. The actual level of redundancy achieved using a rateless erasure code (EC) system may depend, for example, on the difference between the number of readable redundancy blocks (e.g., erasure code symbols) written by the system and the number of redundancy blocks needed to reconstruct the original data. For example, if twenty redundancy blocks are written and only eleven redundancy blocks are needed to reconstruct the original data that was protected by generating and writing the redundancy blocks, then the original data may be reconstructed even if nine of the redundancy blocks are damaged or otherwise unavailable.
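The arithmetic in the twenty-block example can be written out as a short sketch. The function names `tolerable_losses` and `can_reconstruct` are hypothetical and used only for illustration.

```python
def tolerable_losses(blocks_written: int, blocks_needed: int) -> int:
    """Number of redundancy blocks that may be lost without losing data."""
    if blocks_needed > blocks_written:
        raise ValueError("fewer blocks written than needed for reconstruction")
    return blocks_written - blocks_needed

def can_reconstruct(blocks_written: int, blocks_needed: int,
                    blocks_unreadable: int) -> bool:
    """True when enough readable blocks remain to rebuild the original data."""
    return blocks_written - blocks_unreadable >= blocks_needed

margin = tolerable_losses(20, 11)            # 20 written, 11 needed -> 9
at_limit = can_reconstruct(20, 11, 9)        # nine lost: still recoverable
past_limit = can_reconstruct(20, 11, 10)     # ten lost: not recoverable
```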
An EC system may be described using an A/B notation, where B describes the total number of encoded symbols that can be produced for an input message and A describes the minimum number of the B encoded symbols that are required to recreate the message for which the encoded symbols were produced. By way of illustration, in a 10 of 16 configuration, or EC 10/16, sixteen encoded symbols could be produced. The 16 encoded symbols could be spread across a number of drives, nodes, or geographic locations. The 16 encoded symbols could even be spread across 16 different locations. In the EC 10/16 example, the original message could be reconstructed from 10 verified encoded symbols.
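The A/B notation and the effect of spreading symbols across locations can be sketched as follows. The `ECPolicy` type, the round-robin placement in `spread`, and the site names are assumptions made for this example; an actual system may place symbols quite differently.

```python
from typing import NamedTuple

class ECPolicy(NamedTuple):
    needed: int   # A: minimum symbols required to recreate the message
    total: int    # B: total encoded symbols produced

def parse_policy(text: str) -> ECPolicy:
    """Parse an 'A/B' string such as '10/16'."""
    a, b = text.split("/")
    policy = ECPolicy(int(a), int(b))
    if not 0 < policy.needed <= policy.total:
        raise ValueError("require 0 < A <= B")
    return policy

def spread(policy: ECPolicy, locations):
    """Assign symbol indices 0..B-1 to locations in round-robin order."""
    placement = {location: [] for location in locations}
    for index in range(policy.total):
        placement[locations[index % len(locations)]].append(index)
    return placement

def recoverable(policy: ECPolicy, placement, failed) -> bool:
    """True when the symbols at surviving locations number at least A."""
    surviving = sum(len(ids) for loc, ids in placement.items()
                    if loc not in failed)
    return surviving >= policy.needed

policy = parse_policy("10/16")
four_sites = spread(policy, ["site0", "site1", "site2", "site3"])
sixteen_sites = spread(policy, [f"site{i}" for i in range(16)])

one_of_four_down = recoverable(policy, four_sites, {"site0"})            # 12 left
two_of_four_down = recoverable(policy, four_sites, {"site0", "site1"})   # 8 left
six_of_sixteen_down = recoverable(
    policy, sixteen_sites, {f"site{i}" for i in range(6)})               # 10 left
```

With only four locations, each location holds four symbols, so losing two locations leaves eight symbols, fewer than the ten required. With sixteen locations, any six may fail.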