Data deduplication seeks to remove redundancy within a data set by representing an original set of symbols in a smaller set of code symbols. By representing data with a reduced number of code symbols, data storage space and communication capacity usage are improved, which may in turn reduce cost. However, the lack of redundancy in deduplicated data causes some unique data identified during deduplication to be less protected than other unique data with respect to storage media failures, errors, erasures, or other loss. Over time, some unique data may become more or less valuable than other unique data. For example, one piece of unique data may be used to recreate hundreds of documents while another piece of unique data may only be used to recreate a single document.
Data de-duplication reduces the storage space requirements and improves the performance of data storage operations by eliminating duplicate copies of repeating data. De-duplication may involve dividing a larger piece of data into smaller pieces of data. Larger pieces of data may be referred to as “blocks” while the smaller pieces of data may be referred to as “sub-blocks” or “chunks”. Dividing blocks into sub-blocks or chunks may be referred to as “chunking”. De-duplication may be referred to as “dedupe”.
One type of chunk-based de-duplication is inline de-duplication. Inline chunk-based de-duplication may de-duplicate data based on variable sized chunks before the de-duplicated data is written to a storage device. For a backup application, no knowledge of the backed-up data format is needed when using inline chunk-based de-duplication. One of the challenges of chunk-based de-duplication is the identification of repeating patterns. Once a sub-block has been created, there are different approaches for determining whether the sub-block is a duplicate sub-block, whether the sub-block can be represented using a delta representation, whether the sub-block is a unique sub-block, and so on. One approach for determining whether a sub-block is unique involves hashing the sub-block and comparing the hash to hashes associated with previously encountered and/or stored sub-blocks. Different hash functions may classify a different number of sub-blocks as unique due, for example, to the collision rate associated with each hash function. Different hashing approaches may therefore have different performance levels and may yield different amounts of data reduction. Conventional approaches to de-duplication have typically preferred strong hash functions amenable to simple implementation in order to minimize collisions.
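By way of illustration, the hash-and-compare approach described above may be sketched as follows. This is a minimal sketch rather than a definitive implementation. The function names `chunk_fixed` and `classify_chunks` are hypothetical, SHA-256 is assumed as the strong hash function, and fixed-size chunking is assumed for brevity even though inline de-duplication may use variable sized chunks.

```python
import hashlib


def chunk_fixed(data: bytes, size: int = 4):
    """Split data into fixed-size chunks (an assumed simplification)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def classify_chunks(chunks):
    """Label each chunk 'unique' or 'duplicate' by comparing its SHA-256
    digest to the digests of previously encountered chunks."""
    seen = set()
    labels = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).digest()
        if digest in seen:
            labels.append("duplicate")
        else:
            seen.add(digest)
            labels.append("unique")
    return labels


chunks = chunk_fixed(b"ABCDABCDEFGHABCD")
labels = classify_chunks(chunks)
```

In this sketch only the chunks labeled unique would need to be written to the storage device, while duplicate chunks would be replaced by references to chunks already stored.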
Conventionally, chunks are tagged with specific indexes or IDs based on a relatively stronger class of hash functions. A strong hash function is a hash function that, for a given pair of keys, has a low probability of hashing to the same index. A weak hash function is a hash function that, for a given pair of keys, has a higher probability of hashing to the same index. In a deduplication system, hash-based IDs are typically stored in a hash table or a chunk index (ID) table. The hash table or chunk ID table stores reference counts and pointers to where unique chunks are stored on disk or other storage medium. Since IDs are compared to identify identical chunks to be stored or not stored, hash collisions may compromise data integrity. Using larger hash tags can reduce the frequency of hash collisions, but at the cost of consuming more memory. To maintain fast chunk ID lookup, hash tables or chunk index files are typically stored in a medium that has faster access than the medium in which the de-duplicated data will be stored. For example, the hash table may be stored in Random Access Memory (RAM) while the de-duplicated data may be stored on disk or tape. However, at large scales, the hash table or chunk ID table containing the chunk IDs may increase in size and eventually exceed the amount of RAM available. When this happens, the remaining chunk ID table data is paged to disk or other storage media that has a slower access time than RAM. Paging the remaining index data to disk may cause delays and reduce de-duplication throughput. This is known as the chunk look-up disk bottleneck problem.
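By way of illustration, a chunk ID table of the kind described above may be sketched as follows. This is a sketch under stated assumptions, not a description of any particular system. The names `IndexEntry`, `ChunkIndex`, and `index_size_bytes` are hypothetical, SHA-256 digests are assumed as chunk IDs, an in-memory byte array stands in for the disk, and the 48-byte entry size used in the estimate (a 32-byte ID, an 8-byte pointer, and an 8-byte reference count) is an assumption.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class IndexEntry:
    offset: int    # pointer to where the unique chunk is stored
    refcount: int  # number of references to the stored chunk


class ChunkIndex:
    def __init__(self):
        self.table = {}           # chunk ID -> IndexEntry (held in RAM)
        self.store = bytearray()  # stand-in for the slower storage medium

    def put(self, chunk: bytes) -> bytes:
        """Store a chunk once; later copies only increment the reference count."""
        chunk_id = hashlib.sha256(chunk).digest()
        entry = self.table.get(chunk_id)
        if entry is None:
            self.table[chunk_id] = IndexEntry(offset=len(self.store), refcount=1)
            self.store += chunk
        else:
            entry.refcount += 1
        return chunk_id


def index_size_bytes(unique_data_bytes: int, avg_chunk_bytes: int, entry_bytes: int) -> int:
    """Estimate the table size when every chunk of the given data is unique."""
    return (unique_data_bytes // avg_chunk_bytes) * entry_bytes


index = ChunkIndex()
first_id = index.put(b"aaaa")
index.put(b"bbbb")
index.put(b"aaaa")  # duplicate: no new data is written

# Assumed figures: 1 PiB of unique data, 8 KiB chunks, 48-byte entries.
estimate = index_size_bytes(2**50, 2**13, 48)
```

Under these assumed figures the table would occupy 6 TiB, which illustrates how a chunk ID table may outgrow the RAM available at large scales.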
Erasure coding creates additional redundant data to produce code symbols that protect against ‘erasures’ where data portions that are lost can be reconstructed from the surviving data. Adding redundancy introduces overhead that consumes more storage capacity or transmission bandwidth, which in turn adds cost. The overhead added by erasure code (EC) processing tends to increase as the protection level provided increases.
An erasure code is a forward error correction (FEC) code for the binary erasure channel. An FEC code transforms a message of k symbols into a longer message of n symbols such that the original message can be recovered from a subset of the n symbols, k and n being integers. The original message may be, for example, a file. The fraction r=k/n is called the code rate, and the fraction k′/k, where k′ denotes the number of symbols required for recovery, is called the reception efficiency. Optimal erasure codes have the property that any k out of the n code word symbols suffice to recover the original message, which corresponds to a reception efficiency of unity. Optimal codes may require extensive memory usage, CPU time, or other resources when n is large and the code rate is low.
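By way of illustration, one of the simplest optimal erasure codes is a single XOR parity code, in which k source symbols are transformed into n=k+1 code symbols and any k of the n code symbols suffice to recover the message. The following is a minimal sketch under that assumption. The function names `xor_bytes`, `encode`, and `recover` are hypothetical, an erased symbol is represented by `None`, and all symbols are assumed to be the same length.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length symbols."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(source):
    """Transform k source symbols into n = k + 1 code symbols by
    appending one parity symbol (the XOR of all source symbols)."""
    return list(source) + [reduce(xor_bytes, source)]


def recover(received):
    """Recover the k source symbols from n code symbols in which
    at most one symbol has been erased (marked as None)."""
    missing = [i for i, s in enumerate(received) if s is None]
    if len(missing) > 1:
        raise ValueError("a single parity symbol repairs only one erasure")
    symbols = list(received)
    if missing:
        survivors = [s for s in received if s is not None]
        # The XOR of all n code symbols is zero, so the erased symbol
        # equals the XOR of the k surviving symbols.
        symbols[missing[0]] = reduce(xor_bytes, survivors)
    return symbols[:-1]  # drop the parity symbol


source = [b"ab", b"cd", b"ef"]
codeword = encode(source)
k, n = len(source), len(codeword)
code_rate = k / n

damaged = list(codeword)
damaged[1] = None  # erase one code symbol
restored = recover(damaged)
```

In this sketch k=3 and n=4, so the code rate is 3/4 and k′=k, giving a reception efficiency of unity. Codes that tolerate more than one erasure generally require more elaborate arithmetic than a single XOR.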
Erasure codes are described in coding theory. Coding theory is the study of the properties of codes and their fitness for a certain purpose (e.g., backing up files). Codes may be used for applications including, for example, data compression, cryptography, error-correction, and network coding. Coding theory involves data compression, which may also be referred to as source coding, and error correction, which may also be referred to as channel coding. Fountain codes are one type of channel erasure code.
Some storage systems may employ rateless erasure code technology (e.g., fountain codes) to provide a flexible level of data redundancy. The appropriate or even optimal level of data redundancy produced using a rateless erasure code system may depend, for example, on the number and type of devices available to the storage system. The actual level of redundancy achieved using a rateless erasure code system may depend, for example, on the difference between the number of readable redundancy blocks (e.g., erasure code symbols) written by the system and the number of redundancy blocks needed to reconstruct the original data. For example, if twenty redundancy blocks are written and only eleven redundancy blocks are needed to reconstruct the original data that was protected by generating and writing the redundancy blocks, then the original data may be reconstructed even if nine of the redundancy blocks are damaged or otherwise unavailable.
Fountain codes have the property that a potentially limitless sequence of code symbols may be generated from a given set of source symbols in a manner that supports ideally recovering the original source symbols from any subset of the code symbols having a size equal to or larger than the number of source symbols. A fountain code may be optimal if the original k source symbols can be recovered from any k encoding symbols, k being an integer. Fountain codes may have efficient encoding and decoding algorithms that support recovering the original k source symbols from any k′ of the encoding symbols with high probability, where k′ is just slightly larger than k (e.g., an overhead or reception efficiency close to unity). A rateless erasure code is distinguished from an erasure code that exhibits a fixed code rate.
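By way of illustration, the rateless property may be sketched with a random linear fountain code, which is assumed here because it is simple to state and is not necessarily the fountain code a given storage system would employ. Each encoded symbol is the XOR of a pseudo-random, non-empty subset of the source symbols, and the decoder solves for the source symbols by Gaussian elimination over GF(2). The function names `encode_symbol` and `decode`, the use of small integers as symbols, and the simulated loss pattern are all assumptions.

```python
import random


def encode_symbol(source, seed):
    """Produce one encoded symbol as (mask, value), where bit i of mask
    indicates that source[i] was XORed into value."""
    k = len(source)
    rng = random.Random(seed)
    mask = 0
    while mask == 0:  # reject the empty subset
        mask = rng.getrandbits(k)
    value = 0
    for i in range(k):
        if (mask >> i) & 1:
            value ^= source[i]
    return mask, value


def decode(symbols, k):
    """Return the k source symbols, or None if the received symbols
    do not yet determine them (rank less than k)."""
    pivots = {}  # highest set bit of a reduced mask -> (mask, value)
    for mask, value in symbols:
        while mask:
            p = mask.bit_length() - 1
            if p not in pivots:
                pivots[p] = (mask, value)
                break
            pmask, pvalue = pivots[p]
            mask ^= pmask
            value ^= pvalue
    if len(pivots) < k:
        return None
    result = [0] * k
    for p in range(k):  # back-substitute from the lowest pivot upward
        mask, value = pivots[p]
        for j in range(p):
            if (mask >> j) & 1:
                value ^= result[j]
        result[p] = value
    return result


source = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88]
k = len(source)
received, decoded = [], None
for seed in range(1000):  # the encoder could continue indefinitely
    if seed % 3 == 0:
        continue  # simulate an erased encoded symbol
    received.append(encode_symbol(source, seed))
    decoded = decode(received, k)
    if decoded is not None:
        break
```

In this sketch the decoder does not depend on which encoded symbols arrive, only on having received enough of them, and the number received when decoding succeeds plays the role of k′.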
An EC system may be described using an A/B notation, where B describes the total number of encoded symbols that can be produced for an input message and A describes the minimum number of the B encoded symbols that are required to recreate the message for which the encoded symbols were produced. By way of illustration, in a 10 of 16 configuration, or EC 10/16, sixteen encoded symbols could be produced. The 16 encoded symbols could be spread across a number of drives, nodes, or geographic locations. The 16 encoded symbols could even be spread across 16 different locations. In the EC 10/16 example, the original message could be reconstructed from 10 verified encoded symbols.
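By way of illustration, the quantities implied by the A/B notation may be computed as follows. This is a minimal sketch. The function name `ec_summary` and the dictionary keys are hypothetical, and the storage expansion figure assumes that each encoded symbol is the same size and that A symbols together are the size of the original message.

```python
from fractions import Fraction


def ec_summary(a: int, b: int):
    """Summarize an EC A/B configuration, where A of the B encoded
    symbols are required to recreate the message."""
    if not 0 < a <= b:
        raise ValueError("A must be positive and no larger than B")
    return {
        "code_rate": Fraction(a, b),          # fraction of stored data that is message
        "storage_expansion": Fraction(b, a),  # stored size relative to message size
        "tolerated_losses": b - a,            # encoded symbols that may be lost
    }


ec_10_16 = ec_summary(10, 16)
ec_11_20 = ec_summary(11, 20)
```

For EC 10/16 this gives a code rate of 5/8, a storage expansion of 8/5, and tolerance of six lost encoded symbols. For the configuration in which twenty redundancy blocks are written and eleven are needed, it gives tolerance of nine lost blocks.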
In a storage system, reliability and efficiency are two main concerns. One of the main objectives of distributed storage or cloud storage is to ensure the reliable protection of user data. The reliability of the protection of user data may be referred to as Partition Tolerance within the Consistency-Availability-Partition Tolerance (CAP) terminology. Many storage systems trade off Availability against Consistency or vice versa, but not against Partition Tolerance. However, reliability and efficiency are often conflicting goals. Greater reliability may be achieved at the cost of reduced efficiency. Higher efficiency may be attained at the cost of reduced reliability. De-duplication is typically used to reduce unwanted redundancy, while erasure coding inserts a controlled redundancy in a data storage system to meet a durability constraint on the data storage system's operation. Thus, conventionally there has been no reason to use deduplication and erasure coding together. Some approaches to using deduplication and erasure coding together include the erasure coding of deduplicated data. However, if a data storage system is suffering from the chunk look-up disk bottleneck problem, resources devoted to erasure coding may go unused while waiting for the chunk look-up disk bottleneck to resolve, deduplication throughput may be reduced, and time may be wasted. Thus, some approaches to using deduplication and erasure coding together offer sub-optimal performance and use of resources.