Data deduplication, also known as duplicate data elimination and called deduplication or duplicate elimination for short, is a process of identifying and eliminating duplicate content in a data set or a data stream to improve the efficiency of data storage and/or data transmission. Generally, a deduplication technology divides a data set or a data stream into a series of data units and retains only one copy of each duplicate data unit, thereby reducing the space cost in a data storage process or the bandwidth consumption in a data transmission process.
How to divide a data object into data units in which duplicate content can be easily identified is a key issue that needs to be solved. After a data object is divided into data units, a hash value h(⋅) of each data unit may be calculated as a fingerprint, and data units with the same fingerprint are defined as duplicate data. In the prior art, commonly used data units for deduplication include a file, a fixed-length block (Block), a content-defined variable-length chunk (Chunk), and the like. A content defined chunking (CDC) method uses a sliding window to scan data, identifies a byte string that complies with a preset characteristic, and marks the position of the byte string as a chunk boundary, so as to divide a data set or a data stream into a sequence of variable-length chunks. Because the method selects chunk boundaries based on a content characteristic of the data, it can more effectively identify data units shared by similar files or data streams, and it is therefore widely applied in various data deduplication solutions. According to research, when the content defined chunking method is used to divide a data set or a data stream, a finer chunking granularity means a higher probability of identifying duplicate data and a better deduplication result. However, a finer chunking granularity also means that a given data set is divided into a larger number of chunks, which increases the indexing overhead and the complexity of searching for duplicate data. As a result, the time efficiency of data deduplication is reduced.
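The fingerprint comparison described above can be illustrated with a minimal sketch. It assumes fixed-length blocks as the data unit and SHA-256 as the hash function h(⋅); neither choice is specified in the text, and the names `fixed_length_blocks`, `deduplicate`, and `restore` are hypothetical.

```python
import hashlib


def fixed_length_blocks(data: bytes, block_size: int = 8):
    """Split data into fixed-length blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def deduplicate(units):
    """Keep one copy of each data unit, keyed by its fingerprint.

    Returns (store, recipe): store maps fingerprint -> unit bytes, and
    recipe is the ordered list of fingerprints needed to rebuild the input.
    """
    store = {}
    recipe = []
    for unit in units:
        # Assumption: h(.) is SHA-256; the text does not name a hash function.
        fingerprint = hashlib.sha256(unit).hexdigest()
        store.setdefault(fingerprint, unit)  # units with the same fingerprint are stored once
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe):
    """Rebuild the original data from the stored units and the recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)
```

In this sketch the space saving comes from `store` holding fewer units than `recipe` has entries; a real system would also have to persist the recipe and handle the (very unlikely) case of hash collisions.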
An expected length is a key parameter with which a content defined chunking (CDC) method controls the chunking granularity. Generally, a CDC method outputs a variable-length chunk sequence for a specific data object, where the lengths of the chunks statistically follow a normal distribution, and the expected length is used to adjust the average value of that distribution. Generally, the average value of the normal distribution is represented by an average chunk length. Because a normally distributed random variable takes its average value with the highest probability, the average chunk length is also called a peak length and, in an ideal circumstance, may equal the expected length. For example, in the CDC method, a fingerprint f(w-bytes) of the data within a sliding window is calculated in real time. When certain bits of f(w-bytes) match a preset value, the position of the sliding window is selected as a chunk boundary. Because an update of data content may result in a random change of the hash fingerprint, if f(w-bytes) & 0xFFF = 0 is set as the match condition, where & is a bitwise AND operation and 0xFFF is the hexadecimal representation of 4095, one fingerprint match is theoretically expected in every 4096 random changes of f(w-bytes); that is, on average a chunk boundary can be found each time the sliding window slides 4 KB (4096 bytes) forward. This chunk length under an ideal circumstance is the expected chunk length (Expected Chunk Length) of the CDC method, and is called the expected length for short.
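The match condition f(w-bytes) & 0xFFF = 0 can be sketched as follows. This is an illustration under stated assumptions, not the method of any particular system: the fingerprint is a simple polynomial rolling hash, and the window size, base, and modulus are assumed values. The names `expected_length` and `cdc_boundaries` are hypothetical, and no minimum or maximum chunk length is enforced.

```python
WINDOW = 48          # w: sliding-window size in bytes (assumed value)
MASK = 0xFFF         # match condition: f & 0xFFF == 0
BASE = 257           # base of the polynomial rolling hash (assumed value)
MOD = (1 << 61) - 1  # modulus of the rolling hash (assumed value)


def expected_length(mask: int) -> int:
    """Expected chunk length when every bit set in mask must be zero.

    Each checked bit halves the match probability, so k checked bits give
    one match per 2**k window positions on random data.
    """
    return 1 << bin(mask).count("1")


def cdc_boundaries(data: bytes, window: int = WINDOW, mask: int = MASK):
    """Return the end offsets (exclusive) of content-defined chunks."""
    boundaries = []
    top = pow(BASE, window - 1, MOD)  # weight of the byte that leaves the window
    f = 0
    for i, byte in enumerate(data):
        if i >= window:
            f = (f - data[i - window] * top) % MOD  # drop the oldest byte
        f = (f * BASE + byte) % MOD                  # take in the newest byte
        # Only test the fingerprint once the window is full.
        if i + 1 >= window and (f & mask) == 0:
            boundaries.append(i + 1)
    if not boundaries or boundaries[-1] != len(data):
        boundaries.append(len(data))  # the last chunk ends at the end of the data
    return boundaries
```

Because the fingerprint depends only on the bytes inside the window, inserting bytes at the front of the data shifts the later boundaries by the inserted length instead of moving them to new content positions, which is what lets unchanged regions still produce identical chunks.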
To reduce the number of chunks as much as possible while maintaining the space efficiency of deduplication, the prior art provides a content defined bimodal chunking method. The core idea of the content defined bimodal chunking method is to adopt a variable-length chunking mode with two different expected lengths: when a file is divided into data chunks, the duplication of candidate chunks is determined by querying a deduplication storage system, and a small-chunk mode is adopted in a region of transition between duplicate data and non-duplicate data, whereas a large-chunk mode is adopted in a non-transitional region.
However, this technology cannot work independently. When determining how to chunk a data object, a chunk computing device needs to frequently query the fingerprints of data chunks existing in a deduplication storage device, where the deduplication storage device stores data chunks on which data deduplication has been performed; determine, according to the duplication of candidate chunks, whether there is a region of transition between duplicate data and non-duplicate data; and then determine which chunking mode is finally adopted. Therefore, the prior art imposes query load pressure on the deduplication storage device.
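The dependency of bimodal chunking on the deduplication storage device can be sketched as follows. This is a simplified illustration, not the prior-art algorithm itself: it assumes that a transition region is any non-duplicate large chunk adjacent to a duplicate large chunk, and the function `bimodal_select` and its callbacks `is_duplicate` and `split_small` are hypothetical names. `is_duplicate` stands in for a fingerprint query to the deduplication storage device.

```python
from typing import Callable, List


def bimodal_select(
    large_chunks: List[bytes],
    is_duplicate: Callable[[bytes], bool],
    split_small: Callable[[bytes], List[bytes]],
) -> List[bytes]:
    """Re-chunk non-duplicate large chunks that border a duplicate one.

    Assumption: a "transition region" is a non-duplicate large chunk whose
    previous or next large chunk is a duplicate.
    """
    # One query to the deduplication storage per candidate chunk.
    dup = [is_duplicate(chunk) for chunk in large_chunks]
    result = []
    for i, chunk in enumerate(large_chunks):
        prev_dup = i > 0 and dup[i - 1]
        next_dup = i + 1 < len(large_chunks) and dup[i + 1]
        if not dup[i] and (prev_dup or next_dup):
            result.extend(split_small(chunk))  # small-chunk mode in the transition region
        else:
            result.append(chunk)               # large-chunk mode elsewhere
    return result
```

Even in this reduced form, the chunking decision cannot be made without one lookup per candidate chunk, which is the source of the query load on the deduplication storage device.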