As data volumes keep growing, providing sufficient storage capacity has become a critical challenge in the storage field. At present, one manner of addressing this challenge is deduplication, which exploits the redundancy of the data that needs to be stored so as to reduce the amount of data actually stored.
In a prior-art algorithm for eliminating duplicate data based on a content defined chunk (CDC), a data stream to be stored is first divided into multiple data chunks. To divide a data stream into data chunks, suitable dividing points need to be found in the data stream, and the data between two adjacent dividing points forms one data chunk. A feature value of each data chunk is then calculated and compared with the feature values of other data chunks. If data chunks having the same feature value are found, duplicate data is regarded as existing.
Specifically, in the technology of eliminating duplicate data based on a content defined chunk, a sliding window technique is applied to search for the dividing points of chunks based on the content of a file; that is, a Rabin fingerprint of the data in a window is calculated to determine whether a position is a data stream dividing point. It is assumed that dividing points are searched for from left to right in the data stream. At each window position, a fingerprint of the data in the sliding window is calculated, a modulo operation is performed on the fingerprint value using a given integer K, and the result of the modulo operation is compared with a given remainder R. If the result equals R, the right end of the window is a data stream dividing point. Otherwise, the window is slid rightward by one byte, and the calculation and comparison are repeated until the end of the data stream is reached.
Searching for data stream dividing points in this manner consumes a large quantity of computing resources, which therefore becomes a bottleneck in improving the performance of eliminating duplicate data.
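The dividing-point search and duplicate check described above can be illustrated with a minimal Python sketch. It is not the prior-art implementation itself. A simple polynomial rolling hash stands in for a true Rabin fingerprint, SHA-256 stands in for the unspecified feature value, and the window width, K, R, the hash base, and the hash modulus are arbitrary values chosen for illustration. All function names are hypothetical. The sketch keeps sliding by one byte after a dividing point is found and enforces no minimum or maximum chunk size, which is a simplifying assumption.

```python
import hashlib
from typing import Dict, List, Tuple

WINDOW = 16          # sliding-window width in bytes (assumed value)
K = 64               # given integer used for the modulo operation (assumed value)
R = 7                # given remainder to compare against (assumed value)
BASE = 257           # base of the stand-in polynomial rolling hash (assumed)
MOD = (1 << 61) - 1  # large prime that keeps the hash bounded (assumed)


def find_dividing_points(data: bytes) -> List[int]:
    """Return offsets just past the right end of every window whose
    fingerprint satisfies fingerprint % K == R."""
    points: List[int] = []
    if len(data) < WINDOW:
        return points
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    fp = 0
    for b in data[:WINDOW]:           # fingerprint of the first window
        fp = (fp * BASE + b) % MOD
    pos = WINDOW                      # exclusive right end of the window
    while True:
        if fp % K == R:
            points.append(pos)        # right end of window is a dividing point
        if pos == len(data):
            break
        # Slide the window rightward by one byte: drop the leftmost byte,
        # shift the remaining bytes, and append the new rightmost byte.
        fp = ((fp - data[pos - WINDOW] * top) * BASE + data[pos]) % MOD
        pos += 1
    return points


def split_chunks(data: bytes) -> List[bytes]:
    """Cut the stream at the dividing points; data between two adjacent
    dividing points forms one chunk."""
    chunks: List[bytes] = []
    start = 0
    for cut in find_dividing_points(data):
        chunks.append(data[start:cut])
        start = cut
    if start < len(data):
        chunks.append(data[start:])   # trailing data after the last point
    return chunks


def deduplicate(chunks: List[bytes]) -> Tuple[Dict[str, bytes], List[str]]:
    """Store each distinct chunk once, keyed by its feature value, and
    return the store plus the ordered list of keys for reconstruction."""
    store: Dict[str, bytes] = {}
    recipe: List[str] = []
    for chunk in chunks:
        feature = hashlib.sha256(chunk).hexdigest()  # stand-in feature value
        store.setdefault(feature, chunk)             # keep first copy only
        recipe.append(feature)
    return store, recipe
```

Calling `deduplicate(split_chunks(stream))` on a byte string `stream` yields a store of unique chunks and a recipe from which the stream can be rebuilt. The per-byte fingerprint update and comparison in `find_dividing_points` is the step that the passage above identifies as the computational bottleneck.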