As technology develops, the amount of information in society is increasing sharply. The resulting growth in the amount of data that needs to be stored, and the consequent increases in storage capacity and storage costs, have become an important problem that an enterprise needs to consider. A data de-duplication technology stores only a single unique instance of data that appears many times in stored data, which effectively reduces the required storage capacity, and therefore the storage costs, in scenarios such as data backup. Within the data de-duplication technology, using multiple nodes to perform data de-duplication concurrently has proved to be an effective method of accelerating the processing rate and improving the performance of the data de-duplication.
In a multi-node data de-duplication solution, determining whether a data block is a duplicate requires querying all block records to confirm whether the same data already exists. Therefore, the querying takes a long time when a large amount of data is being de-duplicated. To improve the performance of data de-duplication, the de-duplication can be performed within a group: each block of a file is compared only with the blocks in a group that has a relatively high similarity with the file. In this way, only the block records in the group need to be queried when a duplicate data block is searched for, and the performance of the data de-duplication is improved at the cost of a limited reduction in the de-duplication rate.
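The trade-off described above can be illustrated with a minimal sketch. It is not taken from the source: fixed-size blocks, SHA-256 block fingerprints, and the helper names `block_fingerprints` and `dedup_in_group` are all assumptions chosen for illustration.

```python
from __future__ import annotations

import hashlib


def block_fingerprints(data: bytes, block_size: int = 4) -> list[str]:
    """Split data into fixed-size blocks and hash each one (assumed scheme)."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def dedup_in_group(
    data: bytes, group_records: set[str], block_size: int = 4
) -> tuple[int, int]:
    """Store only blocks whose fingerprint is not yet recorded in the group.

    Returns (new_blocks, duplicate_blocks). Only this group's block records
    are queried, so duplicates held by other groups go undetected.
    """
    new_blocks = 0
    duplicates = 0
    for fp in block_fingerprints(data, block_size):
        if fp in group_records:
            duplicates += 1          # block already stored in this group
        else:
            group_records.add(fp)    # first instance: record and store it
            new_blocks += 1
    return new_blocks, duplicates


# Hypothetical block-record sets, one per group.
groups: dict[str, set[str]] = {"g1": set(), "g2": set()}
first = dedup_in_group(b"AAAABBBB", groups["g1"])   # (2, 0)
second = dedup_in_group(b"AAAACCCC", groups["g1"])  # (1, 1)
missed = dedup_in_group(b"AAAA", groups["g2"])      # (1, 0)
```

In this sketch the last call stores the block `AAAA` a second time because group `g2` never consults the records of `g1`. That is the limited loss in de-duplication rate accepted in exchange for a smaller set of records to query.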
Although group-based multi-node data de-duplication reduces data querying time, determining the group for a file requires a similarity analysis in which the fingerprint of the file is queried and matched against the fingerprints of all groups. In addition, to ensure querying accuracy, the file that saves the group fingerprints needs to be locked while the similarity analysis is performed on each file. As a result, multiple nodes cannot perform the matching query concurrently, which is a performance bottleneck of group-based multi-node data de-duplication.
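The bottleneck can be sketched as follows. The source does not specify how fingerprints are represented or compared, so this example assumes a fingerprint is a set of block hashes, uses Jaccard similarity with a threshold of 0.5, and lets an in-process `threading.Lock` stand in for the lock on the file that saves the group fingerprints. The names `jaccard` and `assign_group` are hypothetical.

```python
from __future__ import annotations

import threading


def jaccard(a: set[str], b: set[str]) -> float:
    """Similarity of two fingerprints, modelled as sets (an assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_group(
    file_fp: set[str],
    group_fps: dict[int, set[str]],
    store_lock: threading.Lock,
    threshold: float = 0.5,
) -> int:
    """Pick the most similar group for a file, or create a new group.

    The whole scan runs under one lock, so concurrent callers are
    serialized even though each of them only needs to read most entries.
    """
    with store_lock:
        best_id, best_score = None, 0.0
        for gid, gfp in group_fps.items():       # match against ALL groups
            score = jaccard(file_fp, gfp)
            if score > best_score:
                best_id, best_score = gid, score
        if best_id is None or best_score < threshold:
            best_id = len(group_fps)             # no similar group: add one
            group_fps[best_id] = set()
        group_fps[best_id] |= file_fp            # update the group fingerprint
        return best_id


# Hypothetical shared store and its lock.
group_store: dict[int, set[str]] = {}
lock = threading.Lock()
g_a = assign_group({"a", "b", "c"}, group_store, lock)  # 0: first group
g_b = assign_group({"a", "b", "d"}, group_store, lock)  # 0: similarity 0.5
g_c = assign_group({"x", "y"}, group_store, lock)       # 1: nothing similar
```

Because every call holds the lock for the duration of a scan over all group fingerprints, adding nodes does not shorten this step in the sketch: the time spent under the lock grows with the number of groups while the callers wait on one another.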