Data deduplication, also called intelligent compression or single-instance storage, is a storage technology that automatically searches for duplicate data, retains only one unique copy of any data that has duplicate copies, and replaces each of the other copies with a pointer to that single copy, thereby eliminating redundant data and reducing storage capacity requirements.
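The idea of retaining one copy and replacing duplicates with pointers can be illustrated with a minimal sketch. It is not taken from any particular product: the class name `SingleInstanceStore`, the fixed 4-byte block size, and the use of SHA-256 as the fingerprint are all assumptions chosen for illustration.

```python
import hashlib


class SingleInstanceStore:
    """Toy single-instance store: one retained copy per unique block."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> the single retained copy
        self.files = {}   # file name -> list of fingerprints (the "pointers")

    def write(self, name, data, block_size=4):
        # block_size is an illustrative assumption; real systems use far larger blocks.
        pointers = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            fp = hashlib.sha256(block).hexdigest()
            # Store the block only if no identical copy is already retained.
            self.blocks.setdefault(fp, block)
            pointers.append(fp)
        self.files[name] = pointers

    def read(self, name):
        # Follow the pointers back to the retained copies.
        return b"".join(self.blocks[fp] for fp in self.files[name])


store = SingleInstanceStore()
store.write("a.txt", b"AAAABBBBAAAA")
store.write("b.txt", b"BBBBCCCC")
```

In this sketch the two files contain five logical blocks, but only three unique blocks are stored; the repeated blocks are represented by repeated fingerprints in `store.files`.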
In the prior art, data deduplication is widely applied in environments such as backup and virtual desktops. A data processing system includes multiple storage nodes (also referred to as physical nodes), each of which has its own deduplication processing engine and storage medium, such as a hard disk. When data needs to be written into a file, the data is divided in a cache into multiple data blocks. A fingerprint value is calculated for each data block, and a sample of these fingerprint values is sent to all physical nodes in the data processing system for query. From the query results, the physical node that holds the largest quantity of duplicate fingerprint values is selected as the target physical node, and information about all data blocks in the data group corresponding to the sampled metadata information is sent to the target physical node for duplicate data query.
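The prior-art flow described above (divide, fingerprint, sample, query every node, route to the best match) can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of any specific system: each node's index is modeled as an in-memory set, SHA-256 stands in for the fingerprint function, the sample takes every second fingerprint, and all function names are hypothetical.

```python
import hashlib


def fingerprint(block):
    # SHA-256 is an assumed fingerprint function.
    return hashlib.sha256(block).hexdigest()


def split_blocks(data, block_size=4):
    # Fixed-size division; the block size is illustrative only.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def sample(fingerprints, every=2):
    # Assumed sampling rule: keep every `every`-th fingerprint.
    return fingerprints[::every]


def pick_target_node(nodes, sampled):
    # Query ALL nodes with the sample and count the duplicates each one holds.
    hits = {node_id: sum(fp in index for fp in sampled)
            for node_id, index in nodes.items()}
    # The node with the largest quantity of duplicate fingerprints is the target.
    return max(hits, key=hits.get), hits


# Each node's fingerprint index, modeled as a set (an assumption).
nodes = {
    "node0": {fingerprint(b"AAAA")},
    "node1": {fingerprint(b"CCCC"), fingerprint(b"DDDD"), fingerprint(b"EEEE")},
    "node2": set(),
}

fps = [fingerprint(b) for b in split_blocks(b"CCCCDDDDEEEEFFFF")]
target, hits = pick_target_node(nodes, sample(fps))

# Only the target node receives the full data group for duplicate data query.
duplicates = [fp for fp in fps if fp in nodes[target]]
```

Here the sample matches two fingerprints on `node1` and none elsewhere, so `node1` becomes the target and the full query finds three of the four blocks already stored there. Note that `pick_target_node` touches every node for every data group.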
In the cluster deduplication technology of the prior art, a fingerprint value sample needs to be sent to all physical nodes for query, which causes excessive interaction among the physical nodes during deduplication. When the data processing system contains a large quantity of physical nodes, the calculation amount incurred when each physical node executes deduplication increases with the quantity of physical nodes in the system, thereby degrading the deduplication performance of the system.
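The scaling problem can be made concrete with a rough count of sample queries. The sketch below assumes one query message per data group per physical node; the function name and the figure of 1000 data groups are illustrative assumptions, not measurements.

```python
def sample_queries(num_groups, num_nodes):
    # Each data group's fingerprint sample is sent to every physical node.
    return num_groups * num_nodes


for n in (4, 16, 64):
    print(n, sample_queries(1000, n))
```

Under this assumption the query count grows linearly with the quantity of physical nodes for a fixed amount of incoming data.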