Data deduplication is a technique that may be used to reduce the amount of storage space used in a single-instance data storage system by detecting redundant copies of data and preventing them from being stored to the single-instance data storage system. For example, data deduplication is often used to reduce the amount of storage space needed to maintain backups of an organization's data.
To perform data deduplication, a system needs to be able to identify redundant copies of the same data. Because comparing each incoming unit of data with every unit of data already stored in a single-instance data storage system is computationally expensive, the detection is usually performed by generating and comparing smaller data signatures (“fingerprints”) of the data units instead of comparing the data units themselves. The detection generally involves generating a new fingerprint for each unit of data to be stored to the single-instance data storage system and comparing the new fingerprint to existing fingerprints of data units already stored by the single-instance data storage system. If the new fingerprint matches an existing fingerprint, a copy of the unit of data is likely already stored in the single-instance data storage system.
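The fingerprint comparison described above can be illustrated with a minimal sketch. It assumes SHA-256 as the fingerprint function and an in-memory dictionary as the fingerprint index; the class name `SingleInstanceStore` and its methods are hypothetical and are not taken from any particular system.

```python
import hashlib
from typing import Dict, Tuple


class SingleInstanceStore:
    """Hypothetical in-memory single-instance store keyed by fingerprint."""

    def __init__(self) -> None:
        # Maps fingerprint -> data unit. Assumption: a real system would
        # keep this index on disk or in a database, not in memory.
        self._units: Dict[str, bytes] = {}

    @staticmethod
    def fingerprint(unit: bytes) -> str:
        # Assumption: SHA-256 serves as the fingerprint function.
        return hashlib.sha256(unit).hexdigest()

    def store(self, unit: bytes) -> Tuple[str, bool]:
        """Store a unit unless its fingerprint is already present.

        Returns the fingerprint and a flag that is True only when the
        unit was newly stored.
        """
        fp = self.fingerprint(unit)
        if fp in self._units:
            # Matching fingerprint: a copy is likely already stored.
            return fp, False
        self._units[fp] = unit
        return fp, True

    def unit_count(self) -> int:
        return len(self._units)


store = SingleInstanceStore()
newly_stored = [store.store(u)[1] for u in (b"alpha", b"beta", b"alpha")]
```

In this sketch the second `b"alpha"` is not stored again, so the store holds two units after three store requests. A fingerprint match only indicates that a copy is likely present; a system that cannot tolerate fingerprint collisions might additionally compare the data units themselves on a match.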
Existing data deduplication techniques often require significant computing resources, especially for single-instance data storage systems storing large amounts of data and/or for requests to store large volumes of data to a single-instance data storage system. For example, existing client-side data deduplication techniques often use significant bandwidth resources to transport large numbers of queries from a client-side device to a server-side single-instance data storage system. In particular, with existing data deduplication techniques, initialization of a client-side cache to be used for client-side data deduplication may require that numerous queries be transmitted to the server-side single-instance data storage system.
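The query cost of initializing a client-side cache can be illustrated with a minimal sketch. It assumes, purely for illustration, that the client sends one query per distinct fingerprint and that SHA-256 is the fingerprint function; `Server`, `has_fingerprint`, and `warm_client_cache` are hypothetical names and do not describe any specific existing system.

```python
import hashlib
from typing import Dict, Iterable


class Server:
    """Hypothetical server-side store that counts the queries it receives."""

    def __init__(self, stored_units: Iterable[bytes]) -> None:
        self._fingerprints = {
            hashlib.sha256(u).hexdigest() for u in stored_units
        }
        self.queries_received = 0

    def has_fingerprint(self, fp: str) -> bool:
        # Each call stands in for one network round trip.
        self.queries_received += 1
        return fp in self._fingerprints


def warm_client_cache(server: Server, local_units: Iterable[bytes]) -> Dict[str, bool]:
    """Build a client-side cache by querying the server per fingerprint.

    Assumption: one query is sent for each distinct fingerprint seen
    on the client.
    """
    cache: Dict[str, bool] = {}
    for unit in local_units:
        fp = hashlib.sha256(unit).hexdigest()
        if fp not in cache:
            cache[fp] = server.has_fingerprint(fp)
    return cache


server = Server([b"a", b"b"])
cache = warm_client_cache(server, [b"a", b"b", b"c", b"a"])
```

Under these assumptions the number of queries grows with the number of distinct data units on the client (three queries for the four units above), which suggests why cache initialization may consume significant bandwidth when the volume of data is large.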