Data deduplication may reduce the amount of storage space used in a single-instance data storage system by detecting and preventing redundant copies of data from being stored to the single-instance data storage system. For example, data deduplication is often used to reduce the amount of storage space needed to maintain backups of an organization's data.
Data deduplication involves identifying redundant copies of the same data. Because of the processing requirements involved in comparing each incoming unit of data with each unit of data that is already stored in a single-instance data storage system, redundant copy identification is usually performed by generating and comparing smaller data signatures (“fingerprints”) of each data unit instead of comparing the data units themselves. The detection of redundant copies generally involves generation of a new fingerprint for each unit of data to be stored to the single-instance data storage system and comparison of the new fingerprint to existing fingerprints of data units already stored by the single-instance data storage system. If the new fingerprint matches an existing fingerprint, a copy of the unit of data is likely already stored in the single-instance data storage system.
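The fingerprint comparison described above can be illustrated with a minimal sketch. It assumes SHA-256 as the fingerprint function and an in-memory dictionary as the fingerprint index; the class name `SingleInstanceStore` and its methods are hypothetical and chosen only for illustration.

```python
import hashlib


class SingleInstanceStore:
    """Toy in-memory store that keeps one copy of each distinct data unit.

    Illustrative assumption: a real single-instance data storage system
    would persist both the data units and the fingerprint index.
    """

    def __init__(self):
        # Maps fingerprint (hex digest) -> stored data unit.
        self._units = {}

    @staticmethod
    def fingerprint(unit: bytes) -> str:
        # SHA-256 is an assumed choice of fingerprint function.
        return hashlib.sha256(unit).hexdigest()

    def put(self, unit: bytes) -> bool:
        """Store the unit unless a matching fingerprint already exists.

        Returns True if the unit was newly stored, False if it was
        treated as a redundant copy.
        """
        fp = self.fingerprint(unit)
        if fp in self._units:
            return False
        self._units[fp] = unit
        return True

    def __len__(self):
        return len(self._units)


store = SingleInstanceStore()
results = [store.put(u) for u in (b"block A", b"block B", b"block A")]
```

In this sketch, the third data unit is rejected because its fingerprint matches that of the first, so only two data units are stored.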
Existing data deduplication techniques often consume significant computing resources, especially for single-instance data storage systems storing large amounts of data. For example, client-side deduplication techniques, in which data is hashed on the client and only non-redundant data is sent to the server, may require a significant amount of client memory and/or processing resources. In order to reduce client-side resource consumption, server-side deduplication may be implemented, in which all data is transferred to the server, where the data is hashed and only non-redundant data is stored to a single-instance data storage system. However, server-side deduplication may consume a significant amount of network bandwidth and server memory and/or processing resources.
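The difference in network usage between the two approaches can be sketched as follows. This is a simplified model under stated assumptions: the server's fingerprint index is a plain dictionary, fingerprints are SHA-256 hex digests, and the only network cost counted is the bytes of data units and fingerprints sent from client to server. The function names are hypothetical.

```python
import hashlib


def fingerprint(unit: bytes) -> str:
    # SHA-256 is an assumed choice of fingerprint function.
    return hashlib.sha256(unit).hexdigest()


def server_side_backup(units, server_index):
    """Send every unit; the server hashes it and stores only new units.

    Returns the number of bytes sent over the (simulated) network.
    """
    sent = 0
    for unit in units:
        sent += len(unit)            # every unit crosses the network
        fp = fingerprint(unit)       # hashing cost is paid by the server
        server_index.setdefault(fp, unit)
    return sent


def client_side_backup(units, server_index):
    """Hash on the client; send a unit only if its fingerprint is unknown.

    Returns the number of bytes sent over the (simulated) network.
    """
    sent = 0
    for unit in units:
        fp = fingerprint(unit)       # hashing cost is paid by the client
        sent += len(fp)              # fingerprint lookup request
        if fp not in server_index:
            sent += len(unit)        # only non-redundant units are sent
            server_index[fp] = unit
    return sent


units = [b"x" * 4096, b"y" * 4096, b"x" * 4096]
server_bytes = server_side_backup(units, {})
client_bytes = client_side_backup(units, {})
```

With one redundant unit out of three, the client-side path sends fewer bytes in this model. With entirely non-redundant data it would send slightly more, since each unit is accompanied by a fingerprint lookup.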
Traditional deduplication systems typically implement either client-side deduplication or server-side deduplication. Unfortunately, choosing between client-side and server-side deduplication may be difficult because the performance of both processes can be unpredictable. For example, the performance of both processes may vary based on, among other factors, client load (e.g., other applications may run concurrently with deduplication), server load (e.g., multiple data streams may be concurrently ingested by the server), network load (e.g., concurrent network traffic), and/or the type (e.g., redundant or non-redundant) of data being deduplicated. Therefore, neither deduplication process can be relied upon to outperform the other in every situation.