In many data systems, broadly viewed, a sender (a data source) uploads data to a receiver (a data processor) via a communications channel. An example of such a system is a data storage system; however, these data systems may include any system in which a receiver somehow processes data uploaded from a sender. The uploaded and processed data may include, but is not limited to, any type of textual, graphical, or image data, audio data (e.g., music and voice data), video data, compressed and/or encrypted data, and so on. In many such systems, large amounts of data may need to be uploaded from the sender to the receiver via the communications channel. However, communications channels generally have bandwidth constraints, while a goal of such data systems is to get as much usable data across the communications channel to the receiver as possible.
Data deduplication refers to techniques for reducing or eliminating redundant data in such systems, for example to improve storage utilization in a data storage system and/or to reduce bandwidth usage on the communications channel. As an example, in at least some data deduplication techniques applied to data storage systems, the storage of duplicate data to a data store may be prevented. To achieve this, units of data that already reside in the data store, and/or units of data that do not reside in the data store, may be identified, and only the units that do not reside in the data store are stored or updated in the data store. Data deduplication in this application may thus reduce required storage capacity, since fewer copies, or only one copy, of a particular unit of data are retained.
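As an illustration only, the storage-side behavior described above can be sketched as a small content-addressed store that keeps a single copy of each distinct unit of data. The class name `DedupStore`, its methods, and the choice of SHA-256 as the identifying hash are assumptions made for this sketch and are not taken from the description.

```python
import hashlib
from typing import Dict, Tuple


class DedupStore:
    """Hypothetical data store that retains one copy per distinct unit of data."""

    def __init__(self) -> None:
        # Maps a unit's hash (hex string) to the unit's bytes.
        self._units: Dict[str, bytes] = {}

    def put(self, unit: bytes) -> Tuple[str, bool]:
        """Store the unit only if it is not already present.

        Returns (hash, newly_stored).
        """
        key = hashlib.sha256(unit).hexdigest()  # assumed hash function
        if key in self._units:
            return key, False  # duplicate: nothing is written
        self._units[key] = unit
        return key, True

    def get(self, key: str) -> bytes:
        return self._units[key]

    def __len__(self) -> int:
        return len(self._units)


store = DedupStore()
results = [store.put(u) for u in (b"alpha", b"beta", b"alpha")]
stored_flags = [newly_stored for _, newly_stored in results]
```

In this sketch the third `put` call finds that `b"alpha"` already resides in the store, so only two units are retained.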
One technique for data deduplication in data systems is to have the sender upload all data to be processed (e.g., stored, in a data storage system) at the receiver, and have the receiver identify units of data that are to be processed. However, because all of the data still crosses the communications channel, this technique does not reduce bandwidth usage between the sender and the receiver.
A conventional technique for data deduplication that may reduce bandwidth usage is to have the sender identify units of data to upload to the receiver; only the identified units of data are uploaded from the sender to the receiver. FIG. 1 illustrates a conventional deduplication technique in which a sender (a data source) identifies and uploads units of data to a receiver (e.g., a data storage system). In this conventional deduplication technique, the sender 20 maintains data 22 and locally stored fingerprints 24. Locally stored fingerprints 24 may uniquely identify units of data 22 that have been uploaded to data store 12. A fingerprint 24 may, for example, be a hash of a unit of data 22. In block-based data systems (for example, block storage systems), a unit of data may, for example, be a 256 k-byte portion of a data block, a 1024 k-byte portion of a data block, or some other fixed or variable sized portion of a data block. In file-based systems, a unit of data may be a file, or a portion of a file similar to the portions in a block-based data system. When sender 20 has data 22 to be uploaded to receiver 10, a data upload manager 26 at sender 20 may extract fingerprint(s) for units of the data 22 to be uploaded and compare the extracted fingerprint(s) to locally stored fingerprints 24 to identify one or more units of data that have not been uploaded to receiver 10 (or that have previously been uploaded, but have since been modified locally). The data upload manager 26 may then upload the identified data unit(s) to receiver 10, which processes 12 the data unit(s), for example by storing the data units to a data store.
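The sender-side comparison described above may be sketched as follows. This is a minimal illustration, not the technique of FIG. 1 itself: the function names (`split_units`, `fingerprint`, `select_units_to_upload`), the use of SHA-256 as the fingerprint, and the fixed-size splitting are all assumptions. The sketch also records a fingerprint locally as soon as a unit is selected, whereas a real sender would presumably wait until the upload is confirmed.

```python
import hashlib
from typing import List, Set

UNIT_SIZE = 256 * 1024  # assumed fixed unit size, one of the sizes named in the text


def split_units(data: bytes, unit_size: int = UNIT_SIZE) -> List[bytes]:
    """Split data into fixed-size units (the last unit may be shorter)."""
    return [data[i:i + unit_size] for i in range(0, len(data), unit_size)]


def fingerprint(unit: bytes) -> str:
    """Fingerprint a unit as a hash of its contents (SHA-256 is an assumption)."""
    return hashlib.sha256(unit).hexdigest()


def select_units_to_upload(data: bytes,
                           local_fingerprints: Set[str],
                           unit_size: int = UNIT_SIZE) -> List[bytes]:
    """Return the units whose fingerprints are absent from the local set.

    Simplification: each selected unit's fingerprint is added to the local
    set immediately, without waiting for the upload to succeed.
    """
    to_upload = []
    for unit in split_units(data, unit_size):
        fp = fingerprint(unit)
        if fp not in local_fingerprints:
            to_upload.append(unit)
            local_fingerprints.add(fp)
    return to_upload


# Usage with a tiny 4-byte unit size so the behavior is easy to follow.
local = set()
first = select_units_to_upload(b"AAAABBBBAAAACCCC", local, unit_size=4)
second = select_units_to_upload(b"AAAADDDD", local, unit_size=4)
```

In the usage lines, the repeated unit `AAAA` is selected once in the first call and not at all in the second, so only previously unseen units would cross the communications channel. A locally modified unit hashes to a different fingerprint and would therefore be selected again.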
While this technique may reduce the bandwidth used in uploading data from the sender 20 to the receiver 10, the technique requires the sender 20 to maintain a dictionary of fingerprints 24. In many such systems, a local store or cache of data 22 maintained locally at sender 20 may include many gigabytes or terabytes of data. Thus, the dictionary of fingerprints 24 that must be maintained by sender 20 may be quite large. In addition, in some systems, a receiver 10 may serve multiple senders 20, and in these systems it is difficult to apply deduplication globally (e.g., to consistently apply deduplication across data stored by the receiver 10 for two or more data sources).
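To give a rough sense of scale for the fingerprint dictionary, the following sketch estimates its size under stated assumptions: fixed-size units, a 32-byte fingerprint per unit (for example, a SHA-256 digest), 1 k-byte taken as 1024 bytes, and no allowance for index or metadata overhead. None of these figures come from the description; the function name is hypothetical.

```python
def fingerprint_dictionary_bytes(data_bytes: int,
                                 unit_bytes: int,
                                 fingerprint_bytes: int = 32) -> int:
    """Estimate raw dictionary size: one fingerprint per unit of data."""
    units = -(-data_bytes // unit_bytes)  # ceiling division
    return units * fingerprint_bytes


KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

# 1 terabyte of local data split into 256 k-byte units.
coarse = fingerprint_dictionary_bytes(1 * TIB, 256 * KIB)
# The same data split into 4 k-byte units (a hypothetical finer granularity).
fine = fingerprint_dictionary_bytes(1 * TIB, 4 * KIB)

print(coarse // MIB, "MiB")  # prints: 128 MiB
print(fine // GIB, "GiB")    # prints: 8 GiB
```

Under these assumptions the raw fingerprints alone occupy about 128 MiB per terabyte at 256 k-byte units, and the figure grows in inverse proportion to the unit size.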
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to.