Historically, computer files (or more generally, digital objects) have been stored in file systems. These file systems have typically been hierarchical, and have allowed files to be inserted, removed or retrieved according to a particular schema. Usually, such a file system is implemented using a B-tree, and objects are stored along with metadata such as a file name and other attributes. The file identifier often conforms to a regular hierarchical path, and files are stored and retrieved using path names.
This model of storing files, though, is reaching its limits now that massive amounts of information must be stored within file systems. A single computer may store millions of files, and computer servers in large networks may be required to store many times that amount of information. While a B-tree implementation (for example) may work fine with many thousands of files, a file system may process requests much more slowly as the number of files increases. New techniques of storing information have accordingly been developed.
Storage clusters have been developed where digital objects are stored in a flat address space across any number of computer nodes. A unique identifier for each object (such as a hash value taken over the object or a random number, for example) is used to add the digital object to, or retrieve it from, the storage cluster. With the proliferation of electronic mail, mobile telephones, and information available in electronic form, greater quantities of digital information are being stored, and inevitably the same digital object may be stored many times in a computer system. For example, a single presentation sent around a corporate environment may be stored many hundreds of times on an e-mail server or within a long-term storage cluster, even though each copy of the presentation is exactly the same. Having multiple copies of the same digital object within a storage cluster wastes disk space, consumes CPU time, and generally makes the cluster less efficient. It would be advantageous to eliminate the unneeded copies.
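As a rough illustration of the duplication problem described above, the following Python sketch models a flat address space as an in-memory dictionary and assigns each stored object a random identifier. The class FlatCluster and its methods are hypothetical names chosen for this illustration and do not describe any particular storage cluster; a real cluster would spread objects across many nodes.

```python
import uuid


class FlatCluster:
    """Toy in-memory stand-in for a storage cluster with a flat address space."""

    def __init__(self):
        # identifier -> object bytes; no hierarchy and no path names
        self._objects = {}

    def add(self, data: bytes) -> str:
        # Assumption for this sketch: the unique identifier is a random
        # number (a UUID4 here) rather than a hash taken over the object.
        identifier = uuid.uuid4().hex
        self._objects[identifier] = data
        return identifier

    def retrieve(self, identifier: str) -> bytes:
        return self._objects[identifier]

    def stored_bytes(self) -> int:
        # Total space consumed, counting every stored copy separately.
        return sum(len(data) for data in self._objects.values())


cluster = FlatCluster()
presentation = b"quarterly results" * 1000

# The same presentation arrives three times, e.g. as three e-mail attachments.
ids = [cluster.add(presentation) for _ in range(3)]

print(len(set(ids)))           # number of distinct identifiers issued
print(cluster.stored_bytes())  # space used by the identical copies
```

In this sketch each identical copy receives its own identifier and occupies its own space, which is the waste that the passage above describes.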
If a hash value of the digital object is used as the unique identifier, then this hash value may be sent to the cluster before the digital object is stored in order to determine whether the object is already present within the cluster. However, this technique (“in-line elimination of duplicates”) can bog down the input rate to the cluster, because the hash value must be calculated over the entire object before it can be sent. If an identifier other than a hash value is used (such as a random number), then in-line elimination is not an option because there is no one-to-one correspondence between the identifier and the object (copies of the object may have different identifiers).
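The in-line technique can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a description of any actual cluster interface: SHA-256 is assumed as the hash function, and HashAddressedCluster, contains, add, and store_inline_dedup are hypothetical names introduced here.

```python
import hashlib
from typing import Tuple


class HashAddressedCluster:
    """Toy in-memory cluster whose identifiers are hashes of the content."""

    def __init__(self):
        self._objects = {}  # hash identifier -> object bytes

    def contains(self, identifier: str) -> bool:
        return identifier in self._objects

    def add(self, identifier: str, data: bytes) -> None:
        self._objects[identifier] = data

    def object_count(self) -> int:
        return len(self._objects)


def store_inline_dedup(cluster: HashAddressedCluster, data: bytes) -> Tuple[str, bool]:
    """Store data unless an identical object is already present.

    Returns (identifier, written), where written is False for a duplicate.
    """
    # The whole object must be hashed before the cluster can be queried;
    # this up-front cost is what slows the input rate for large objects.
    identifier = hashlib.sha256(data).hexdigest()
    if cluster.contains(identifier):
        return identifier, False  # duplicate detected before any write
    cluster.add(identifier, data)
    return identifier, True


cluster = HashAddressedCluster()
first_id, first_written = store_inline_dedup(cluster, b"same presentation")
second_id, second_written = store_inline_dedup(cluster, b"same presentation")
```

Here the second call finds the identifier already present and writes nothing. The check works only because identical content always yields the same identifier; with random identifiers there would be nothing to look up before the write.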
Accordingly, it would be desirable to eliminate duplicates of digital objects within a storage cluster regardless of the type of unique identifier used, and without bogging down the input rate to the cluster.