In typical file systems, stored items are retrieved based on (a) the location at which the items are stored, and (b) a name or identifier of the items. For example, if a file named “foo.txt” is located in a directory named “c:\myfiles\text”, then applications may use the pathname “c:\myfiles\text\foo.txt” as the access key to retrieve the file from the file system.
Because conventional access keys are based on the location of the items being retrieved, the access keys change when the items are moved within a directory structure of a file system. In addition, each copy of an item has a different access key, because each copy is stored at a different location. On the other hand, when the content of the item is changed, the access key remains the same.
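The behavior of location-based access keys can be illustrated with a minimal sketch. The in-memory `files` dictionary and the `write`, `move`, and `copy` helpers are hypothetical stand-ins for a file system, not the interface of any actual one; the paths other than “c:\myfiles\text\foo.txt” are likewise invented for illustration.

```python
# Hypothetical in-memory model of a location-addressed file system.
# The access key is the pathname; it says nothing about the content.
files = {}

def write(path, data):
    """Store (or overwrite) data under the given pathname."""
    files[path] = data

def move(old_path, new_path):
    """Moving an item changes its access key."""
    files[new_path] = files.pop(old_path)

def copy(src_path, dst_path):
    """Each copy lives at its own location, so it has its own access key."""
    files[dst_path] = files[src_path]

write(r"c:\myfiles\text\foo.txt", b"hello")
copy(r"c:\myfiles\text\foo.txt", r"c:\backup\foo.txt")
move(r"c:\myfiles\text\foo.txt", r"c:\myfiles\archive\foo.txt")

# Changing the content leaves the access key unchanged.
write(r"c:\backup\foo.txt", b"changed")
```

In this sketch the original pathname stops working after the move, the two copies are reachable only through two different keys, and the key of the backup copy survives a change to its content.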
In contrast to conventional file systems, content-addressable storage systems allow applications to retrieve items from storage based on data that is generated from the content of the items, such as a hash value for the content. Because such systems perform storage-related operations on items based on the hash values generated for the items, and the hash values depend on the content of the items rather than on where the items are stored, applications may request those operations without knowing the number or location of the stored copies of the items. For example, a content-addressable storage system may store multiple copies of an item X at locations A, B, and C. An application that desires to retrieve item X would do so by sending a request with a hash value based on the contents of item X. Based on that hash value, the content-addressable storage system would provide to the application a copy of item X retrieved from one of the locations A, B, and C. Thus, the application would obtain item X without knowing where item X was actually stored, how many copies of item X existed, or the specific location from which the retrieved copy was actually obtained.
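The item X example can be sketched as follows. The `ReplicatedStore` class, its method names, and the choice of SHA-256 are assumptions made for illustration; the text does not prescribe a particular hash function or replication scheme.

```python
import hashlib

class ReplicatedStore:
    """Hypothetical content-addressable store that keeps copies at several locations."""

    def __init__(self, location_names):
        # One dictionary per storage location, each keyed by content hash.
        self.locations = {name: {} for name in location_names}

    def put(self, data: bytes) -> str:
        # The access key is derived from the content alone (SHA-256 assumed here).
        key = hashlib.sha256(data).hexdigest()
        for replica in self.locations.values():
            replica[key] = data
        return key

    def get(self, key: str) -> bytes:
        # Return the item from whichever location holds it; the caller never
        # learns which location served the request or how many copies exist.
        for replica in self.locations.values():
            if key in replica:
                return replica[key]
        raise KeyError(key)

store = ReplicatedStore(["A", "B", "C"])
key_x = store.put(b"contents of item X")

# Losing the copy at location A does not affect the caller.
del store.locations["A"][key_x]
item_x = store.get(key_x)
```

The caller supplies only `key_x`; the location names appear nowhere in the request or the response.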
A chunk storage system is a storage system that performs storage operations without understanding the format or content of the digital information itself. Such storage systems are referred to as chunk storage systems because the systems treat all forms of digital items as if those items were merely opaque chunks of data. For example, the same chunk storage system may be used by word processing applications, image management applications, and calendaring systems to respectively store documents, images and appointments. However, from the perspective of the chunk storage system, only one type of item is being stored: opaque chunks of digital information.
Chunk storage systems may be implemented as content-addressable storage systems. For example, a chunk storage system may generate a hash value for a chunk by applying a cryptographic hash function (e.g., MD5, SHA-1, or SHA-2) to the chunk. The chunk storage system may then store the chunk, and maintain indexing data that associates the hash value with the location at which the chunk is stored.
When an application subsequently requests retrieval of the chunk, the application provides the hash value to the chunk storage system. The chunk storage system uses the indexing data to locate the chunk associated with the hash value, and provides the chunk thus located to the requesting application.
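The store-and-retrieve cycle described above can be sketched as a minimal chunk store. The `ChunkStore` class, the append-only byte log, and the use of SHA-256 (a member of the SHA-2 family) are illustrative assumptions, not a description of any particular system.

```python
import hashlib

class ChunkStore:
    """Hypothetical chunk store: opaque bytes in, hash value out."""

    def __init__(self):
        self._log = bytearray()   # where chunk bytes physically live
        self._index = {}          # indexing data: hash value -> (offset, length)

    def put(self, chunk: bytes) -> str:
        # Generate the hash value from the chunk content.
        key = hashlib.sha256(chunk).hexdigest()
        if key not in self._index:
            # Record the location at which the chunk is stored.
            self._index[key] = (len(self._log), len(chunk))
            self._log.extend(chunk)
        return key

    def get(self, key: str) -> bytes:
        # Use the indexing data to locate the chunk for this hash value.
        offset, length = self._index[key]
        return bytes(self._log[offset:offset + length])

chunks = ChunkStore()
doc_key = chunks.put(b"a word processing document")
img_key = chunks.put(b"\x89PNG image bytes")
same_key = chunks.put(b"a word processing document")  # identical content, same key
```

Because the key depends only on content, storing the same bytes twice yields the same hash value, and this sketch keeps a single physical copy.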
When an item is represented by one or more chunks in a content-addressable storage system, one or more additional chunks must be added to the content-addressable storage system when the item is modified. Because the access key is based on the content, the access key for any chunk corresponding to the modified item will be different from the access key for a chunk corresponding to the original item. Furthermore, references to the original item, such as hash values or other access keys, will only be usable to access the original item.
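The effect of modifying an item can be sketched as follows. Splitting an item into fixed-size chunks, the 4-byte chunk size, the helper names, and SHA-256 are all assumptions chosen to keep the example small; the text does not specify how items are divided into chunks.

```python
import hashlib

chunk_store = {}  # hypothetical store: hash value -> chunk bytes

def put_chunk(chunk: bytes) -> str:
    key = hashlib.sha256(chunk).hexdigest()
    chunk_store[key] = chunk
    return key

def put_item(data: bytes, chunk_size: int = 4) -> list:
    """Store an item as fixed-size chunks and return the list of access keys."""
    return [put_chunk(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]

def get_item(keys: list) -> bytes:
    """Reassemble an item from the chunks its keys refer to."""
    return b"".join(chunk_store[k] for k in keys)

original_keys = put_item(b"AAAABBBBCCCC")
chunks_before = len(chunk_store)

# Modifying the middle of the item adds a chunk with a new access key.
modified_keys = put_item(b"AAAAXXXXCCCC")
chunks_added = len(chunk_store) - chunks_before
```

In this sketch the keys held before the modification still resolve to the original bytes, so a reference to the original item never silently returns the modified one.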
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.