The approaches described in this section could be pursued, but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
In typical file systems, stored items are retrieved based on the location at which the items are stored, and a name or identifier of the items. For example, if a file named “foo.txt” is located in a directory named “c:\myfiles\text”, then applications may use the pathname “c:\myfiles\text\foo.txt” as the access key to retrieve the file from the file system. Because conventional access keys are based on the location of the items being retrieved, the access keys change when the items are moved within a directory structure of a file system. In addition, each copy of an item has a different access key, because each copy is stored at a different location. On the other hand, when the content of the item is changed, the access key remains the same.
In contrast to conventional file systems, content-addressable storage systems allow retrieval of items based on data that is generated from the content of the items, such as a hash value for the item. Because content-addressable storage systems perform storage-related operations on items based on the content of the items rather than a static location for a particular item associated with a particular filename, applications that request the operations may do so without knowing the number or location of the stored copies of the items.
A chunk storage system is a storage system that performs storage operations without understanding the format or content of the digital information itself. Such storage systems are referred to as chunk storage systems because the systems treat all forms of digital items as if those items were merely opaque chunks of data. For example, the same chunk storage system may be used by word processing applications, image management applications, and calendaring systems to respectively store documents, images and appointments. However, from the perspective of the chunk storage system, only one type of item is being stored: opaque chunks of digital information.
Chunk storage systems may be implemented as content-addressable storage systems. For example, a chunk storage system may generate an access key for a chunk based on its content, such as by applying a cryptographic hash function (e.g., MD5, SHA-1 or SHA-2) to the chunk. The chunk storage system may then store the chunk and maintain indexing data that associates the hash value with the location at which the chunk is stored. When an application subsequently requests retrieval of the chunk, the application provides the hash value to the chunk storage system. The chunk storage system uses the indexing data to locate the chunk associated with the hash value, and provides the chunk thus located to the requesting application.
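For illustration only, the store-and-retrieve behavior described above may be sketched as follows. The class name `ChunkStore`, its methods `put` and `get`, the in-memory list standing in for storage locations, and the choice of SHA-256 (a member of the SHA-2 family) are all assumptions made for this sketch and do not describe any particular chunk storage system.

```python
import hashlib


class ChunkStore:
    """Hypothetical in-memory content-addressable chunk store (illustrative only)."""

    def __init__(self):
        self._chunks = []  # simulated storage; a chunk's location is its list position
        self._index = {}   # indexing data: access key (hash value) -> location

    def put(self, data: bytes) -> str:
        """Store an opaque chunk and return its content-derived access key."""
        key = hashlib.sha256(data).hexdigest()
        if key not in self._index:
            # Identical content yields the same key, so it is stored only once.
            self._chunks.append(data)
            self._index[key] = len(self._chunks) - 1
        return key

    def get(self, key: str) -> bytes:
        """Locate a chunk through the indexing data and return it."""
        return self._chunks[self._index[key]]
```

In this sketch, storing the same bytes twice returns the same access key, while storing modified bytes returns a different access key and leaves the original chunk retrievable under its original key.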
When an item is represented by one or more chunks in a content-addressable storage system, one or more additional chunks must be added to the content-addressable storage system when the item is modified. Because the access key is based on the content, the access key for any chunk corresponding to the modified item will be different from the access key for a chunk corresponding to the original item. Furthermore, references to the original item, such as hash values or other access keys, will only be usable to access the original item, not the modified item.
A file system volume may include one or more files arranged in a folder hierarchy. To store such a file system volume as chunks in a content-addressable storage system, the folder hierarchy itself may be reflected in one or more stored chunks. For example, assume that chunk A represents a folder A, and that chunks B and C represent files within folder A. In this case, the chunk A that represents folder A may include access keys for chunks B and C, thereby reflecting the hierarchical relationship between folder A and files B and C. Such access keys may be used to navigate down the folder hierarchy. However, if a particular chunk is obtained without navigating through the folder hierarchy, such as in response to an index search, the problem arises of determining the position of the particular chunk in the folder hierarchy. Unlike a typical file system with a location-based access key, such as pathname “c:\myfiles\text\foo.txt”, the access key of a chunk does not include the position of the chunk in any folder hierarchy.
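By way of a non-limiting sketch, the folder A example above may be modeled as follows. The JSON encoding of a folder chunk, the helper names `put`, `children` and `find_parents`, and the use of SHA-256 are hypothetical choices made only for this illustration.

```python
import hashlib
import json

store = {}  # hypothetical chunk store: access key -> chunk bytes


def put(data: bytes) -> str:
    """Store a chunk under an access key derived from its content."""
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key


# Chunks B and C represent files; chunk A represents folder A and
# contains the access keys of chunks B and C.
key_b = put(b"contents of file B")
key_c = put(b"contents of file C")
key_a = put(json.dumps({"B": key_b, "C": key_c}, sort_keys=True).encode())


def children(folder_key: str) -> dict:
    """Navigate down: read the child access keys held in a folder chunk."""
    return json.loads(store[folder_key])


def find_parents(child_key: str) -> list:
    """Navigate up: nothing in child_key records a position in the hierarchy,
    so this sketch has to scan every chunk for one that lists it."""
    parents = []
    for key, data in store.items():
        try:
            entries = json.loads(data)
        except ValueError:
            continue  # not a folder chunk in this sketch's encoding
        if isinstance(entries, dict) and child_key in entries.values():
            parents.append(key)
    return parents
```

Here `children(key_a)` follows the hierarchy downward directly, whereas `find_parents(key_b)` must examine the whole store, which illustrates why a chunk obtained outside of hierarchy navigation carries no indication of its position.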
Furthermore, in a content-addressable storage system, when the contents of a particular file are modified, a new version of the file must be stored at a different address based on the modified content, so the access key for the new version differs from the access key for the original file. When a file system hierarchy is represented in one or more chunks, chunks that contain the access key of the original file (i.e., chunks that correspond to items, in the hierarchy, that are above the item that corresponds to the modified chunk) must also be changed to contain the access key of the new version, causing the generation of additional new chunks in turn. Accordingly, modifying a single file may cause multiple chunks that reflect the hierarchical structure of the file system to change.
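The upward propagation of a change may be illustrated with the following sketch, which assumes a small hypothetical volume consisting of a root folder, a folder A, and files B and C. The function names, the JSON folder encoding and the use of SHA-256 are assumptions of this illustration only.

```python
import hashlib
import json


def put(store: dict, data: bytes) -> str:
    """Store a chunk under an access key derived from its content."""
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key


def put_folder(store: dict, entries: dict) -> str:
    """Store a folder chunk whose content is the access keys of its children."""
    return put(store, json.dumps(entries, sort_keys=True).encode())


def build_volume(store: dict, file_b_content: bytes) -> dict:
    """Build root -> folder A -> {file B, file C} and return every access key."""
    key_b = put(store, file_b_content)
    key_c = put(store, b"contents of file C")
    key_a = put_folder(store, {"B": key_b, "C": key_c})
    key_root = put_folder(store, {"A": key_a})
    return {"root": key_root, "A": key_a, "B": key_b, "C": key_c}


store = {}
before = build_volume(store, b"file B, version 1")
after = build_volume(store, b"file B, version 2")

# Names of the items whose access keys differ after modifying only file B.
changed = sorted(name for name in before if before[name] != after[name])
```

In this sketch, `changed` contains file B together with folder A and the root, because each ancestor chunk embeds the access key of the item below it. File C keeps its access key, and the chunks of the original version remain in the store under their original keys.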
One or more indexes may be maintained for chunks of a file system volume stored in a content-addressable storage system. The indexes may identify the chunks of the file system volume by access key. Any such indexes must also be updated when new versions of chunks are stored in the content-addressable storage system. A naive implementation of updating an index involves responding to every chunk change by iterating through the hierarchy from the root chunk and reestablishing the index for all chunks in the file system volume. This is not a scalable solution since it involves iterating through every chunk in the file system hierarchy when any file is modified.
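The naive index update described above may be sketched as follows, purely as an illustration. The folder encoding (each entry recording a kind of "dir" or "file" alongside an access key), the function `rebuild_index`, and the path-valued index are hypothetical; an actual index could take many other forms.

```python
import hashlib
import json

store = {}  # hypothetical chunk store: access key -> chunk bytes


def put(data: bytes) -> str:
    """Store a chunk under an access key derived from its content."""
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key


def put_folder(entries: dict) -> str:
    """Store a folder chunk; entries maps a name to a (kind, access key) pair."""
    return put(json.dumps(entries, sort_keys=True).encode())


def rebuild_index(root_key: str):
    """Naive approach: walk the whole hierarchy from the root chunk.

    Returns the index (access key -> path) and the number of chunks visited.
    """
    index = {}
    visited = 0
    stack = [("", "dir", root_key)]
    while stack:
        path, kind, key = stack.pop()
        visited += 1
        index[key] = path or "/"
        if kind == "dir":
            for name, (child_kind, child_key) in json.loads(store[key]).items():
                stack.append((path + "/" + name, child_kind, child_key))
    return index, visited


def build_volume(file_b_content: bytes) -> str:
    """Build root -> folder A -> {file B, file C} and return the root access key."""
    key_b = put(file_b_content)
    key_c = put(b"contents of file C")
    key_a = put_folder({"B": ("file", key_b), "C": ("file", key_c)})
    return put_folder({"A": ("dir", key_a)})


index_v1, visited_v1 = rebuild_index(build_volume(b"file B, version 1"))
index_v2, visited_v2 = rebuild_index(build_volume(b"file B, version 2"))
```

In this sketch both rebuilds visit all four chunks of the volume, even though the chunk for file C is identical in the two versions. The cost of each rebuild therefore grows with the size of the volume rather than with the size of the change.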