Information that is used to access a stored digital item is referred to herein as the “access key” of the stored item. In typical file systems, stored items are retrieved based on (a) the location at which the items are stored, and (b) a name or identifier of the items. For example, if a file named “foo.txt” is located in a directory named “c:\myfiles\text”, then applications may use the pathname “c:\myfiles\text\foo.txt” as the access key to retrieve the file from the file system. Because conventional access keys are based on the location of the items being retrieved, the access keys change when the items are moved. In addition, each copy of an item has a different access key, because each copy is stored at a different location.
In contrast to conventional file systems, Content Addressable Storage (CAS) systems allow applications to retrieve items from storage based on a hash value that is generated from the content of the items. Because CAS systems perform storage-related operations on items based on the hash values generated for the items, and the hash values are based on the content of the items rather than where the items are stored, the applications that request the operations may do so without knowing the number or location of the stored copies of the items. For example, a CAS system may store multiple copies of an item X at locations A, B and C. An application that desires to retrieve item X would do so by sending to the CAS system a hash value that is based on the contents of item X. Based on that hash value, the CAS system would provide to the application a copy of item X retrieved from one of the locations A, B, and C. Thus, the application would obtain item X without knowing where item X was actually stored, how many copies of item X existed, or the specific location from which the retrieved copy was actually obtained.
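To make the retrieval flow above concrete, the following Python sketch models a toy CAS system that keeps copies of an item at three locations standing in for A, B, and C. It is illustrative only: the `ToyCAS` class, the in-memory dictionaries used as storage locations, and the choice of SHA-256 as the content hash are all assumptions, not part of any particular CAS product.

```python
import hashlib
import random


class ToyCAS:
    """Hypothetical in-memory CAS; each 'location' is just a dict keyed by hash."""

    def __init__(self, location_names):
        # One dict per storage location (e.g. "A", "B", "C").
        self.locations = {name: {} for name in location_names}

    def put(self, data: bytes) -> str:
        # The access key is derived from the content alone (SHA-256 assumed here).
        key = hashlib.sha256(data).hexdigest()
        # Store a copy at every location; the caller never learns where.
        for store in self.locations.values():
            store[key] = data
        return key

    def get(self, key: str) -> bytes:
        # Any location holding the key may serve the request.
        holders = [store for store in self.locations.values() if key in store]
        if not holders:
            raise KeyError(key)
        return random.choice(holders)[key]


store = ToyCAS(["A", "B", "C"])
key = store.put(b"contents of item X")
# Anyone holding the same bytes can compute the same key independently.
assert key == hashlib.sha256(b"contents of item X").hexdigest()
print(store.get(key))
```

Because the key depends only on the bytes, copying item X to another location, or moving it, leaves its key unchanged, unlike the pathname-based access key of a conventional file system.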
Storing a digital item, such as a file or a message, often involves making a call to a “chunk storage system”. A chunk storage system is a storage system that performs storage operations without understanding the format or content of the digital information itself. Such storage systems are referred to as chunk storage systems because they treat all forms of digital items as if those items were merely opaque chunks of data. For example, the same chunk storage system may be used by word processing applications, image management applications, and calendaring systems to store documents, images, and appointments, respectively. However, from the perspective of the chunk storage system, only one type of item is being stored: opaque chunks of digital information.
Chunk storage systems may be implemented as CAS systems. For example, a chunk storage system may generate a hash value for a chunk by applying a cryptographic hash function (e.g. MD5, SHA-1 or SHA-2) to the chunk. The chunk storage system may then store the chunk and maintain an index that associates the hash value with the location at which the chunk is stored. When an application subsequently requests retrieval of the chunk, the application provides the hash value to the chunk storage system. The chunk storage system uses the index to locate the chunk associated with the hash value, and provides the chunk thus located to the requesting application.
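The store-and-index flow just described can be sketched as follows. This is a minimal, single-process illustration under stated assumptions: the class name `ChunkStore`, the single append-only `bytearray` used as backing storage, and SHA-256 (a member of the SHA-2 family) as the hash function are all choices made for this example. The index maps each hash value to the `(offset, length)` location of its chunk, and the store never inspects what the chunk contains.

```python
import hashlib


class ChunkStore:
    """Hypothetical chunk store: opaque bytes in, hash value out."""

    def __init__(self):
        self._data = bytearray()  # backing storage (an append-only buffer here)
        self._index = {}          # hash value -> (offset, length) in self._data

    def put(self, chunk: bytes) -> str:
        key = hashlib.sha256(chunk).hexdigest()
        # Identical content yields an identical hash, so in this sketch a
        # chunk that is already indexed is not written a second time.
        if key not in self._index:
            self._index[key] = (len(self._data), len(chunk))
            self._data.extend(chunk)
        return key

    def get(self, key: str) -> bytes:
        # Use the index to find where the chunk lives, then read it back.
        offset, length = self._index[key]
        return bytes(self._data[offset:offset + length])


cs = ChunkStore()
doc_key = cs.put(b"quarterly report text")   # a "document"
img_key = cs.put(b"\x89PNG\r\n\x1a\n")       # the start of a PNG "image"
assert cs.get(doc_key) == b"quarterly report text"
assert cs.get(img_key) == b"\x89PNG\r\n\x1a\n"
```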
Chunk storage systems may be configured in a variety of ways. U.S. application Ser. No. 13/358,742 describes how larger composite chunk stores may be composed using various types of building block chunk stores. The intended use of a chunk store is one factor in determining what types of building block chunk stores to use, and how those chunk stores should be arranged.
A chunk storage system that is configured to store different chunks at different building block chunk stores is a form of distributed hash table, where the hash value produced by applying the hash function to the chunk determines which building block chunk store will ultimately store the chunk. For example, consider a simple scenario that includes two chunk stores CS1 and CS2. Assuming that the hash function produces hash values between 0 and 1,000,000, chunks that hash to a value that falls between 0 and 500,000 may be stored at CS1, while chunks that hash to a value that falls between 500,001 and 1,000,000 may be stored at CS2.
The full range of values to which chunks may hash (e.g. 0 to 1,000,000) is referred to as the “hash space”. A portion of the hash space is referred to as a “hash segment”. In a system with multiple building block chunk stores, different hash segments may be assigned to different building block chunk stores. In the example given above, CS1 is assigned the hash segment of 0 to 500,000, and CS2 is assigned the hash segment of 500,001 to 1,000,000.
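The segment lookup in the CS1/CS2 example can be written as a sorted table of segment upper bounds searched with binary search. In this sketch, reducing a SHA-256 digest modulo 1,000,001 so that it lands in the example's 0 to 1,000,000 hash space is an assumption made for illustration, as are the names `SEGMENTS`, `hash_value`, `store_for_hash`, and `store_for_chunk`.

```python
import bisect
import hashlib

HASH_SPACE_MAX = 1_000_000  # hash space is 0..1,000,000 inclusive

# Hash segments as (inclusive upper bound, chunk store), sorted by bound.
SEGMENTS = [(500_000, "CS1"), (1_000_000, "CS2")]
UPPER_BOUNDS = [upper for upper, _ in SEGMENTS]


def hash_value(chunk: bytes) -> int:
    # Assumption: fold a SHA-256 digest into the example's hash space.
    return int.from_bytes(hashlib.sha256(chunk).digest(), "big") % (HASH_SPACE_MAX + 1)


def store_for_hash(h: int) -> str:
    if not 0 <= h <= HASH_SPACE_MAX:
        raise ValueError(f"hash value {h} is outside the hash space")
    # The first segment whose upper bound is >= h owns the hash value.
    i = bisect.bisect_left(UPPER_BOUNDS, h)
    return SEGMENTS[i][1]


def store_for_chunk(chunk: bytes) -> str:
    return store_for_hash(hash_value(chunk))


print(store_for_hash(500_000), store_for_hash(500_001))
```

Keeping the table sorted makes each lookup logarithmic in the number of segments; the table itself is the hash-segment-to-chunk-store mapping, and keeping it correct as stores come and go is the harder problem.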
Because different chunk stores are assigned different hash segments, a chunk storage system must be able to determine, based on the hash value generated for a chunk, which chunk store needs to be involved in an operation on that chunk. While maintaining a hash-segment-to-chunk-store mapping is relatively straightforward in a steady state, it becomes increasingly difficult when new chunk stores are added to the system and/or existing chunk stores fail. In either case, it may be necessary to revise the hash-segment-to-chunk-store mapping. To maintain consistency, any such change in the mapping may also require chunks to be redistributed among the chunk stores, and such redistribution operations may be expensive in terms of both time and computing resources.
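The cost of revising the mapping can be estimated with a small experiment. The sketch below assumes a simple, hypothetical policy in which the hash space is always divided into equal contiguous segments, one per chunk store; with two stores this reproduces the CS1/CS2 split exactly. It then counts how many chunks would change owner if a third store, CS3, were added. The function names and the synthetic chunk set are illustrative assumptions.

```python
import hashlib

HASH_SPACE_MAX = 1_000_000  # hash values range over 0..1,000,000 inclusive


def hash_value(chunk: bytes) -> int:
    # Assumption: fold a SHA-256 digest into the example's hash space.
    return int.from_bytes(hashlib.sha256(chunk).digest(), "big") % (HASH_SPACE_MAX + 1)


def owner(h: int, stores: list) -> str:
    # Equal contiguous segments: with 2 stores, 0..500,000 -> stores[0]
    # and 500,001..1,000,000 -> stores[1], matching the CS1/CS2 example.
    return stores[h * len(stores) // (HASH_SPACE_MAX + 1)]


def chunks_to_move(chunks, old_stores, new_stores):
    # A chunk must be relocated whenever its owning store changes.
    return [c for c in chunks
            if owner(hash_value(c), old_stores) != owner(hash_value(c), new_stores)]


chunks = [f"chunk-{i}".encode() for i in range(10_000)]
moved = chunks_to_move(chunks, ["CS1", "CS2"], ["CS1", "CS2", "CS3"])
print(f"{len(moved)} of {len(chunks)} chunks change stores")
```

Under this equal-split policy, hash values 333,334 through 500,000 move from CS1 to CS2 and values 666,668 through 1,000,000 move from CS2 to CS3, so about half of all chunks must be copied even though the new store ends up holding only about a third of them.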
Thus, in a traditional Distributed Hash Table (“DHT”) that stores and retrieves chunks using CAS, it is difficult to dynamically add storage to the DHT to increase overall capacity while minimizing downtime, optimizing response time, and maximizing the storage utilization rate. Traditionally, storage is added in such an environment by associating the storage media, typically disks, with hash ranges, and dynamically rearranging the hash ranges when new disks come online. Various techniques have been proposed and used to split hash ranges when they reach a certain threshold, and to move chunks from one disk to another to equalize disk utilization. However, such techniques inevitably incur downtime while chunks are redistributed and mappings are updated.
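The threshold-based range splitting mentioned above can be sketched as follows. Everything here is a simplified, hypothetical model rather than any specific system: each `Disk` owns one contiguous hash range, a disk whose chunk count reaches `threshold` has its range halved, and the chunks in the upper half are moved to a newly added disk. The `moved` counter records the copying work that, in a real deployment, corresponds to the redistribution during which downtime is incurred.

```python
import hashlib
from dataclasses import dataclass, field

HASH_SPACE_MAX = 1_000_000  # hash space 0..1,000,000, as in the earlier example


def position(key: str) -> int:
    # Assumption: a chunk's place in the hash space is its SHA-256 digest
    # folded into 0..HASH_SPACE_MAX; the full digest remains the access key.
    return int(key, 16) % (HASH_SPACE_MAX + 1)


@dataclass
class Disk:
    name: str
    lo: int  # inclusive lower bound of this disk's hash range
    hi: int  # inclusive upper bound
    chunks: dict = field(default_factory=dict)  # access key -> chunk bytes


class SplittingDHT:
    def __init__(self, threshold: int):
        self.threshold = threshold
        self.disks = [Disk("disk0", 0, HASH_SPACE_MAX)]
        self.moved = 0  # total chunks copied between disks by splits

    def _disk_for(self, pos: int) -> Disk:
        # Linear scan keeps the sketch short; a real system would index ranges.
        return next(d for d in self.disks if d.lo <= pos <= d.hi)

    def put(self, chunk: bytes) -> str:
        key = hashlib.sha256(chunk).hexdigest()
        disk = self._disk_for(position(key))
        disk.chunks[key] = chunk
        # Split once the disk reaches the threshold (if its range can be halved).
        if len(disk.chunks) >= self.threshold and disk.lo < disk.hi:
            self._split(disk)
        return key

    def _split(self, disk: Disk) -> None:
        mid = (disk.lo + disk.hi) // 2
        new_disk = Disk(f"disk{len(self.disks)}", mid + 1, disk.hi)
        disk.hi = mid
        # Every chunk now outside the old disk's range must be physically moved.
        for key in [k for k in disk.chunks if position(k) > mid]:
            new_disk.chunks[key] = disk.chunks.pop(key)
            self.moved += 1
        self.disks.append(new_disk)

    def get(self, key: str) -> bytes:
        return self._disk_for(position(key)).chunks[key]


dht = SplittingDHT(threshold=100)
keys = [dht.put(f"chunk-{i}".encode()) for i in range(1_000)]
print(len(dht.disks), "disks;", dht.moved, "chunk moves caused by splits")
```

In this model each split copies roughly half of the splitting disk's chunks, so the copying work grows with how full the disk already is when it crosses the threshold.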
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.