1. Field of the Invention
The present invention relates generally to data storage, and more specifically to techniques for distributing segments of data among multiple storage repositories.
2. Description of the Related Art
Capacity optimization schemes factor objects into parts and plans, and store the parts and plans in an optimized store or repository. Parts that are unique, i.e., have not been previously encountered, are stored. Before a new part is added to the optimized store, the part is compared with the parts already located in the store. Direct comparison is generally too expensive, computationally, to provide useful performance from the factoring process. Instead, the parts are “fingerprinted” using, for example, a cryptographic checksum. When a part is added to the optimized store, its checksum is added to an index of checksums. Before a new part is added to an optimized store, the index is examined. If the checksum is found in the index, then the part is determined to be already in the optimized store and is not added again. To provide useful factoring performance, the index of checksums is typically maintained in computer memory.
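By way of illustration only, the checksum index lookup described above may be sketched as follows. The sketch assumes SHA-256 as the cryptographic checksum and an in-memory set as the index; the names OptimizedStore and add are hypothetical and do not appear in any particular capacity optimization product.

```python
import hashlib


class OptimizedStore:
    """Illustrative store that keeps each unique part exactly once."""

    def __init__(self):
        self.index = set()  # in-memory index of checksums (assumed structure)
        self.parts = {}     # checksum -> part contents

    def add(self, part: bytes) -> bool:
        """Add a part; return True if it was new, False if already stored."""
        # Fingerprint the part instead of comparing it directly
        # against every part already in the store.
        checksum = hashlib.sha256(part).digest()
        if checksum in self.index:
            return False  # checksum found: the part is already stored
        self.index.add(checksum)
        self.parts[checksum] = part
        return True


store = OptimizedStore()
results = [store.add(p) for p in (b"alpha", b"beta", b"alpha")]
```

In this sketch the second occurrence of the same part is detected through the index alone, without reading the previously stored part.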
Capacity optimization schemes operate on arbitrary binary electronic data by dividing the “original,” i.e., initial, unprocessed, data into multiple segments, which correspond to the “parts” described above. The segments may vary in size, and the segment boundaries are typically chosen with the goal that the segments correspond to the most repeated, i.e., duplicated, portions of the original data. The segments are then stored in a segment pool in a repository, i.e., a database, using a relatively small numeric “key” value as a lookup value. The key value may be, for example, a hash value based on data in the segment. The key value is essentially a unique identifier for the segment. A de-duplicated data object, e.g., a file, which corresponds to the “plan” described above, is created, and the key value of each segment is added to the de-duplicated data object in succession. The segment contents themselves need not be written to the de-duplicated data object. Instead, the segments are stored in the repository, and each segment is stored only once (hence the term de-duplication). Note that data de-duplication schemes are also referred to as reduced-redundancy systems.
The de-duplicated data object may, for example, include a file, a data structure in memory, or an object. Each segment need only be stored in the repository once, because the references to that segment in the de-duplicated data object can all be resolved by retrieving the same segment. The de-duplicated data object plus the data stored in the repository is typically significantly smaller in size than the original data. The original data can be regenerated by sequentially reading the de-duplicated data object, i.e., the sequence of key values, retrieving the segment for each key value from the repository, and concatenating the retrieved segments together in the order of retrieval.
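By way of illustration only, the de-duplication and regeneration steps described above may be sketched as follows. The sketch makes several simplifying assumptions that are not part of the schemes described: segments are of a fixed size rather than variable size, the key value is a SHA-256 hash of the segment, and the repository is an in-memory dictionary. The function names split_fixed, deduplicate, and regenerate are hypothetical.

```python
import hashlib


def split_fixed(data: bytes, size: int = 4) -> list[bytes]:
    """Divide data into fixed-size segments (a simplification; actual
    schemes typically choose variable segment boundaries)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def deduplicate(data: bytes, repository: dict, size: int = 4) -> list[str]:
    """Store each unique segment once and return the sequence of key
    values, which plays the role of the de-duplicated data object."""
    keys = []
    for segment in split_fixed(data, size):
        key = hashlib.sha256(segment).hexdigest()
        repository.setdefault(key, segment)  # stored only if not yet present
        keys.append(key)
    return keys


def regenerate(keys: list[str], repository: dict) -> bytes:
    """Retrieve the segment for each key value, in order, and
    concatenate the segments to recover the original data."""
    return b"".join(repository[key] for key in keys)


repository = {}
original = b"abcdabcdabcdwxyz"
plan = deduplicate(original, repository)
restored = regenerate(plan, repository)
```

In this sketch the sequence of key values contains four entries while the repository holds only two distinct segments, because the segment "abcd" occurs three times in the original data and is stored once.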
The Blocklets™ technology of RockSoft (now part of Quantum Corporation of San Jose, Calif.) is an example of a de-duplication scheme. The Blocklets™ scheme uses the term “blocklet” for the segment described above. Blocklets are stored in a “blocklet pool” in a repository similar to the segment pool described above.
When a segment is stored in the segment pool in a repository, an entry is typically made in an index that is also stored in the repository. The size of the index in a data de-duplication scheme may grow to be very large. If the original data is relatively large, then many thousands or millions of segments may be created, and the size of the index may grow beyond the capacity of typical computer systems. The index may become so large that insertions of new segments into the repository and retrievals of existing segments take a substantial amount of time, and performance consequently suffers. Furthermore, the index or the segment pool may become very large and possibly exceed the storage capacity of a single storage device or server. In particular, the index is typically stored in computer memory for efficient access, and is therefore limited in size by the size of the computer's memory. The access time characteristics of a single index and repository may not scale well as the number of accesses increases, e.g., as a result of large numbers of users or a high frequency of access. That is, the index and repository become a bottleneck to system throughput as the usage load increases. Furthermore, the index and repository are a single point of failure, in that a failure of the computer or storage system providing the index and repository may lead to temporary or permanent loss of the original data. Therefore, when the amount of data stored by a de-duplication scheme is large, or the frequency of access is high, it would be desirable to be able to increase the performance and reliability of data de-duplication schemes.