1. Field
This invention relates generally to data storage, and more specifically to systems and processes for data de-duplication.
2. Related Art
Data de-duplication is a technique used in data storage to increase storage efficiency by detecting and removing redundant data. Only unique blocks of data are actually stored in a repository, such as one or more disks or tapes. Redundant data is typically replaced with pointer references to the unique data copy.
Data de-duplication operates by segmenting a dataset, e.g., a stream of backup data, into a sequence of sub-blocks and writing only the unique sub-blocks to a disk target or repository. Each sub-block is assigned an identifier, e.g., a hash value, computed from the data within the sub-block. This identifier is typically stored in an index (a sub-block index) that maps the sub-block's identifier to the location in the repository where the sub-block is stored. A duplicate sub-block within the dataset is detected when the sub-block's identifier matches one of the identifiers in the sub-block index. Instead of storing the sub-block again, a pointer to the original sub-block may be stored in the dataset's metadata (data about the dataset), thereby improving storage efficiency.
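The process described above can be sketched as follows. This is a minimal, hypothetical illustration, not the claimed invention: fixed-size sub-blocks, a SHA-256 digest as the sub-block identifier, an in-memory dictionary standing in for the sub-block index, and a list standing in for the repository are all assumptions made for clarity.

```python
import hashlib

BLOCK_SIZE = 4096  # fixed-size sub-blocks, assumed for simplicity


def deduplicate(data: bytes, index: dict, repository: list) -> list:
    """Segment `data` into sub-blocks and store only the unique ones.

    `index` maps a sub-block's SHA-256 digest to its location in
    `repository`. The returned recipe plays the role of the dataset's
    metadata: a list of repository locations (pointers) from which the
    original stream can be reconstructed.
    """
    recipe = []
    for start in range(0, len(data), BLOCK_SIZE):
        block = data[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()  # sub-block identifier
        if digest not in index:        # sub-block index lookup
            index[digest] = len(repository)
            repository.append(block)   # store the unique sub-block once
        recipe.append(index[digest])   # pointer to the stored copy
    return recipe


def reconstruct(recipe: list, repository: list) -> bytes:
    """Rebuild the original dataset by following the metadata pointers."""
    return b"".join(repository[location] for location in recipe)
```

For example, a five-block stream containing the same block four times would yield a repository holding only two unique sub-blocks, with the recipe supplying pointers for the duplicates.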
Although each identifier occupies very little space, the sub-block index may contain a very large number of identifiers. Therefore, the sub-block index may be too big to fit into memory; instead, it is stored on a storage medium that typically has slower random access time, e.g., a disk. As a result, sub-block index lookups may be costly in terms of access time.
The number of sub-block index lookups may be reduced by caching. For example, a cache may be used to store recently added sub-blocks, recently matched sub-blocks, the most popular sub-blocks, or the like. However, these methods do not reduce the number of sub-block index lookups for sub-blocks that are less common or have not been recently seen.
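One such caching scheme, a cache of recently seen sub-block identifiers placed in front of the slower on-disk index, can be sketched as follows. This is an illustrative assumption, not a scheme taken from the text: the `CachedIndex` class, its capacity, the least-recently-used eviction policy, and the dictionary standing in for the on-disk index are all hypothetical.

```python
from collections import OrderedDict


class CachedIndex:
    """LRU cache of recently seen sub-block identifiers in front of a
    (slow) on-disk sub-block index. All names are illustrative."""

    def __init__(self, disk_index: dict, capacity: int = 1024):
        self.disk_index = disk_index  # stands in for the on-disk index
        self.cache = OrderedDict()    # identifier -> repository location
        self.capacity = capacity
        self.disk_lookups = 0         # counts costly on-disk lookups

    def lookup(self, identifier):
        if identifier in self.cache:           # cache hit: no disk access
            self.cache.move_to_end(identifier)
            return self.cache[identifier]
        self.disk_lookups += 1                 # cache miss: costly lookup
        location = self.disk_index.get(identifier)
        if location is not None:
            self._cache_put(identifier, location)
        return location

    def insert(self, identifier, location):
        self.disk_index[identifier] = location
        self._cache_put(identifier, location)  # recently added: keep cached

    def _cache_put(self, identifier, location):
        self.cache[identifier] = location
        self.cache.move_to_end(identifier)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
```

As the passage notes, such a cache avoids disk lookups only for identifiers that are in the cache; a sub-block that is neither recently added nor recently matched still incurs an on-disk index lookup.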