Data deduplication is a technique used to reduce the overall amount of data storage required to represent and retain data. In general, data deduplication works by identifying duplicate portions of the data being stored and replacing those duplicate portions with pointers to existing stored copies of that data. In this manner, a unique sequence of data identified by a deduplication engine is only required to be stored a single time.
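By way of illustration only, the general idea described above can be sketched in a few lines of Python. The sketch assumes fixed-size portions and SHA-256 signatures, neither of which is specified in this description; the names deduplicate, reconstruct, store, and recipe are hypothetical and chosen only for this example.

```python
import hashlib

def deduplicate(data: bytes, size: int = 4):
    """Split data into fixed-size portions and keep one copy of each unique portion."""
    store = {}    # signature -> the single stored copy of that portion
    recipe = []   # ordered signatures, acting as pointers to stored copies
    for i in range(0, len(data), size):
        piece = data[i:i + size]
        sig = hashlib.sha256(piece).hexdigest()
        store.setdefault(sig, piece)   # stored only the first time it is seen
        recipe.append(sig)             # every occurrence becomes a pointer
    return store, recipe

def reconstruct(store, recipe) -> bytes:
    """Rebuild the original data by following the pointers in order."""
    return b"".join(store[sig] for sig in recipe)

data = b"ABCDABCDEFGHABCD"             # "ABCD" occurs three times
store, recipe = deduplicate(data)
restored = reconstruct(store, recipe)
```

In this example the sixteen input bytes are represented by two stored portions and four pointers, and following the pointers reproduces the input exactly.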
A primary index in a deduplication engine is a data structure used for storing signature values, such as hash values, that are associated with sequences of data that are being stored. These sequences of data are often small portions of a larger file or a data stream and are referred to as blocklets. Copies of unique blocklets are typically stored in a blockpool, which may reside in mass storage such as a hard disk drive or a storage area network. A pointer to an address/location in a blockpool can be stored in the primary index to point from the signature of a blocklet to the actual storage location of the data that comprises it.
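As a non-limiting sketch of the relationship between a primary index and a blockpool, the following Python models the blockpool as an in-memory byte array and the primary index as a dictionary from signature to an (offset, length) location. These representations, the Blockpool class, and the store_blocklet helper are assumptions made for illustration; an actual engine would typically keep both structures in mass storage.

```python
import hashlib

class Blockpool:
    """Hypothetical append-only pool holding the data of unique blocklets."""
    def __init__(self):
        self._data = bytearray()

    def append(self, blocklet: bytes):
        offset = len(self._data)
        self._data += blocklet
        return offset, len(blocklet)   # location of the stored blocklet

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self._data[offset:offset + length])

    def size(self) -> int:
        return len(self._data)

def signature(blocklet: bytes) -> bytes:
    # SHA-256 is an assumed choice of signature function.
    return hashlib.sha256(blocklet).digest()

def store_blocklet(blocklet: bytes, primary_index: dict, pool: Blockpool):
    """Return the blockpool location for a blocklet, storing it only if unseen."""
    sig = signature(blocklet)
    if sig not in primary_index:
        primary_index[sig] = pool.append(blocklet)
    return primary_index[sig]

pool = Blockpool()
primary_index = {}
first = store_blocklet(b"hello", primary_index, pool)
second = store_blocklet(b"hello", primary_index, pool)   # duplicate blocklet
other = store_blocklet(b"world", primary_index, pool)
```

Storing the duplicate blocklet returns the location already recorded in the index and leaves the blockpool unchanged.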
One of the bottlenecks in bulk data matching tasks, such as data deduplication, is access to the primary index. This bottleneck exists because in many storage technologies, such as disk drives and even solid-state storage, random data access is much slower than sequential data access. Because of this slowness, content-driven lookup, such as searching for a signature of a blocklet in a primary index, can take considerable time, as it is an inherently random process. The slowness of random access is compounded by the fact that primary indexes can often be very large. Thus, in some implementations of deduplication engines it is desirable to move some data that would conventionally be found only in the primary index to alternate storage locations outside of the primary index that are physically adjacent to data blocks that are statistically likely to have been required in the recent past. Such storage outside a primary index can be referred to by many names. For example, in some implementations, cluster headers are used for such storage outside of the primary index. Cluster headers are often stored in mass storage, such as on disk, but are typically small enough to be loaded into random access memory, where they can be accessed very quickly.
During deduplication, a signature value, such as a hash value for a blocklet of data being deduplicated, may initially be looked up in the primary index of a deduplication engine. In some embodiments, the primary index then references an outside storage location, such as a cluster header (it is also possible in some embodiments that the lookup may begin in a location outside the primary index). In some embodiments, this outside storage location includes portions of the information content of the primary index. For example, this storage outside the primary index can include blocklet signature values and associated pointers to the actual data of blocklets (e.g. a pointer from a signature value to an address in a blockpool where the actual data of a blocklet is stored).
Alternatives are possible, but an outside storage location typically comprises, or is closely related to, a sequential representation of blocklets of a previously deduplicated data stream. After an initial blocklet's signature is found in this sequential representation, time and computational savings are realized if the signature for the next blocklet from the data stream that is currently being deduplicated is the next signature in the previously seen data stream that is represented in the outside storage. When such sequential matching occurs, the deduplication engine is spared the time and computational resources required to search for the signature by random access through the primary index. As a fair amount of stored data tends to be repetitive in nature, such use of storage outside the primary index can generate an overall gain in deduplication efficiency.
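The sequential matching described above can be sketched as follows, purely as an illustration. The sketch assumes a single cluster header modeled as a list of (signature, blockpool address) pairs in the order of a previously deduplicated stream, and a primary index that maps a signature to a position in that list. These layouts, the match_stream function, and the index_lookups counter are hypothetical; the counter stands in for the costly random accesses to the primary index.

```python
import hashlib

def signature(blocklet: bytes) -> bytes:
    # SHA-256 is an assumed choice of signature function.
    return hashlib.sha256(blocklet).digest()

def match_stream(blocklets, primary_index, cluster_header):
    """Resolve each blocklet to a blockpool address, preferring sequential matches.

    Returns (addresses, index_lookups); an address of None means the
    blocklet was not found and would need to be stored as new data.
    """
    addresses = []
    index_lookups = 0
    cursor = None   # position of the most recent match in the cluster header
    for blocklet in blocklets:
        sig = signature(blocklet)
        nxt = None if cursor is None else cursor + 1
        if (nxt is not None and nxt < len(cluster_header)
                and cluster_header[nxt][0] == sig):
            cursor = nxt                      # sequential hit, no index access
        else:
            index_lookups += 1                # fall back to the primary index
            cursor = primary_index.get(sig)
        addresses.append(None if cursor is None else cluster_header[cursor][1])
    return addresses, index_lookups

# A previously deduplicated stream and the structures derived from it.
old_stream = [b"a", b"b", b"c", b"d"]
cluster_header = [(signature(b), addr) for addr, b in enumerate(old_stream)]
primary_index = {sig: pos for pos, (sig, _) in enumerate(cluster_header)}

# A new stream that repeats part of the old one, then diverges.
new_stream = [b"b", b"c", b"d", b"x"]
addresses, index_lookups = match_stream(new_stream, primary_index, cluster_header)
```

In this example only the first blocklet and the final, previously unseen blocklet require a lookup in the primary index; the two blocklets in between are resolved by advancing through the cluster header.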
The drawings referred to in this brief description should be understood as not being drawn to scale unless specifically noted.