Data deduplication is a technique used to reduce the overall amount of data storage required to represent and retain data. In general, data deduplication works by identifying duplicate portions of the data being stored and replacing those duplicate portions with pointers to existing stored copies of that data. In this manner, a unique sequence of data identified by a deduplication engine is only required to be stored a single time.
A deduplication index (also sometimes referred to herein as a primary index) in a deduplication engine is a data structure used for storing signature values, such as hash values, that are associated with sequences of data that are being stored. These sequences of data are often small portions of a larger file or a data stream and are referred to as blocklets. Copies of unique blocklets are typically stored in a blockpool, which may reside in mass storage such as a hard disk drive or a storage area network. A pointer to an address or location in the blockpool can be stored in the primary index to point from the signature of a blocklet to the actual storage location of the data that comprises it.
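By way of illustration only, the relationship between blocklets, signatures, the primary index, and the blockpool can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular deduplication engine: the class name `DedupStore`, the helper `store_stream`, the use of SHA-256 as the signature, the in-memory `bytearray` standing in for mass storage, and the fixed 8-byte blocklet size are all hypothetical choices made for brevity.

```python
import hashlib


class DedupStore:
    """Toy deduplication store: a primary index plus an append-only blockpool."""

    def __init__(self):
        # Stand-in for mass storage (assumption: real blockpools live on disk).
        self.blockpool = bytearray()
        # Primary index: signature -> (offset, length) of the blocklet in the blockpool.
        self.primary_index = {}

    def put(self, blocklet: bytes) -> bytes:
        """Store a blocklet only if its signature is new; return the signature."""
        signature = hashlib.sha256(blocklet).digest()
        if signature not in self.primary_index:
            offset = len(self.blockpool)
            self.blockpool.extend(blocklet)
            self.primary_index[signature] = (offset, len(blocklet))
        return signature

    def get(self, signature: bytes) -> bytes:
        """Follow the pointer held in the primary index back to the stored data."""
        offset, length = self.primary_index[signature]
        return bytes(self.blockpool[offset:offset + length])


def store_stream(store: DedupStore, data: bytes, size: int = 8) -> list:
    """Split data into fixed-size blocklets and return the list of signatures."""
    return [store.put(data[i:i + size]) for i in range(0, len(data), size)]


store = DedupStore()
recipe = store_stream(store, b"AAAAAAAABBBBBBBBAAAAAAAA")
# Three blocklets were presented, but only the two unique ones occupy the blockpool.
restored = b"".join(store.get(s) for s in recipe)
```

In this sketch the returned list of signatures plays the role of the pointers that replace duplicate data: the stream can be rebuilt from it even though the repeated blocklet is stored only once.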
One of the bottlenecks in bulk data matching tasks, such as data deduplication, is access to the primary index. This bottleneck exists because in many storage technologies, such as disk drives and even solid state storage, random data access is much slower than sequential data access. Because of this slowness, content-driven lookup, such as looking up or searching for a signature of a blocklet in a primary index, can take considerable time, as it is an inherently random process. The slowness of random access is compounded by the fact that primary indices can often be very large.
During deduplication, a signature value, such as a hash value for a blocklet of data being deduplicated, may initially be looked up in the primary index of a deduplication engine. In some embodiments, the primary index then references a storage location outside of the primary index, but typically still within the deduplication engine, such as a cluster header. This “outside storage location” typically comprises, or is closely related to, a sequential representation of blocklets of a previously deduplicated data stream (also sometimes referred to herein as a “data sequence”). After an initial blocklet's signature is found in this sequential representation, time and computational savings are realized if the next blocklet from the data stream that is currently being deduplicated happens to be a sequential repetition of the previously seen data stream that is represented in the outside storage. In that case, the signature of the next blocklet can be matched against the next entry of the sequential representation, which spares the deduplication engine the time and computational resources required to search for the signature by random access through the primary index. As a fair amount of stored data tends to be repetitive in nature, such use of storage outside the primary index can generate an overall gain in deduplication efficiency.
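The sequential matching described above can be illustrated with a short sketch. It is offered only as one possible reading of the passage, under several assumptions: the class name `SequenceAwareIndex`, the representation of each data sequence as a Python list of signatures, the counter `random_lookups`, and the use of SHA-256 are hypothetical, and a dictionary lookup merely stands in for a slow random access to a large on-disk index.

```python
import hashlib


def sig(blocklet: bytes) -> bytes:
    """Signature of a blocklet (assumption: SHA-256)."""
    return hashlib.sha256(blocklet).digest()


class SequenceAwareIndex:
    def __init__(self):
        self.primary_index = {}   # signature -> (sequence_id, position)
        self.sequences = []       # "outside storage": signatures in stored order
        self.random_lookups = 0   # counts simulated random accesses to the primary index

    def ingest(self, blocklets) -> int:
        """Deduplicate a stream; return how many blocklets were recognized as duplicates."""
        sequence = []
        self.sequences.append(sequence)
        sequence_id = len(self.sequences) - 1
        expected = None           # (sequence_id, position) predicted for the next blocklet
        duplicates = 0
        for blocklet in blocklets:
            s = sig(blocklet)
            hit = None
            # First try the cheap sequential prediction from the last match.
            if expected is not None:
                seq_id, pos = expected
                seq = self.sequences[seq_id]
                if pos < len(seq) and seq[pos] == s:
                    hit = expected
            # Otherwise fall back to a random-access search of the primary index.
            if hit is None:
                self.random_lookups += 1
                hit = self.primary_index.get(s)
            if hit is None:
                # Never-before-seen blocklet: record it in the index and the sequence.
                self.primary_index[s] = (sequence_id, len(sequence))
                sequence.append(s)
                expected = None
            else:
                duplicates += 1
                expected = (hit[0], hit[1] + 1)
        return duplicates


index = SequenceAwareIndex()
stream = [bytes([i]) * 4 for i in range(10)]
index.ingest(stream)          # first pass: every blocklet is new, 10 random lookups
index.random_lookups = 0
index.ingest(stream)          # second pass: 1 random lookup, then 9 sequential matches
```

When the same stream is presented a second time in this sketch, only the first blocklet requires a search of the primary index; each remaining blocklet is matched against the next entry of the stored sequence.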
An incoming data stream can have a different sequence from the previously stored data, even though the content of the data has not changed. Efficient data deduplication under these conditions is extremely challenging. For example, data can be moved or shuffled around, so that one set of previously-seen data has been inserted between two other formerly consecutive sets of previously-seen data. In this example, because the data stream is typically parsed into blocklets in a somewhat random manner that is irrespective of the actual data files in the data stream, one blocklet can consist of the tail end of a first set of data and the front end of the inserted, second set of data. This type of blocklet, which consists wholly of previously-seen data, albeit not in the same order, is referred to herein as a transition blocklet, since the blocklet covers the transition from the first set of data (e.g., a first data file) to the second set of data (e.g., a second data file). During the deduplication process, this type of transition blocklet is in itself unrecognizable, since its signature has never previously been recorded. Stated another way, these novel transition blocklets are viewed as never-before-seen blocklets. Thus, for each occurrence of a new transition blocklet, a random access search in the primary index is required, which expends excessive time and computational resources. Once it has been determined that the transition blocklet has not been previously recognized by the deduplication engine, the transition blocklet is stored in the same manner as a new data blocklet.
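A small worked example may make the transition blocklet concrete. The sketch below is illustrative only and rests on assumptions that are not part of the description above: the function names `chunk` and `novel_blocklets` are hypothetical, the parsing rule (end a blocklet after every `.` byte) is a deliberately simple stand-in for whatever content-driven parsing an actual engine uses, and SHA-256 is assumed as the signature. Three data sets A, B, and C are first stored in the order A, C, B, and are then presented again in the order A, B, C.

```python
import hashlib


def chunk(data: bytes, delimiter: int = ord(".")) -> list:
    """Toy parser: end a blocklet after each delimiter byte, ignoring file boundaries."""
    blocklets, start = [], 0
    for i, byte in enumerate(data):
        if byte == delimiter:
            blocklets.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        blocklets.append(data[start:])   # trailing bytes with no delimiter
    return blocklets


def novel_blocklets(old_stream: bytes, new_stream: bytes) -> list:
    """Blocklets of new_stream whose signatures were not produced by old_stream."""
    seen = {hashlib.sha256(b).digest() for b in chunk(old_stream)}
    return [b for b in chunk(new_stream)
            if hashlib.sha256(b).digest() not in seen]


# Three data sets; A and B do not end on a blocklet boundary.
A = b"a1.a2.a3"
B = b"b1.b2.b3"
C = b"c1.c2.c3."

previously_stored = A + C + B    # A and C were consecutive
incoming = A + B + C             # B is now inserted between A and C

transitions = novel_blocklets(previously_stored, incoming)
# transitions == [b"a3b1.", b"b3c1."]
```

Every byte of the incoming stream has been seen before, yet the two blocklets that straddle the A-to-B and B-to-C boundaries carry signatures that were never recorded, so in this sketch they would be treated as new data.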
Additionally, it is impractical, if not impossible, to predict the signature of the blocklet that immediately follows a transition blocklet just by looking at the transition blocklet. Therefore, in conventional deduplication systems, an additional random access search in the primary index is required. In this example, the search process outlined above typically must be repeated for each successive transition blocklet, as well as (at least) for the blocklet that immediately follows the transition blocklet, again expending excessive time and computational resources.