Data deduplication reduces the overall amount of data storage required to represent and retain data by identifying duplicate portions of the data and replacing those duplicate portions with pointers to existing copies of that data. Presently, solutions that provide data deduplication capability use a static, uniform method for all data that is processed, with no differentiation based on the data being deduplicated. As a result, the various algorithm parameters used in deduplicating the data are established up front in an attempt to provide the best overall performance in terms of metrics such as: data deduplication ratio; write throughput; read throughput; and resource consumption and overhead (e.g., CPU cycles, memory, and disk capacity consumed/needed).
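The find-duplicates-and-replace-with-pointers operation described above can be sketched as follows. This is a minimal illustration only, assuming fixed-size blocks and SHA-256 fingerprints; the function names and the 4096-byte block size are illustrative choices, not drawn from any particular deduplication product.

```python
import hashlib

def deduplicate(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks; store each unique block once
    and represent the stream as a sequence of pointers (fingerprints)."""
    store = {}       # fingerprint -> block bytes (the existing copies)
    pointers = []    # one pointer per block position in the stream
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        fp = hashlib.sha256(block).hexdigest()
        store.setdefault(fp, block)   # keep only the first copy seen
        pointers.append(fp)           # duplicates become pointers
    return store, pointers

def reconstruct(store, pointers):
    """Rebuild the original stream by following the pointers."""
    return b"".join(store[fp] for fp in pointers)
```

In this sketch the deduplication ratio corresponds to the number of pointers divided by the number of unique blocks actually retained in the store.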
Presently, tradeoffs are necessitated in order to establish the parameters of a deduplication algorithm (such as average block size) or the choice of the particular deduplication algorithm that is used. File-based deduplication (e.g., Single Instance Storage (SIS)) may be chosen to maximize throughput performance and minimize overhead, but at the sacrifice of deduplication ratio. Block-based deduplication may be chosen to achieve better deduplication, but at the expense of throughput. Within block-based deduplication, fixed-length or variable-length methods may be used, and larger or smaller average block lengths may be used, with each predetermined choice carrying additional tradeoff considerations (e.g., deduplication ratio vs. block size, indexing overhead, and long-term fragmentation effects, to name a few).
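The variable-length alternative mentioned above can be illustrated with a simple content-defined chunking sketch, in which block boundaries are declared where a rolling value over the data hits a target, so boundaries follow content rather than fixed offsets. The rolling scheme and the average, minimum, and maximum lengths here are illustrative assumptions, not the method of any particular product.

```python
def variable_chunks(data: bytes, avg: int = 64,
                    min_len: int = 16, max_len: int = 256):
    """Content-defined chunking sketch: a boundary is declared when a
    simple rolling value satisfies (value % avg == 0), subject to the
    minimum and maximum block-length parameters."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        # Boundary: content-defined hit past min_len, or max_len forced.
        if (length >= min_len and rolling % avg == 0) or length >= max_len:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Because boundaries are content-defined, inserting bytes near the start of a stream shifts only nearby chunk boundaries rather than every subsequent fixed-offset block, which is the usual motivation for accepting the extra per-byte computation.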
As but one example, a conventional deduplication solution might settle in advance on using a variable-length block-based approach to deduplicating all data that is being stored in a data storage system, such as a virtual tape library or an automated storage system. Based on the chosen solution, all data will be processed in exactly the same way with the same static parameters, with the output of all deduplication being stored in a single blockpool. A blockpool is a repository for storing a (typically) unique sequence of data, known as a “block,” that is referenced to find/replace redundant instances of the sequence. A block may be large, such as an entire file, or small, such as a number of sequential bytes in a file or data stream. A sub-file-sized portion of data is still a block, but may be referred to as a blocklet. The characteristics of this single blockpool are, likewise, determined in advance (often based upon tradeoffs) and remain static for all data that is processed. Some examples of these static characteristics include the mean block length, minimum block length, and maximum block length.
The drawings referred to in this brief description should be understood as not being drawn to scale unless specifically noted.