Data that requires frequent access is often stored in high-performance primary storage to keep access times short. By contrast, infrequently accessed data is often stored in slower secondary storage. While it may be desirable to have all data quickly accessible, primary storage is costly. Conversely, while secondary storage is more cost efficient, its long access times mean that it is not ideal for all applications. Additionally, in either case, it is desirable to use storage space more efficiently to promote cost savings. Thus, performance is balanced against data volume reduction to attain efficient data storage.
One approach to addressing the performance versus reduction tradeoff uses data de-duplication to reduce data storage volume. Data de-duplication, which may be referred to as “dedupe”, involves eliminating redundant data to optimize allocation of storage space. De-duplication may involve dividing a larger piece of data into smaller pieces of data. Larger pieces of data may be referred to as “blocks” while the smaller pieces of data may be referred to as “sub-blocks” or “chunks”. Dividing blocks into sub-blocks may be referred to as “chunking” or “parsing”.
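By way of illustration only, the following minimal sketch shows one way redundant sub-blocks might be eliminated. It assumes fixed size sub-blocks and uses SHA-256 digests to recognize duplicates. The names `dedupe_fixed`, `rebuild`, `store`, and `recipe` are hypothetical and are chosen for this example.

```python
import hashlib


def dedupe_fixed(block: bytes, chunk_size: int = 4):
    """Divide a block into fixed size sub-blocks and keep each unique one once.

    Illustrative assumption: SHA-256 digests identify duplicate sub-blocks.
    """
    store = {}   # digest -> sub-block bytes (unique sub-blocks only)
    recipe = []  # ordered digests needed to rebuild the original block
    for i in range(0, len(block), chunk_size):
        chunk = block[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # store only the first occurrence
        recipe.append(digest)
    return store, recipe


def rebuild(store, recipe) -> bytes:
    """Reassemble the original block from the unique sub-blocks."""
    return b"".join(store[d] for d in recipe)


if __name__ == "__main__":
    block = b"ABCDABCDABCDWXYZ"
    store, recipe = dedupe_fixed(block)
    print(len(recipe), "sub-blocks,", len(store), "stored")
    assert rebuild(store, recipe) == block
```

In this sketch the block is parsed into four sub-blocks but only two are stored, since three of the sub-blocks are identical.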
There are different approaches to parsing. In one approach, a rolling hash may identify sub-block boundaries, producing sub-blocks of variable length. In another approach, instead of identifying boundaries for variable sized chunks using a rolling hash, parsing may be performed by simply taking fixed size sub-blocks. In a hybrid approach, variable length chunking based on a rolling hash may be combined with fixed size chunking.
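The fixed size and variable length approaches described above may be sketched as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation: the polynomial rolling hash, the constants `BASE` and `MOD`, and the parameters `window`, `mask`, `min_size`, and `max_size` are all hypothetical choices made for this example. The `max_size` cap forces a boundary when the rolling hash finds none, which loosely resembles a hybrid approach.

```python
def chunk_fixed(block: bytes, size: int = 64):
    """Parse by simply taking fixed size sub-blocks."""
    return [block[i:i + size] for i in range(0, len(block), size)]


def chunk_rolling(block: bytes, window: int = 16, mask: int = 0x3F,
                  min_size: int = 16, max_size: int = 256):
    """Parse into variable length sub-blocks using a polynomial rolling hash.

    A boundary is declared when the low bits of the hash over the last
    `window` bytes are all zero, or when the chunk reaches `max_size`.
    All constants here are illustrative assumptions.
    """
    BASE = 257
    MOD = (1 << 31) - 1
    top = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window
    chunks = []
    start = 0  # index where the current chunk begins
    h = 0      # rolling hash of the most recent bytes in this chunk
    for i, b in enumerate(block):
        if i - start >= window:
            # Slide the window: drop the contribution of the oldest byte.
            h = (h - block[i - window] * top) % MOD
        h = (h * BASE + b) % MOD
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(block[start:i + 1])
            start = i + 1
            h = 0  # restart the hash for the next chunk
    if start < len(block):
        chunks.append(block[start:])  # trailing partial chunk
    return chunks
```

Because a boundary in `chunk_rolling` depends on the content of the most recent bytes rather than on a byte offset, its boundaries may realign after data is inserted or removed, whereas the boundaries produced by `chunk_fixed` shift with every byte that follows the change.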
Different parsing approaches may take different amounts of time to sub-divide a block into sub-blocks. Additionally, different parsing approaches may lead to more or less data reduction through dedupe. Therefore, parsing schemes have been characterized by performance (e.g., time), and reduction (e.g., percent).
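One way such a characterization might be measured is sketched below. The helper names `fixed_chunker` and `characterize` are hypothetical, and the sketch assumes that reduction is the percentage of bytes that need not be stored once duplicate sub-blocks are removed. Index and metadata overhead is ignored.

```python
import hashlib
import time


def fixed_chunker(size: int):
    """Return a parser that takes fixed size sub-blocks (example parser)."""
    def parse(block: bytes):
        return [block[i:i + size] for i in range(0, len(block), size)]
    return parse


def characterize(parse, block: bytes):
    """Return (elapsed_seconds, reduction_percent) for one parsing approach.

    `parse` is any callable that maps a block to a list of sub-blocks.
    """
    t0 = time.perf_counter()
    unique = {}  # digest -> length of the unique sub-block
    for chunk in parse(block):
        unique.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    elapsed = time.perf_counter() - t0
    stored = sum(unique.values())  # bytes kept after dedupe
    reduction = 100.0 * (1 - stored / len(block)) if block else 0.0
    return elapsed, reduction


if __name__ == "__main__":
    block = b"ABCD" * 100
    seconds, percent = characterize(fixed_chunker(4), block)
    print(f"time: {seconds:.6f} s, reduction: {percent:.1f}%")
```

The measured time depends on the machine and on the data, so the figures from a sketch like this are only meaningful for comparing approaches on the same input under the same conditions.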
By way of illustration, some parsing can be performed quickly but leads to less reduction while other parsing takes more time but leads to more reduction. For example, a variable sized parsing approach that considers multiple possible boundaries per chunk may take more time to perform but may yield substantially more reduction. In contrast, a fixed size parsing approach that considers only a single fixed size sub-block may take less time to perform but may yield minimal, if any, reduction. Thus, there may be a tradeoff between performance time and data reduction.
Conventionally, different parsing, hashing, and/or sampling approaches may balance the tradeoff between performance and reduction in different ways. Aware of the different performance times and resulting data reductions, some dedupe schemes may first analyze the type of data to be deduped before deciding on an approach. Other predictive schemes may determine an approach based on the entropy of data to be processed. The different approaches may be based on a prediction of the resulting data reduction possible in a given period of time.
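An entropy based prediction of the kind mentioned above might be sketched as follows. The sketch computes the Shannon entropy of the byte values in a sample and selects an approach from it. The function names, the threshold of 7.5 bits per byte, and the policy of using the faster fixed size approach for high entropy data (on the assumption that such data is unlikely to yield much reduction) are all hypothetical and are not taken from any particular dedupe scheme.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of the byte values, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def choose_parser(sample: bytes, threshold: float = 7.5) -> str:
    """Pick a parsing approach from the entropy of a sample.

    Assumed policy: near-random data gets the cheaper fixed size parsing,
    while other data gets the slower variable length parsing.
    """
    return "fixed" if shannon_entropy(sample) >= threshold else "variable"


if __name__ == "__main__":
    print(choose_parser(bytes(range(256))))  # every byte value equally likely
    print(choose_parser(b"aaaa" * 10))       # a single repeated byte value
```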