De-duplication reduces redundant data by storing a single copy of data, typically to improve storage utilization. De-duplication may involve dividing a larger piece of data into smaller pieces of data. Larger pieces of data may be referred to as “blocks” while the smaller pieces of data may be referred to as “sub-blocks” or “chunks”. Dividing blocks into sub-blocks may be referred to as “chunking”. De-duplication may be referred to as “dedupe”.
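By way of a non-limiting illustration, the following minimal sketch chunks a block, stores one copy of each distinct sub-block, and rebuilds the block from an ordered list of fingerprints. The fixed 4-byte sub-block size, the use of SHA-256 as a fingerprint, and the names `dedupe` and `rebuild` are assumptions made only for this example.

```python
import hashlib


def dedupe(block: bytes, size: int = 4):
    """Chunk a block and keep one copy of each distinct sub-block."""
    store = {}   # fingerprint -> single stored copy of the sub-block
    recipe = []  # ordered fingerprints needed to rebuild the block
    for i in range(0, len(block), size):
        sub_block = block[i:i + size]
        key = hashlib.sha256(sub_block).hexdigest()
        store.setdefault(key, sub_block)  # store only the first occurrence
        recipe.append(key)
    return store, recipe


def rebuild(store, recipe) -> bytes:
    """Reassemble the original block from the stored sub-blocks."""
    return b"".join(store[key] for key in recipe)


block = b"ABCD" * 3 + b"WXYZ"
store, recipe = dedupe(block)
stored_bytes = sum(len(s) for s in store.values())
```

In this example the 16-byte block is held as two distinct 4-byte sub-blocks. The space taken by the recipe itself is ignored here, although it dominates in an example this small.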
In a distributed network, many nodes (e.g., computers, data streams, applications) may perform data de-duplication for a requesting entity. Conventionally, simple metrics have been used to select a node to perform de-duplication. For example, node selection may be based simply on which node responds first to a request for de-duplication, or on geographical proximity. However, the node that responds first to a request might take a substantially different approach to chunking, de-duplication, or determining uniqueness than a node that responds second to the request.
Different chunking approaches may affect the amount of time it takes a node to divide a block into sub-blocks. Additionally, different chunking approaches may lead to more or less data reduction through dedupe. Therefore, nodes may be characterized by chunking approaches that yield varying levels of performance (e.g., time), data reduction (e.g., percent), and balance between performance and data reduction. Thus, some nodes may perform more efficiently than other nodes in the distributed network, depending on the de-duplication conditions or the identity of the requesting entity.
By way of illustration, some nodes perform chunking quickly but achieve less data reduction, while other nodes perform chunking more slowly but achieve greater data reduction. For example, a variable-sized chunking approach that considers multiple possible boundaries per chunk may take more time to perform but may yield substantial data reduction. In contrast, a fixed-size chunking approach that considers only a single fixed-size sub-block may take less time to perform but may yield minimal, if any, data reduction. So, there may be a tradeoff between performance time and data reduction. Thus, nodes may perform chunking differently based on chunking factors (e.g., time to perform, data reduction).
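The tradeoff described above can be sketched as follows. This is an illustrative example only: the function names, the 64-byte fixed size, the 16-byte window, and the boundary mask are assumptions, and a CRC32 recomputed at every position stands in for the rolling hash (e.g., a Rabin fingerprint) that a practical variable-sized chunker would more likely use.

```python
import random
import zlib


def fixed_chunks(data: bytes, size: int = 64) -> list:
    """Cut at every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list:
    """Cut wherever a checksum of the trailing `window` bytes matches a pattern.

    Assumption: CRC32 recomputed per position stands in for a true rolling hash.
    """
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        # Every byte position is a candidate boundary, which costs extra time.
        if (zlib.crc32(data[end - window:end]) & mask) == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def shared(a: list, b: list) -> int:
    """Number of distinct sub-blocks that appear in both chunk lists."""
    return len(set(a) & set(b))


rng = random.Random(0)  # fixed seed so the comparison is repeatable
original = bytes(rng.getrandbits(8) for _ in range(4096))
modified = b"!" + original  # one inserted byte shifts everything after it

shared_fixed = shared(fixed_chunks(original), fixed_chunks(modified))
shared_variable = shared(variable_chunks(original), variable_chunks(modified))
```

Because the variable-sized boundaries depend only on nearby content, the chunks following the inserted byte realign with the chunks of the original data, so more sub-blocks are recognized as duplicates. The fixed-size chunker does less work per byte, but every one of its boundaries is displaced by the insertion.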
Chunking approaches are only one example of how nodes can vary. Once a sub-block has been created, there are different dedupe approaches for determining whether the sub-block is a duplicate sub-block, whether the sub-block can be represented using a delta representation, whether the sub-block is a unique sub-block, and so on. For example, one node may determine whether a sub-block is unique by hashing the sub-block and comparing the hash to hashes associated with previously encountered and/or stored sub-blocks. Different hashes may yield more or fewer determinations of uniqueness due to, for example, a collision rate associated with the hash. Another node may use an alternative approach for determining whether a sub-block is unique that involves sampling the sub-block and making a probabilistic determination based on the sampled data. For example, if none of the sample points match any stored sample points, then the sub-block may be unique, while if a certain percentage of sample points match stored sample points, then the sub-block may be a duplicate. Since different hashing and sampling schemes may yield more or fewer determinations of uniqueness, nodes having different hashing and sampling approaches may also have different performance levels and may yield different amounts of data reduction.
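The two uniqueness determinations described above can be sketched side by side. The class names, the choice of SHA-256, the sample offsets, and the 75% threshold are hypothetical values chosen for illustration; they are not taken from any particular node.

```python
import hashlib


class HashIndex:
    """Uniqueness by exact fingerprint comparison."""

    def __init__(self):
        self._fingerprints = set()

    def is_unique(self, sub_block: bytes) -> bool:
        """Return True the first time a sub-block's fingerprint is seen."""
        fingerprint = hashlib.sha256(sub_block).digest()
        if fingerprint in self._fingerprints:
            return False
        self._fingerprints.add(fingerprint)
        return True


class SampleIndex:
    """Probabilistic uniqueness from a few sampled byte positions."""

    def __init__(self, offsets=(0, 5, 11, 17), threshold=0.75):
        self.offsets = offsets      # hypothetical sample positions
        self.threshold = threshold  # hypothetical match fraction for "duplicate"
        self._stored = []           # samples of previously seen sub-blocks

    def _sample(self, sub_block: bytes):
        return tuple(sub_block[o] for o in self.offsets if o < len(sub_block))

    def classify(self, sub_block: bytes) -> str:
        sample = self._sample(sub_block)
        best = 0.0  # best fraction of matching sample points so far
        for other in self._stored:
            matches = sum(a == b for a, b in zip(sample, other))
            best = max(best, matches / max(len(sample), 1))
        self._stored.append(sample)
        if best == 0.0:
            return "probably unique"
        if best >= self.threshold:
            return "probably duplicate"
        return "undetermined"


hashes = HashIndex()
first_seen = hashes.is_unique(b"the quick brown fox")
second_seen = hashes.is_unique(b"the quick brown fox")

samples = SampleIndex()
verdicts = [samples.classify(b) for b in (
    b"the quick brown fox",
    b"the quick brown fox",
    b"THE QUICK BROWN FOX",
)]
```

The sampling index inspects only four bytes per sub-block, so it may be faster than hashing the whole sub-block, but two sub-blocks that differ only at unsampled positions would be reported as probable duplicates. A node using such an approach would need a further check before discarding data.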
Once a block is chunked into sub-blocks and the sub-blocks are compared to previously stored sub-blocks, it can be determined whether a sub-block needs to be stored. Nodes may charge a per-sub-block fee for storage or charge fees to access stored blocks. Thus, one node may have more cost-effective storage than another node in the distributed network. For example, while a first node may have previously stored many of the sub-blocks to be de-duplicated, if the cost per sub-block of storage is disproportionately high, it may be more efficient to have a second node store the sub-blocks, even if more sub-blocks need to be stored.
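The cost comparison in the preceding example can be worked through with a short sketch. The function name `storage_cost`, the fee values, and the use of small integers as stand-ins for sub-block fingerprints are all assumptions made for illustration.

```python
def storage_cost(incoming: set, already_stored: set, fee_per_sub_block: float) -> float:
    """Fee a node would charge to store the sub-blocks it does not already hold."""
    return len(incoming - already_stored) * fee_per_sub_block


incoming = set(range(10))  # stand-ins for ten sub-block fingerprints

# First node: holds 8 of the 10 sub-blocks but charges a high fee (hypothetical).
cost_first = storage_cost(incoming, set(range(8)), fee_per_sub_block=5.0)
# Second node: holds only 2 of the 10 but charges a low fee (hypothetical).
cost_second = storage_cost(incoming, {0, 1}, fee_per_sub_block=1.0)
```

Under these assumed fees, the first node would charge 10.0 to store two new sub-blocks while the second node would charge 8.0 to store eight, so the second node is cheaper despite storing more data.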
While a node's response time to a request and geographical proximity may be relevant to selecting a node, a more sophisticated choice can be made based on one or more affinity factors (e.g., chunking approach, data de-duplication approach, uniqueness determination, total cost).
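One way such a choice might be made is sketched below as a weighted score over several factors. The `NodeProfile` fields, the weights, and the linear scoring formula are hypothetical; the text above does not prescribe how affinity factors are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeProfile:
    name: str
    response_ms: float        # time to answer the de-duplication request
    reduction_pct: float      # expected data reduction
    fee_per_sub_block: float  # storage cost


def affinity(node, w_reduction=1.0, w_latency=0.5, w_cost=5.0) -> float:
    """Higher is better: reward data reduction, penalize latency and cost."""
    return (w_reduction * node.reduction_pct
            - w_latency * node.response_ms
            - w_cost * node.fee_per_sub_block)


nodes = [
    NodeProfile("fast", response_ms=5.0, reduction_pct=10.0, fee_per_sub_block=2.0),
    NodeProfile("thorough", response_ms=40.0, reduction_pct=60.0, fee_per_sub_block=1.0),
]

first_responder = min(nodes, key=lambda n: n.response_ms)  # conventional choice
best_affinity = max(nodes, key=affinity)                   # affinity-based choice
```

With these assumed weights, the conventional metric selects the node named "fast" because it responds first, while the affinity score selects the node named "thorough" because its greater data reduction and lower fee outweigh its slower response.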