The present invention relates to the technical field of file deduplication. More specifically, the present invention relates to the use of hash-trees with adaptive resource utilization for deduplication.
Typically, it is desirable for multiple different clients, which may be completely unaware of each other, to store data on the same storage system. This allows the entity providing the storage system to more efficiently utilize storage space, often by removing duplicate content. This reduction in overall storage usage by detecting replicated content, or deduplication, is an optimization technique employed by file systems.
Hashes are short checksums that describe much larger data in a compact fashion. Although hashes provide only negative proofs (different hashes mean different inputs), hash-based comparisons can be used to quickly identify potentially identical input blocks. With a sufficiently large hash output, the chance of hash collisions is low enough to shorten the list of potentially matching candidates considerably, even if a subsequent bytewise comparison is needed for absolute certainty. Hash trees extend hashing to structures with O(log N) hierarchical levels of hashes for N input blocks, allowing updates in sub-linear time.
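By way of illustration only, the sub-linear update property of a hash tree can be sketched as follows. The class name `HashTree`, the choice of SHA-256, the binary fan-out, and the padding with empty blocks are assumptions made for this sketch and are not taken from the description above.

```python
import hashlib


def h(data: bytes) -> bytes:
    """Hash helper; SHA-256 is an arbitrary choice for this sketch."""
    return hashlib.sha256(data).digest()


class HashTree:
    """Binary hash tree over a list of blocks, padded to a power of two."""

    def __init__(self, blocks):
        n = 1
        while n < len(blocks):
            n *= 2
        self.n = n
        # Heap layout: node i has children 2i and 2i+1; leaves sit at n..2n-1.
        self.nodes = [b""] * (2 * n)
        for i in range(n):
            block = blocks[i] if i < len(blocks) else b""
            self.nodes[n + i] = h(block)
        for i in range(n - 1, 0, -1):
            self.nodes[i] = h(self.nodes[2 * i] + self.nodes[2 * i + 1])

    @property
    def root(self) -> bytes:
        return self.nodes[1]

    def update(self, index: int, block: bytes) -> int:
        """Replace one block and return the number of hashes recomputed."""
        i = self.n + index
        self.nodes[i] = h(block)
        recomputed = 1
        i //= 2
        while i >= 1:
            # Only the ancestors of the changed leaf are rehashed.
            self.nodes[i] = h(self.nodes[2 * i] + self.nodes[2 * i + 1])
            recomputed += 1
            i //= 2
        return recomputed
```

In this sketch, replacing one block out of N touches one leaf hash plus one hash per level above it, rather than all N leaf hashes. Two trees with equal root hashes are candidates for identical content, while differing roots prove that the content differs.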
Hash-based comparisons are frequently used in large systems to provide fast equality comparisons, and many deduplication methods use hash trees to detect identical blocks locally, within the same storage node. However, because high-performance, dynamic hash data structures and cross-network transparency impose conflicting requirements, distributed file systems (DFSes) do not use file-system-wide global deduplication based on hash trees. Some state-of-the-art file systems use only block-by-block comparison and do not arrange hashes into hash trees. Note that in btrfs (B-tree file system), the hash-storage implementation problems center on storing hashes efficiently within existing linear metadata, which effectively prevents organizing the hashes into hash trees. No distributed file system searches for replicated blocks in a global hash tree of all DFS content.
Since DFSes may vary in both static and dynamic size, one may not simply scale up hash trees from local nodes. In addition to raw sizes, a DFS may elect to change resource usage radically at runtime, such as by increasing memory usage during recovery. An efficient hash-based, DFS-wide deduplication system must be able to adapt to such dynamic changes within reasonable limits. Therefore, hash-assisted deduplication may need to prioritize adaptivity over raw efficiency.
Practical limitations on DFS sizes imply that even very short truncated hashes, starting at 32 bits, radically reduce hash collision probabilities. Increasing the hash size increases hash tree storage essentially linearly: storage of the lowest-level hashes dominates the total size, and the higher hash levels contribute little (their growth may be superlinear, but with a low constant). At the same time, increasing the hash size reduces the chance of hash collisions exponentially. Larger hashes are therefore highly desirable for comparisons, yet they cost storage, and these two contradictory design goals need to be balanced.
For an order-of-magnitude comparison, individually numbering all atoms in the Solar System would require approximately 192-bit indexes. The largest known comparable single-image DFS as of this writing, at CERN, contains approximately 2^32 disk blocks. Under these conditions, a minimum hash size of 32 bits would be a reasonable lower bound, and allowing at least 128 bits as an upper limit would allow the system to scale to any foreseen DFS, with a negligible chance of hash collisions, even after truncation.
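The trade-off between hash size, storage, and collision probability can be made concrete with a rough calculation. The following is a minimal sketch that assumes uniformly distributed hash values and a store of 2^32 blocks (an illustrative figure); the function names are hypothetical, and the collision estimate is the standard birthday-bound approximation rather than a property of any particular hash function.

```python
import math


def leaf_storage_bytes(num_blocks: int, hash_bits: int) -> int:
    """Storage for the lowest-level hashes: grows linearly with hash size."""
    return num_blocks * hash_bits // 8


def false_candidates_per_lookup(num_blocks: int, hash_bits: int) -> float:
    """Expected number of stored, non-identical blocks whose truncated hash
    matches a new block: halves with every additional hash bit."""
    return num_blocks / 2.0 ** hash_bits


def any_collision_probability(num_blocks: int, hash_bits: int) -> float:
    """Birthday-bound approximation of the probability that at least two
    distinct blocks anywhere in the store share a hash value."""
    pairs = num_blocks * (num_blocks - 1) / 2
    return -math.expm1(-pairs / 2.0 ** hash_bits)


if __name__ == "__main__":
    n = 2 ** 32  # assumed number of blocks
    for bits in (32, 64, 96, 128):
        print(
            bits,
            leaf_storage_bytes(n, bits),
            false_candidates_per_lookup(n, bits),
            any_collision_probability(n, bits),
        )
```

Under these assumptions, a 32-bit hash leaves on the order of one false candidate per lookup, which a bytewise comparison can resolve, while a 128-bit hash quadruples leaf storage and makes a collision anywhere in the store negligibly likely.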
While there is considerable hash-assisted deduplication work in some state-of-the-art file systems, many of these attempts are limited to local storage and do not scale to DFSes. As a general theme, approaches that inherit local-system limitations cannot be extended to DFSes without running into major scalability limits. At a lower level, deduplication that works only for data within predefined limits is entirely unsuitable for DFSes, which are by nature dynamic and volatile.
Even in some file systems which integrate multiple storage devices in one system, block-by-block comparisons are used, without arranging hashes into hash trees. This is usually a design limitation: when hashes are tied to disk blocks, it is inconvenient to reorganize them into memory-resident hash structures. When deduplication is restricted to statically sized, worst-case data structures, the fixed constraint may prohibit deployment, even if results could be obtained through the use of smaller data structures. Tolerating the cost of additional checks has been shown to work with inexact data structures in large distributed systems.
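As a hedged illustration of how an inexact structure can trade memory for additional checks, the sketch below indexes blocks by a truncated hash and confirms every candidate with a bytewise comparison. The class `TruncatedHashIndex`, its `hash_bits` parameter, and the use of SHA-256 are assumptions made for this example and do not describe any particular file system.

```python
import hashlib
from collections import defaultdict


class TruncatedHashIndex:
    """In-memory deduplication index keyed by a truncated hash."""

    def __init__(self, hash_bits: int = 32):
        self.hash_bytes = hash_bits // 8
        self.buckets = defaultdict(list)  # truncated hash -> block ids
        self.blocks = []                  # block id -> block content
        self.extra_checks = 0             # bytewise checks that did not match

    def _key(self, block: bytes) -> bytes:
        # Fewer retained bytes mean less memory but more false candidates.
        return hashlib.sha256(block).digest()[: self.hash_bytes]

    def store(self, block: bytes) -> int:
        """Return the id of an identical stored block, adding it if new."""
        key = self._key(block)
        for block_id in self.buckets[key]:
            # The hash match is only a hint; the bytewise comparison decides.
            if self.blocks[block_id] == block:
                return block_id
            self.extra_checks += 1
        self.blocks.append(block)
        block_id = len(self.blocks) - 1
        self.buckets[key].append(block_id)
        return block_id
```

With a deliberately small `hash_bits`, the index stays correct because every match is verified bytewise; the cost appears only as a growing `extra_checks` count, which is the kind of additional check the preceding paragraph describes as tolerable.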
As such, a distributed content deduplication system using hash-trees with adaptive resource utilization is needed.