In data storage systems, the ability to store large amounts of data as efficiently as possible is of paramount importance. One approach to storing data efficiently is to store files in clusters.
In deduplication file systems, a file may be split into hundreds of millions of segments during the write process. A segment that has already been stored during an earlier write process is not rewritten; it is instead recorded in the file's offsets in order to optimize storage capacity utilization. Thus, in the context of deduplication storage systems, a cluster can be used to store segments of data to minimize the amount of searching and indexing required to retrieve a segment. Alternatively, once files are stored on a deduplication storage system, it would be beneficial to identify clusters based on their similarity and relocate clusters with similar content to different deduplication systems to achieve the same effect.
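The write behavior described above can be illustrated with a minimal sketch. It is not the method of any particular deduplication system: the class name `DedupStore`, the fixed segment size, and the use of SHA-256 fingerprints are all assumptions made for illustration, and production systems commonly use variable-size segments instead.

```python
import hashlib


class DedupStore:
    """Toy deduplicating store (hypothetical; fixed-size segments assumed)."""

    def __init__(self, segment_size=4):
        self.segment_size = segment_size
        self.segments = {}  # fingerprint -> segment bytes, stored once
        self.recipes = {}   # file name -> ordered list of fingerprints

    def write(self, name, data):
        """Store a file and return how many new segments were written."""
        recipe = []
        new_segments = 0
        for start in range(0, len(data), self.segment_size):
            segment = data[start:start + self.segment_size]
            fingerprint = hashlib.sha256(segment).hexdigest()
            if fingerprint not in self.segments:
                # First occurrence: the segment itself is written.
                self.segments[fingerprint] = segment
                new_segments += 1
            # Every occurrence is recorded by reference in the file's recipe.
            recipe.append(fingerprint)
        self.recipes[name] = recipe
        return new_segments

    def read(self, name):
        """Reassemble a file from its recorded segment references."""
        return b"".join(self.segments[fp] for fp in self.recipes[name])
```

In this sketch, writing a second file that shares most of its segments with an earlier file adds only the segments that differ, which is the capacity saving that deduplication is intended to provide.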
Being able to identify clusters of files based on their similarity is useful for a number of other reasons as well. For example, files having segments that compress/deduplicate well can be co-located in the same compression/deduplication domains/partitions, or moved together to different compression/deduplication domains/partitions or to different machines/nodes in a cluster of machines/appliances. Conversely, files having segments that do not compress/deduplicate well can be moved to less expensive storage systems that do not use deduplication.
One of the challenges in using clusters to store files, or in identifying clusters of files already stored in a file system with data deduplication, is to identify a hierarchy of clusters that will best serve the needs of the storage system. The hierarchy of clusters can be represented as a dendrogram, or tree structure, in which each cluster ideally contains files that share a significant amount of content, and in which each successive level of the dendrogram represents a finer level of granularity in the amount of content that is shared.
An optimum dendrogram is defined as one that maximizes the cohesion of the files stored in a given cluster, i.e., one in which the similarity of the files is great and the differences small. But for file systems that store millions of files, generating the optimum dendrogram is computationally intensive and difficult to implement.
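To make the notion of a cluster hierarchy concrete, the following is a minimal sketch of a naive bottom-up (agglomerative) construction of a dendrogram. It is an illustration only and not an optimum or scalable method: the function names are hypothetical, each file is assumed to be represented by the set of its segment fingerprints, and similarity is assumed to be the Jaccard index of those sets, with a merged cluster represented by the union of its members' fingerprints.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two fingerprint sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def build_dendrogram(files):
    """Naive agglomerative clustering.

    files: dict mapping file name -> set of segment fingerprints.
    Returns the merges in order as (left, right, similarity) tuples,
    where left and right are tuples of file names.
    """
    clusters = {(name,): set(fps) for name, fps in files.items()}
    merges = []
    while len(clusters) > 1:
        # Examine every pair of clusters and pick the most similar one.
        left, right = max(
            combinations(sorted(clusters), 2),
            key=lambda pair: jaccard(clusters[pair[0]], clusters[pair[1]]),
        )
        similarity = jaccard(clusters[left], clusters[right])
        merged = clusters.pop(left) | clusters.pop(right)
        clusters[tuple(sorted(left + right))] = merged
        merges.append((left, right, similarity))
    return merges
```

Because every round compares all remaining pairs of clusters, the work in this sketch grows roughly with the cube of the number of files, which illustrates why such a direct approach becomes impractical for file systems that store millions of files.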