Data reduction methods such as compression and deduplication are commonly used in storage systems to reduce the required storage capacity and thus storage cost. Compression operates at a "local" scope, such as a block or a file, whereas deduplication operates at a "global" scope: it reduces capacity by storing only one copy of each piece of duplicated data written to the system. In distributed storage systems, different clients often store the same data, so removing duplicate content improves the system's storage-space efficiency.
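As a rough illustration of the mechanism (a minimal sketch, not any particular system's implementation), a deduplicating store can fingerprint each incoming block with a cryptographic hash and physically keep only the first copy of each fingerprint; the class and names below are hypothetical:

```python
import hashlib

class DedupStore:
    """Toy block-level deduplication: store one physical copy per fingerprint."""

    def __init__(self):
        self.blocks = {}      # fingerprint -> stored block data
        self.refcount = {}    # fingerprint -> number of logical references

    def write(self, block: bytes) -> str:
        fp = hashlib.sha256(block).hexdigest()
        if fp not in self.blocks:          # first copy: store it physically
            self.blocks[fp] = block
        self.refcount[fp] = self.refcount.get(fp, 0) + 1
        return fp                          # logical address = fingerprint

    def logical_blocks(self) -> int:
        return sum(self.refcount.values())

    def physical_blocks(self) -> int:
        return len(self.blocks)

store = DedupStore()
for block in [b"aaa", b"bbb", b"aaa", b"aaa"]:  # two clients writing overlapping data
    store.write(block)
print(store.logical_blocks(), store.physical_blocks())  # 4 logical blocks, 2 physical
```

Four logical writes are served by only two physical blocks; the gap between the two counts is exactly the capacity saved by deduplication.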
However, large-scale storage environments are often composed of a set of deduplication domains, where each domain manages its data independently of the others and, in particular, performs deduplication internally. In many such environments, the storage management layer that allocates data entities, such as storage volumes, does not take content sharing among volumes into account when assigning new volumes to deduplication domains. Content sharing across domains is therefore not exploited, which can result in significant deduplication loss. Even when the management layer does consider content sharing among volumes, the content that will be written to a new volume is typically unknown at creation time, so the sharing still cannot be exploited.
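The deduplication loss described above can be made concrete with a small hypothetical example (the volumes and block contents below are invented for illustration): two volumes that share most of their blocks cost less capacity in one domain than when a content-oblivious placement splits them across two domains, each deduplicating only internally.

```python
import hashlib

def physical_capacity(volumes):
    """Capacity (in blocks) under deduplication = number of distinct fingerprints."""
    return len({hashlib.sha256(b).hexdigest() for vol in volumes for b in vol})

# Hypothetical volumes: A and B share two of their three blocks.
vol_a = [b"os-image", b"libs", b"data-a"]
vol_b = [b"os-image", b"libs", b"data-b"]

# Same domain: cross-volume sharing is captured, shared blocks stored once.
same_domain = physical_capacity([vol_a, vol_b])

# Separate domains: each deduplicates internally, shared blocks stored twice.
split_domains = physical_capacity([vol_a]) + physical_capacity([vol_b])

print(same_domain, split_domains)  # 4 vs 6: a 2-block deduplication loss
```

Placing the two volumes in the same domain stores 4 distinct blocks, while splitting them stores 6; the difference is the deduplication loss that content-aware volume assignment aims to avoid.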