Storage systems commonly hold duplicate data. Consequently, minimizing duplicate content on disk storage systems, at both the file and block levels, has received considerable attention from both academia and industry. Much of this research effort has been directed to deduplication storage systems, or “single-instance stores,” which, as the names imply, store only one copy of each unique data instance. Because deduplication is inherently difficult, much of this past work has focused on improving the efficiency, scalability, and speed of in-line deduplication.
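To make the idea of a single-instance store concrete, the following is a minimal sketch rather than a description of any particular system. It assumes block-level deduplication with fixed-size 4096-byte blocks and SHA-256 fingerprints; neither choice comes from the text above, and the class and method names (`SingleInstanceStore`, `write`, `read`, `stored_bytes`) are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen only for illustration


class SingleInstanceStore:
    """Hypothetical in-memory store that keeps one copy of each unique block."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes (one copy per unique block)
        self.files = {}   # file name -> ordered list of block fingerprints

    def write(self, name, data):
        """Split data into blocks and store only blocks not already present."""
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fingerprint = hashlib.sha256(block).hexdigest()
            # setdefault leaves an existing entry untouched, so a duplicate
            # block consumes no additional space.
            self.blocks.setdefault(fingerprint, block)
            recipe.append(fingerprint)
        self.files[name] = recipe

    def read(self, name):
        """Reassemble a file from its list of block fingerprints."""
        return b"".join(self.blocks[fp] for fp in self.files[name])

    def stored_bytes(self):
        """Total bytes held after deduplication."""
        return sum(len(block) for block in self.blocks.values())


store = SingleInstanceStore()
payload = b"a" * BLOCK_SIZE + b"b" * BLOCK_SIZE
store.write("original", payload)
store.write("copy", payload)  # identical content adds no new blocks
```

In this sketch the second write stores nothing new, because both of its blocks are already present under the same fingerprints. A real system would also have to handle persistence, reference counting for deletion, and the possibility of fingerprint collisions, all of which are omitted here.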
Deduplication storage systems are particularly useful for archival and backup purposes, where there may be a large number of duplicates and where storage capacity is the major cost consideration, making it a primary objective to maximize the amount of data stored. On such systems, deduplication can substantially reduce the storage capacity, bandwidth, and power required.
However, in primary storage systems (such as file servers and web servers that store user content, as well as personal and portable computer systems), reducing duplication is less beneficial. Such systems may have only a relatively moderate degree of duplication, and their dynamic and unpredictable workload characteristics make deduplication all the more difficult to implement. Moreover, other metrics, such as performance and reliability, are more important than capacity in primary storage systems, and thus maximizing capacity is not a primary objective of such systems. Also, as the cost of storage continues to decline, the value of removing duplicates to save storage space continues to decline for both primary storage systems and archival and backup storage systems, further eroding the cost savings from deduplication.