Data deduplication (also known as “single-instance storage”, “capacity optimization” or “intelligent compression”) refers to reducing storage requirements by eliminating the storage of redundant data. Under deduplication, only one unique instance of a piece of content is actually retained on storage media, and multiple objects can point to that single instance. For example, a file system might contain 100 instances of the same one-megabyte file. If the file system is backed up or archived without deduplication, all 100 instances are saved, requiring 100 megabytes to store the same one megabyte of content 100 times. With deduplication, only one instance of the file is actually stored, and that instance is referenced 100 times. Deduplication is thus a useful methodology for data storage management.
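The 100-file example above can be sketched in a few lines of Python. This is only an illustrative toy, not any particular product's design: the `DedupStore` class, its method names, and the use of a SHA-256 digest over the whole file as the content identifier are all assumptions made for the sketch.

```python
import hashlib


class DedupStore:
    """Toy whole-file deduplication store (hypothetical, for illustration)."""

    def __init__(self):
        self.chunks = {}   # content digest -> the single stored copy of the bytes
        self.objects = {}  # object name -> digest of the content it points to

    def write(self, name, data):
        # Assumption: content identity is decided by a SHA-256 digest.
        digest = hashlib.sha256(data).hexdigest()
        # Store the bytes only if this content has not been seen before.
        self.chunks.setdefault(digest, data)
        self.objects[name] = digest

    def logical_bytes(self):
        # Space the objects would occupy without deduplication.
        return sum(len(self.chunks[d]) for d in self.objects.values())

    def physical_bytes(self):
        # Space actually retained on the storage media.
        return sum(len(c) for c in self.chunks.values())


MB = 1024 * 1024
store = DedupStore()
one_mb_file = b"x" * MB
for i in range(100):
    store.write(f"copy_{i}", one_mb_file)

print(store.logical_bytes() // MB)   # 100
print(store.physical_bytes() // MB)  # 1
```

The sketch never reclaims content when an object name is overwritten or removed; it is meant only to show 100 references resolving to one stored instance.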
As useful as deduplication is, several aspects of this technology could use improvement. One such issue is the difficulty in determining how much space is wholly owned by a given object or set of objects. Suppose a set of objects O[n] (i.e., objects O1, O2, . . . On) is written to a deduplication store (i.e., a storage application using deduplication). It would then be useful at some later time to determine how much space would be rendered freeable if object set O[n] were to be deleted. Without this information, space management of the deduplication store is very difficult. If the store approaches full capacity, the application can only react by deleting objects in sequence, observing how much space is actually freed as deletion proceeds. There is currently no deterministic way to predict how much space could be freed by deleting a particular set of objects.
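To make the notion of wholly owned space concrete, the following toy Python sketch computes it for a tiny in-memory model. It rests on assumptions not stated above: that each object is recorded as a set of chunk identifiers, that every chunk's size is known, and that the complete object-to-chunk map can be scanned in full. The function name `wholly_owned_bytes` and the sample data are hypothetical. The sketch illustrates the definition of the quantity only; it does not suggest that an actual deduplication store exposes such a map or could afford the scan.

```python
def wholly_owned_bytes(objects, chunk_sizes, to_delete):
    """Bytes referenced only by the objects in `to_delete`.

    objects:     object name -> set of chunk ids it references (assumed model)
    chunk_sizes: chunk id -> size in bytes
    to_delete:   names of the objects in the candidate set
    """
    to_delete = set(to_delete)

    # Chunks that at least one surviving object still references.
    kept = set()
    for name, chunks in objects.items():
        if name not in to_delete:
            kept |= chunks

    # Chunks referenced by the candidate set.
    candidates = set()
    for name in to_delete:
        candidates |= objects[name]

    # Only chunks with no surviving reference become freeable.
    return sum(chunk_sizes[c] for c in candidates - kept)


chunk_sizes = {"a": 4, "b": 6, "c": 10}
objects = {"O1": {"a", "b"}, "O2": {"b", "c"}, "O3": {"c"}}

print(wholly_owned_bytes(objects, chunk_sizes, ["O1"]))        # 4
print(wholly_owned_bytes(objects, chunk_sizes, ["O1", "O2"]))  # 10
print(wholly_owned_bytes(objects, chunk_sizes, ["O3"]))        # 0
```

Note that the result for the set {O1, O2} is not the sum of the results for O1 and O2 taken separately (4 and 0), because chunk "b" is shared only within the set. This is part of why observing deletions one object at a time gives a poor prediction for a set.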
A related issue involves the difficulty in determining the allocation cost of an object set (i.e., the storage effectively allocated to that object set). As with predicting how much space could be freed by deleting a particular set of objects, there is currently no deterministic way to determine the current allocation cost of an object set.
These two issues, the inability to effectively determine wholly-owned space and allocation cost for an object set, constitute a significant problem in current deduplication storage technology. It would be desirable to address these issues.