The present invention relates generally to storage systems and, more specifically, to scalable graph modeling of metadata for deduplicated storage systems.
Digitization of large volumes of data and an increase in the richness of content of data have led to high demands for data storage capacity. One way to meet this increasing need is to add hardware resources. However, in the storage domain, adding more storage often results in a disproportionate increase in the total cost of ownership (TCO). Although the cost of acquisition has fallen as a result of reductions in hardware costs, the cost of management (e.g., administration, power/energy) has increased. Many companies are attempting to provide a better solution by using data footprint reduction techniques such as deduplication.
Data deduplication removes or minimizes the amount of redundant data in a storage system by keeping only unique instances of the data on storage. Redundant data is replaced with a pointer to the unique data copy. By reducing space requirements, deduplication reduces the need for new storage resources. Implementations of deduplication often lead to substantial savings, both in terms of new resource acquisition costs and management costs, thereby leading to a significant reduction in TCO. In backup environments, deduplication also lowers the network bandwidth requirements for remote backups, replication and disaster recovery by reducing the amount of transmitted data on the network.
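The pointer-replacement idea described above can be illustrated with a minimal sketch. It assumes fixed-size chunking, SHA-256 fingerprints, and an in-memory dictionary as the chunk store. None of these choices comes from the description above, and the names `DedupStore`, `CHUNK_SIZE`, `write`, `read`, and `stored_bytes` are hypothetical.

```python
import hashlib

# Assumption: an unrealistically small fixed chunk size, for illustration only.
CHUNK_SIZE = 4


class DedupStore:
    """Hypothetical in-memory store that keeps one copy of each unique chunk."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes (the unique data copies)
        self.files = {}   # file name -> ordered list of fingerprints (the pointers)

    def write(self, name, data):
        refs = []
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            # Keep the chunk only if it has not been seen before.
            self.chunks.setdefault(fingerprint, chunk)
            # The file records a pointer to the chunk, not the chunk itself.
            refs.append(fingerprint)
        self.files[name] = refs

    def read(self, name):
        # Reassemble the file by following its pointers in order.
        return b"".join(self.chunks[fp] for fp in self.files[name])

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self.chunks.values())


store = DedupStore()
store.write("a", b"AAAABBBBCCCC")
store.write("b", b"AAAABBBBDDDD")
```

In this example the two files hold 24 bytes of logical data, but the chunks `AAAA` and `BBBB` are stored once, so the store keeps only four unique chunks.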
The use of deduplication introduces challenges to storage management because storage objects (e.g., files) that share content are no longer independent of each other. Such objects cannot be managed independently because a management decision on one file may affect another file. For example, in a traditional tiered storage system, individual “old” files can be migrated to a “colder” tier (e.g., from disks to tapes) without affecting other files. However, in a deduplicated tiered storage system, old files may share content with other files that are not necessarily old, so the migration of a candidate file must take into account the other files that share content with it, which complicates storage management tasks.
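The migration concern can be made concrete with a small sketch that finds every file sharing at least one chunk with a migration candidate. It assumes each file is represented as a set of chunk identifiers. The function name `files_sharing_content` and the sample file and chunk names are hypothetical, and the sketch illustrates the problem only, not any particular modeling method.

```python
from collections import defaultdict


def files_sharing_content(file_chunks, candidate):
    """Return the files that share at least one chunk with `candidate`.

    `file_chunks` maps a file name to the set of chunk identifiers it
    references (an assumed representation, for illustration only).
    """
    # Invert the mapping: chunk identifier -> files that reference it.
    chunk_to_files = defaultdict(set)
    for name, chunks in file_chunks.items():
        for chunk in chunks:
            chunk_to_files[chunk].add(name)

    # Collect every file that references any of the candidate's chunks.
    affected = set()
    for chunk in file_chunks[candidate]:
        affected |= chunk_to_files[chunk]
    affected.discard(candidate)
    return affected


file_chunks = {
    "old_report": {"c1", "c2"},
    "new_report": {"c2", "c3"},
    "photo": {"c4"},
}
affected = files_sharing_content(file_chunks, "old_report")
```

Here `old_report` and `new_report` both reference chunk `c2`, so moving `old_report` to a colder tier would also affect access to `new_report`, while `photo` is unaffected.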
Understanding the sharing relationships between data objects in a deduplicated storage system is important for efficient data management, including data placement and data retrieval.