The basic premise of a data deduplication system is that data stored into the system must be analyzed and broken into pieces (commonly referred to as “chunks”), duplicate chunks must be identified, and the duplicates must be eliminated. To identify duplicates, each chunk is given a digital signature strong enough that two chunks with the same signature can be declared to be the same data. Normally, as the deduplication system breaks larger objects apart into chunks, it must keep track of the individual chunks that make up each larger object, so that the object can be retrieved when desired.
Deduplication reduces space requirements by eliminating redundant chunks of data objects and replacing them with links or pointers to the single remaining chunk. Generally speaking, there must be some type of index or database to keep track of the chunks of a larger object so that the larger object can be reassembled and retrieved after deduplication has removed the redundant chunks. Furthermore, the database used to track the chunks is generally embedded within the deduplication system. In other words, the deduplication system knows about its objects and chunks, but does not generally share this chunk information with any other system.
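As an illustration only, the chunk-and-index arrangement described above can be sketched in a few lines of Python. The sketch assumes fixed-size chunks and SHA-256 signatures, neither of which is specified in this description; the class name DedupStore and its methods are hypothetical and are not taken from any particular product.

```python
import hashlib

CHUNK_SIZE = 4  # assumed: tiny fixed-size chunks, chosen only for illustration


class DedupStore:
    """Hypothetical in-memory deduplicating store."""

    def __init__(self):
        self.chunks = {}   # signature -> chunk bytes (one copy per unique chunk)
        self.objects = {}  # object name -> ordered list of chunk signatures

    def put(self, name, data):
        """Break data into chunks and keep only chunks not already stored."""
        signatures = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            sig = hashlib.sha256(chunk).hexdigest()
            # A chunk whose signature is already known is not stored again.
            self.chunks.setdefault(sig, chunk)
            signatures.append(sig)
        # The index records which chunks, in which order, make up the object.
        self.objects[name] = signatures

    def get(self, name):
        """Reassemble an object by following its list of chunk signatures."""
        return b"".join(self.chunks[sig] for sig in self.objects[name])

    def stored_bytes(self):
        """Total bytes physically held after deduplication."""
        return sum(len(chunk) for chunk in self.chunks.values())
```

With these assumptions, storing b"ABCDABCDEFGH" and b"EFGHABCD" (20 bytes in total) leaves only two unique chunks, 8 bytes, in the store, and both objects can still be retrieved intact through the index.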
The deduplication system may be embodied in a storage management system that spans multiple storage volumes and storage pools. For example, data may be sent by storage management clients or data protection agents to the storage management server for storage. One characteristic of a storage management system is that data can be copied for redundancy, transferred to a new media type, or moved to reclaim unused space from deleted objects. Data redundancy and deduplication, in fact, work well together in a storage management system, because the more data that is deduplicated, the more important it is to have some backup copies of the data within the storage management system to help protect overall data integrity.
The scope of deduplication that is performed within a storage management system may involve opposing considerations. For optimal space efficiency, deduplication should be performed across as large a body of data as possible, to enable the system to identify and reduce redundancy anywhere within that body. For example, optimal space savings are achieved if data from all users with a common operating system is deduplicated and common data is shared. However, other factors may dictate that bounds be placed on the scope of deduplication. For example, it may not be desirable to deduplicate all data, as deduplication may eliminate redundancy that is needed for data protection. Similarly, for data stored on high-performance media it may not be desirable to reference common data on lower-performance media, as this would degrade access performance.
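The two bounding factors just mentioned can be pictured as a simple rule that decides whether data in one storage pool may point at a chunk held in another. The following Python sketch is one possible way to express such a rule and is not a definitive design; the Pool fields speed and dedup_group and the function may_reference are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    """Hypothetical description of a storage pool."""
    name: str
    speed: int        # assumed ranking: a higher number means faster media
    dedup_group: str  # assumed: only pools in the same group may share chunks


def may_reference(source: Pool, target: Pool) -> bool:
    """Return True if data in `source` may point at a chunk stored in `target`."""
    # Pools in different groups never share chunks, so a copy kept for
    # data protection remains a genuinely separate copy.
    if source.dedup_group != target.dedup_group:
        return False
    # Data on faster media should not depend on chunks held on slower media.
    return target.speed >= source.speed
```

Under these assumptions, a tape pool could reference chunks in a disk pool of the same group, but the disk pool could not reference chunks on tape, and neither could reference chunks in a backup copy pool placed in a different group.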
What is needed is a method for allowing the sharing of redundant chunks across a broad scope within a storage management system, while also enabling restrictions on the degree of sharing when appropriate.