The present disclosure pertains to an improved approach for implementing data de-duplication. With de-duplication, the goal is to minimize the number of copies of a given data item that are stored in a storage system. If a data item already exists in the system and is subject to de-duplication, then the storage management system will not store extra copies of that same data item. Instead, the storage management system recognizes that the data item has already been stored and reuses the existing copy of that data item.
Whenever a data item is de-duplicated, the storage management system may create metadata pertaining to the de-duplication. The metadata may include, for example, identification of the specific items of de-duplicated data, information about references to the actual data item, and reference counts for the de-duplicated data.
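By way of illustration only, the reuse of an existing copy and the associated metadata can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the present disclosure: the names `DedupEntry` and `DedupStore`, the use of a SHA-256 digest as the fingerprint, and the in-memory dictionaries are all hypothetical choices made for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DedupEntry:
    """Metadata for one de-duplicated data item (hypothetical layout)."""
    data: bytes                                   # the single stored copy
    ref_count: int = 0                            # number of references to the copy
    referrers: set = field(default_factory=set)   # logical names that reference it


class DedupStore:
    """Toy in-memory store that keeps one physical copy per distinct item."""

    def __init__(self):
        self.entries = {}  # fingerprint -> DedupEntry
        self.names = {}    # logical name -> fingerprint

    def write(self, name: str, data: bytes) -> str:
        # Assumption: a SHA-256 digest serves as the item's fingerprint.
        fingerprint = hashlib.sha256(data).hexdigest()
        if name in self.names:
            self.delete(name)  # release the item this name referenced before
        entry = self.entries.get(fingerprint)
        if entry is None:
            # First occurrence: store the data itself.
            entry = DedupEntry(data=data)
            self.entries[fingerprint] = entry
        # New or duplicate, only the metadata changes from here on.
        entry.ref_count += 1
        entry.referrers.add(name)
        self.names[name] = fingerprint
        return fingerprint

    def read(self, name: str) -> bytes:
        return self.entries[self.names[name]].data

    def delete(self, name: str) -> None:
        fingerprint = self.names.pop(name)
        entry = self.entries[fingerprint]
        entry.ref_count -= 1
        entry.referrers.discard(name)
        if entry.ref_count == 0:
            # No references remain, so the stored copy can be reclaimed.
            del self.entries[fingerprint]
```

In this sketch, writing the same bytes under two different names leaves one entry whose reference count is two, and the stored copy is reclaimed only after both names have been deleted.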
Many systems that attempt to provide de-duplication functionality seek to de-duplicate all of the data in the system, or all data that possesses a fingerprint in the system. However, even with the large-scale storage devices provided to modern information processing systems, there may be only a finite amount of room available to store metadata for de-duplication. In addition, valuable system and computing resources may need to be consumed to actually implement the de-duplication functionality. If substantial benefits exist for performing de-duplication on a given item of data, then the metadata storage and de-duplication processing costs are usefully expended for the de-duplication. However, blindly de-duplicating all data that exists in the system, or at least all data for which a fingerprint exists, will likely lead to inefficient results, since a substantial portion of the data in the system may not provide storage savings large enough to offset the cost of de-duplication.
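The trade-off described above can be made concrete with a simple storage-only estimate. The following is a hedged sketch, not a cost model taken from the present disclosure: the function names, the formula, and the example sizes are assumptions chosen for illustration, and the estimate deliberately ignores the processing cost of de-duplication.

```python
def net_savings(item_size: int, copies: int, metadata_size: int) -> int:
    """Estimated bytes saved by storing one copy instead of `copies` copies.

    Hypothetical model: every copy beyond the first is avoided, and the
    de-duplication metadata for the item costs `metadata_size` bytes.
    """
    if copies < 1:
        raise ValueError("an item must occur at least once")
    return (copies - 1) * item_size - metadata_size


def worth_deduplicating(item_size: int, copies: int, metadata_size: int) -> bool:
    """True when the estimated saving exceeds the metadata overhead."""
    return net_savings(item_size, copies, metadata_size) > 0


# Illustrative numbers only: a 4096-byte item with 64 bytes of metadata.
unique_item = net_savings(4096, copies=1, metadata_size=64)    # -64
popular_item = net_savings(4096, copies=10, metadata_size=64)  # 36800
```

Under these assumed numbers, an item that occurs only once costs 64 bytes of metadata and saves nothing, whereas an item that occurs ten times saves 36800 bytes. Data of the first kind is what makes indiscriminate de-duplication inefficient.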
Therefore, there is a need for an improved approach to implementing de-duplication that does not require de-duplication of all data within a system, and that is capable of identifying the data for which de-duplication will provide a high-yield return on the investment of de-duplication resources.