Database and data storage systems typically store duplicate copies of the same data across data storage media connected to a data network. For example, consider a large data file that has been distributed to multiple email recipients over an email server in an enterprise network. Multiple copies of the same large file may reside on the email server or across various storage media in the network.
Data deduplication schemes are available that can remove the duplicate copies and thereby improve the utilization of overall network storage space. In large data storage systems, however, performing deduplication can take a very long time and may consume a considerable amount of system resources. As such, there is a time and resource cost associated with deduplicating a large set of data.
To evaluate the cost of deduplication against its benefits, one naïve approach would be to simply apply a data reduction technique to the entire data set and then measure the data reduction rate actually achieved. Since this approach can be prohibitively expensive in terms of processing time, processing power and memory consumption, it would be desirable to know in advance, before the full data set is processed, what storage savings can be expected.
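For illustration only, the naïve approach described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular deduplication product: it assumes fixed-size chunking and SHA-256 fingerprints, and the function name `naive_reduction_rate` and the parameter `chunk_size` are hypothetical. The point it shows is that every byte of the data set must be read and hashed, and one fingerprint must be held in memory per unique chunk, before the reduction rate is known.

```python
import hashlib


def naive_reduction_rate(blobs, chunk_size=4096):
    """Deduplicate the whole data set and report the savings afterwards.

    Assumptions (not from the source text): fixed-size chunks and
    SHA-256 fingerprints. Returns (total_bytes, unique_bytes, rate),
    where rate is the fraction of bytes that deduplication would remove.
    """
    seen = set()        # one fingerprint per unique chunk, held in memory
    total_bytes = 0
    unique_bytes = 0
    for blob in blobs:
        # Every byte of every object is read and hashed.
        for offset in range(0, len(blob), chunk_size):
            chunk = blob[offset:offset + chunk_size]
            total_bytes += len(chunk)
            fingerprint = hashlib.sha256(chunk).digest()
            if fingerprint not in seen:
                seen.add(fingerprint)
                unique_bytes += len(chunk)
    rate = 0.0 if total_bytes == 0 else 1.0 - unique_bytes / total_bytes
    return total_bytes, unique_bytes, rate
```

Both the running time and the size of the fingerprint set in such a sketch grow with the size of the data set, which is the cost that makes an advance estimate attractive.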
Currently, a general estimate of the achievable benefits may be calculated based on empirical studies in which different data reduction techniques were applied to data of various sizes or types. Such an estimate is typically inaccurate when dealing with unique data workloads or a specific type of use. Efficient systems and methods that can provide more accurate estimates are therefore desirable.