A network storage controller is a processing system that stores and retrieves data on behalf of one or more hosts on a network, managing that data in a set of mass storage devices, such as magnetic or optical storage-based disks or tapes. Some storage controllers are designed to service file-level requests from hosts, as is commonly the case with file servers used in network attached storage (NAS) environments. Other storage controllers are designed to service block-level requests from hosts, as with storage controllers used in a storage area network (SAN) environment.
Still other storage controllers are capable of servicing both file-level requests and block-level requests, as is the case with certain storage controllers made by NetApp, Inc. of Sunnyvale, Calif.
One function commonly employed by storage controllers is data deduplication. Data deduplication eliminates redundant data to improve storage space utilization. For example, in the deduplication process, duplicate data blocks (i.e., data blocks having the same data at different locations on a logical storage device) are deleted from a logical storage device. In a scenario of perfect deduplication, only one instance of each distinct (or unique) data block is stored. Each subsequent instance simply contains a reference to the one saved unique instance of the data block, and thus the illusion is presented to clients that the duplicate copies are still present at their respective locations.
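The block-level deduplication described above can be sketched as follows. This is a minimal illustration and not the method of any particular storage controller: identifying duplicates by a SHA-256 fingerprint, and the names `deduplicate`, `read_block`, and `space_savings`, are assumptions introduced here for clarity.

```python
import hashlib

def deduplicate(blocks):
    """Keep one copy of each distinct block plus one reference per location."""
    store = {}  # fingerprint -> the single saved instance of the block
    refs = []   # refs[i] is the fingerprint for logical block location i
    for block in blocks:
        # Assumption: equal SHA-256 digests are treated as equal block contents.
        fingerprint = hashlib.sha256(block).hexdigest()
        store.setdefault(fingerprint, block)  # only the first instance is stored
        refs.append(fingerprint)              # later instances become references
    return store, refs

def read_block(store, refs, index):
    """Resolve a logical location to its data, as a client would see it."""
    return store[refs[index]]

def space_savings(store, refs):
    """Fraction of block storage eliminated by deduplication."""
    return 1 - len(store) / len(refs)

# Four logical blocks, only two of which are distinct.
blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]
store, refs = deduplicate(blocks)
restored = [read_block(store, refs, i) for i in range(len(refs))]
savings = space_savings(store, refs)
```

In this sketch `restored` equals the original `blocks` list even though only two blocks are physically kept, which corresponds to the illusion that the duplicate copies are still present at their respective locations.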
The data deduplication process can reduce the required storage capacity by reducing the amount of data (i.e., the number of data blocks) that is stored. Storing less data requires fewer physical storage resources, which can reduce overall system cost. However, the benefit of data deduplication can vary depending on a given workload. For example, the data deduplication function may be turned off for certain workloads that do not have a high level of duplication, both to avoid degradation of input/output (I/O) performance and to avoid metadata overhead.
Accordingly, determining whether to use data deduplication may involve making a determination (or estimate) of the benefit of data deduplication for a given workload or dataset. Unfortunately, existing deduplication estimations are either not fast enough or not accurate enough. Currently, the simplest way to discover the benefit of data deduplication is to turn on or activate the data deduplication features. If the benefit is not satisfactory, the data deduplication process can be reverted. However, this naïve approach is very time-consuming due to the overhead of deduplication.
Various alternative approaches to estimating the potential benefit of data deduplication suffer from low accuracy. For example, an estimate based only on the type of workload has low accuracy. Similarly, random-sampling-based estimations (i.e., estimations based on a random sample of a dataset or volume) have also proven to have low accuracy. This is primarily because, for any random-sampling-based estimation function, there are frequency distributions that cause it to be very inaccurate, unless the sample percentage is very large (e.g., greater than 50% of the dataset size).
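The weakness of random sampling can be illustrated with a small experiment. The sketch below is a hypothetical example and not an estimator taken from any product: it assumes a synthetic dataset in which every distinct block occurs exactly twice, and a naive estimator (`sampled_savings`, a name introduced here) that reports the savings observed within a 10% random sample.

```python
import random

def true_savings(blocks):
    """Exact fraction of blocks that deduplication would eliminate."""
    return 1 - len(set(blocks)) / len(blocks)

def sampled_savings(blocks, fraction, rng):
    """Naive estimate: the savings observed inside a random sample."""
    sample_size = int(len(blocks) * fraction)
    sample = rng.sample(blocks, sample_size)  # sampling without replacement
    return 1 - len(set(sample)) / len(sample)

# Assumed frequency distribution: 10,000 distinct blocks, each stored twice.
# Integers stand in for block fingerprints.
blocks = [block_id for block_id in range(10_000) for _ in range(2)]

rng = random.Random(0)  # fixed seed so the run is repeatable
actual = true_savings(blocks)                   # half of the blocks are duplicates
estimate = sampled_savings(blocks, 0.10, rng)   # far below the actual savings
```

Under these assumptions a duplicate is visible in the sample only when both copies of a block happen to be drawn, which occurs for roughly 0.1 × 0.1 of the pairs. The sample therefore looks almost free of duplicates, and the naive estimate falls well below the actual savings of 0.5.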
Therefore, the computational complexity, latency, and poor accuracy of existing techniques for estimating the effectiveness of a deduplication process pose a significant challenge in determining whether to apply deduplication in a given context.