A storage system comprises one or more storage devices to store information. A storage system can include a storage operating system which organizes the stored information and performs operations such as reads and writes on the storage devices. Network-based storage, or simply “network storage”, is a common type of storage system for backing up data, making large amounts of data accessible to multiple users, and other purposes. In a network storage environment, a storage server makes data available to client (host) systems by presenting or exporting to the clients one or more logical containers of data. There are various forms of network storage, including network-attached storage (NAS) and storage area network (SAN). In a NAS context, a storage server services file-level requests from clients, whereas in a SAN context a storage server services block-level requests. Some storage servers are capable of servicing both file-level requests and block-level requests.
Archival data storage is a central part of many industries, e.g., banks, government facilities/contractors, and securities brokerages. In many of these environments, it is necessary to store selected data, e.g., electronic-mail messages, financial documents or transaction records, in a read-only manner, possibly for long periods of time. Typically, data backup operations are performed to ensure the protection and restoration of such data in the event of a failure. However, backup operations often result in the duplication of data on backup storage resources, such as disks and/or tape, causing inefficient consumption of the storage space on the resources.
Furthermore, in a large-scale storage system, such as an enterprise storage network, it is common for certain data to be stored in multiple places in the storage system. Sometimes this duplication is intentional, but often it is an incidental result of normal operation of the storage system. Therefore, it is common that a given sequence of data will be part of two or more different files. “Data duplication”, as the term is used herein, generally refers to unintentional duplication of data in a given storage device or system. Data duplication generally is not desirable, because storage of the same data in multiple places consumes extra storage space, which is a valuable and limited resource.
Consequently, storage servers in many large-scale storage systems have the ability to “deduplicate” data. Data deduplication is a technique to improve data storage utilization by reducing data duplication. A data deduplication process identifies duplicate data in a data set and replaces the duplicate data with references that point to data stored elsewhere in the data set. A data set can be a data volume, data object, data section, data table, data storage, or other type of data collection.
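The replacement of duplicate data with references can be illustrated by a minimal sketch. The sketch assumes fixed-size blocks and uses a SHA-256 fingerprint to recognize identical blocks; the names `deduplicate`, `reconstruct`, `store`, and `refs` are hypothetical and chosen for illustration only, and the sketch is not intended to represent any particular storage server's implementation.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size in bytes (illustrative)


def deduplicate(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and keep each distinct block once.

    Returns (store, refs): store holds the unique blocks, and refs holds,
    for every original block position, an index into store.
    """
    store = []      # unique block contents
    index_of = {}   # fingerprint -> index of that block in store
    refs = []       # one reference per original block
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).digest()
        if fingerprint not in index_of:
            # First occurrence: store the block itself.
            index_of[fingerprint] = len(store)
            store.append(block)
        # Every occurrence, duplicate or not, becomes a reference.
        refs.append(index_of[fingerprint])
    return store, refs


def reconstruct(store, refs) -> bytes:
    """Rebuild the original data by following the references."""
    return b"".join(store[i] for i in refs)
```

In this sketch, a data set consisting of three blocks in which the first and third are identical would be stored as two unique blocks plus three references, and `reconstruct` would return the original bytes unchanged.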
The effectiveness of a deduplication process depends on both the algorithm of the deduplication process and the data in the data set. One way to know how effective a deduplication process will be on a data set is to actually collect and analyze the blocks of the data set. A “block” in this context is the smallest unit of user data that is read or written by a given file system. For example, a common block size in today's storage systems is 4 Kbytes. If the data set is large, this analysis can take a long time (e.g., many hours). For instance, a deduplication program can run at a data storage server. The deduplication program scans the blocks of an entire volume (i.e., the data set) of the data storage server, sorts the blocks, and reports on the deduplication effectiveness based on the number of duplicate blocks found. The scan of the entire volume can take many hours to complete, and only then does the effectiveness information become available for deciding whether to enable deduplication on that volume.
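The scan, sort, and report sequence described above can be sketched as follows. The sketch assumes that blocks are compared by SHA-256 fingerprint rather than byte for byte, and that effectiveness is expressed as the fraction of blocks that are duplicates; the function name `dedup_effectiveness` and the report fields are hypothetical.

```python
import hashlib


def dedup_effectiveness(blocks):
    """Scan all blocks, sort their fingerprints, and count duplicates.

    blocks is an iterable of bytes objects, one per block of the data set.
    """
    # Scan: fingerprint every block. Sort: equal fingerprints become adjacent.
    fingerprints = sorted(hashlib.sha256(b).digest() for b in blocks)
    total = len(fingerprints)
    # A block is a duplicate if it matches the fingerprint just before it.
    duplicates = sum(
        1 for prev, cur in zip(fingerprints, fingerprints[1:]) if prev == cur
    )
    return {
        "total_blocks": total,
        "unique_blocks": total - duplicates,
        "duplicate_blocks": duplicates,
        # Fraction of blocks that deduplication could replace by references.
        "space_savings": duplicates / total if total else 0.0,
    }
```

Because every block must be read and fingerprinted before the sort can finish, the cost of this approach grows with the size of the volume, which is consistent with the long running times noted above.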
Another way to predict the effectiveness of the deduplication process is to run the deduplication process on other, smaller data sets whose data patterns are similar to those of the target data set. However, the accuracy of this approach varies and depends heavily on how similar the data patterns of the smaller data sets are to those of the target data set.
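As a rough illustration of this second approach, the sketch below measures the duplicate fraction of a smaller data set and applies it to the block count of the target data set. It assumes, without verification, that the smaller data set is representative of the target, which is exactly the assumption the preceding paragraph cautions about; the names `savings_ratio` and `extrapolate_savings` and the 4096-byte block size are hypothetical.

```python
import hashlib


def savings_ratio(blocks):
    """Fraction of blocks in the smaller data set that are duplicates."""
    total = 0
    seen = set()
    for block in blocks:
        total += 1
        seen.add(hashlib.sha256(block).digest())
    return (total - len(seen)) / total if total else 0.0


def extrapolate_savings(sample_blocks, target_block_count, block_size=4096):
    """Apply the ratio measured on the smaller data set to the target.

    Returns (ratio, estimated_bytes_saved). The estimate is only as good
    as the similarity between the two data sets.
    """
    ratio = savings_ratio(sample_blocks)
    estimated_blocks_saved = int(ratio * target_block_count)
    return ratio, estimated_blocks_saved * block_size
```

If the data patterns of the smaller data set differ from those of the target, the extrapolated figure can be off by an arbitrary amount, so the result is better treated as a rough indication than as a measurement.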