Network storage is a common approach for making large amounts of data accessible to many users and/or for backing up data. In a network storage environment, a storage server makes data available to client systems by presenting or exporting to the clients one or more logical containers of data. There are various known forms of network storage, including network attached storage (NAS) and storage area networks (SANs). In a NAS context, a storage server services file-level requests from clients, whereas in a SAN context a storage server services block-level requests. Some storage servers are capable of servicing both file-level requests and block-level requests.
In a large-scale storage system, such as an enterprise storage network, it is common for some data to be duplicated and stored in multiple places in the system. Sometimes data duplication is intentional and desirable, as in the case of data mirroring, for example. Often, however, data duplication is an incidental byproduct of normal operation of the storage system. For example, a given sequence of data may be part of two or more different files, LUNs, etc. Consequently, it is frequently the case that two or more segments of data stored at different locations in a storage server are identical. Unintentional data duplication is generally undesirable because storing the same data in multiple places consumes additional storage space, which is a limited resource.
Consequently, in many large-scale storage systems, storage servers have the ability to “deduplicate” data. Deduplication is a well-known method for increasing the effective capacity of a storage device or system by replacing multiple copies of identical sequences of data with a single copy, together with a much smaller amount of metadata that allows the original duplicate data to be reconstructed on demand. Techniques for deduplicating data within a network storage server are known and in commercial use today.
Deduplication tends to be a very resource-intensive process, especially when the amount of data to be deduplicated is large. To understand why this is so, first consider how deduplication is commonly performed. Initially, duplicate data segments are identified; this phase is called duplicate detection. A conventional method involves using a hash function, such as SHA-1, to compute an integer, called a “fingerprint”, from each data segment, such that different data segments are extremely unlikely to produce the same fingerprint. The fingerprints are stored in one or more fingerprint files, e.g., one fingerprint file for each data volume maintained by the system.
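The fingerprinting step described above can be sketched as follows. This is a minimal illustration only, assuming fixed-size segments; the 4 KiB segment size, the function names `fingerprint` and `fingerprint_volume`, and the (fingerprint, offset) record layout are assumptions made for the example and are not taken from any particular storage system.

```python
import hashlib

SEGMENT_SIZE = 4096  # assumed segment size in bytes; real systems may differ


def fingerprint(segment: bytes) -> int:
    """Return the SHA-1 digest of a data segment as a 160-bit integer."""
    return int.from_bytes(hashlib.sha1(segment).digest(), "big")


def fingerprint_volume(data: bytes, segment_size: int = SEGMENT_SIZE):
    """Compute one (fingerprint, offset) record per segment of a volume.

    The returned list stands in for a per-volume fingerprint file.
    """
    return [
        (fingerprint(data[offset:offset + segment_size]), offset)
        for offset in range(0, len(data), segment_size)
    ]
```

Identical segments always yield identical fingerprints, while different segments are extremely unlikely to collide, which is what makes the fingerprint usable as a compact stand-in for the segment during duplicate detection.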
The stored fingerprints are subsequently used to detect potential duplicate data segments. Duplicate detection may be done, for example, when a new data segment is about to be written, or it may be done as a background process. If two data segments have the same fingerprint, there is a high probability they are identical. Accordingly, when two fingerprints are determined to be identical, the underlying data segments are compared byte by byte to determine whether they are in fact identical. If the data segments are identical, deduplication is performed on them.
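The two-stage check described above (match on fingerprint first, then confirm byte by byte) might look like the following sketch. It assumes all segments fit in memory and are addressed by list index; the function name `find_duplicates` and the returned (keeper, duplicate) pairs are hypothetical choices for the example.

```python
import hashlib
from itertools import groupby


def find_duplicates(segments: list[bytes]) -> list[tuple[int, int]]:
    """Return (keeper_index, duplicate_index) pairs of identical segments."""
    # Sorting the (fingerprint, index) records places equal fingerprints
    # next to each other, so candidates can be found in a single pass.
    records = sorted((hashlib.sha1(s).digest(), i) for i, s in enumerate(segments))

    duplicates = []
    for _, group in groupby(records, key=lambda rec: rec[0]):
        indices = [i for _, i in group]
        keeper = indices[0]
        for other in indices[1:]:
            # An equal fingerprint only suggests a duplicate; the byte-by-byte
            # comparison confirms it before any data is deduplicated.
            # Simplification: candidates are compared against the first
            # segment of the group only.
            if segments[keeper] == segments[other]:
                duplicates.append((keeper, other))
    return duplicates
```

Note that the sketch sorts the fingerprint records before scanning them; without that ordering, each lookup would require searching the whole collection.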
When the amount of data maintained by the storage system is large, such as in an enterprise-scale network storage system, even the fingerprint files can become extremely large, e.g., potentially hundreds of gigabytes each. Consequently, locating fingerprints in these files for purposes of duplicate detection and removal would be very inefficient and take a long time unless the fingerprint files are first sorted. With current techniques, however, the sorting process itself is extremely memory-intensive; yet often only a small portion of working memory of the system is made available for deduplication, since it is usually desirable to keep most space in memory available for servicing user requests. Also, a fingerprint file is typically much larger than the amount of available working memory in the system. Consequently, each fingerprint file is broken up into smaller chunks that are sorted separately, one at a time, and then merged back together. This process can take an inordinately long time, especially if there are multiple fingerprint files to be sorted; with conventional techniques the sorting process alone can take hours for an enterprise-scale storage system. The overall performance of the deduplication process therefore drops drastically as the amount of data to be deduplicated (and hence the size and number of fingerprint files) increases.
As a result, in conventional network storage systems the fingerprint sorting process (and therefore the overall deduplication process) is not very scalable relative to the amount of data maintained by the storage system. This problem hinders the goal of supporting customers' demands for storage servers that can handle and efficiently deduplicate data sets of ever-increasing sizes.