Network storage is a common approach for making large amounts of data accessible to many users and/or for backing up data. In a network storage environment, a storage server makes data available to client systems by presenting or exporting to the clients one or more logical containers of data. There are various forms of network storage, including network attached storage (NAS) and storage area networks (SANs). In a NAS context, a storage server services file-level requests from clients, whereas in a SAN context a storage server services block-level requests. Some storage servers are capable of servicing both file-level requests and block-level requests.
One common application of network storage is data mirroring. Mirroring is a technique for backing up data, where a given data set at a source is replicated exactly at a destination, which is often geographically remote from the source. The replica data set created at the destination is called a “mirror” of the original data set. Typically, mirroring involves the use of at least two storage servers, e.g., one at the source and another at the destination, which communicate with each other through a computer network or other type of data interconnect to create the mirror.
In a large-scale storage system, such as an enterprise storage network, it is common for some data to be duplicated and stored in multiple places in the storage system. Sometimes data duplication is intentional and desired, as in mirroring, but often it is an incidental byproduct of normal operation of a storage system. For example, a given sequence of data may be part of two or more different files, LUNs, etc. Consequently, it is frequently the case that two or more blocks of data stored at different block addresses in a storage server are actually identical. Data duplication generally is not desirable, since storage of the same data in multiple places consumes extra storage space, which is a limited resource. For this reason, in many large-scale storage systems, storage servers have the ability to “deduplicate” data.
Deduplication is a well-known method for increasing the effective capacity of a storage device or system by replacing multiple copies of identical sequences of data with a single copy, together with a much smaller amount of metadata that allows the original duplicate data to be reconstructed on demand. Techniques for deduplicating within a single storage server (or a single node in a storage cluster) are in widespread commercial use.
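The idea of keeping a single copy plus a small amount of metadata can be illustrated with a minimal sketch. It is not any particular product's implementation. The class name `DedupStore`, the fixed 4096-byte block size, and the use of a SHA-1 digest to recognize identical blocks are all assumptions made only for illustration.

```python
import hashlib


class DedupStore:
    """Toy deduplicating block store (hypothetical, for illustration only)."""

    def __init__(self):
        self.blocks = {}  # digest -> the single stored copy of a block
        self.files = {}   # file name -> ordered list of digests (the metadata)

    def write(self, name, data, block_size=4096):
        """Split data into fixed-size blocks and store each distinct block once."""
        refs = []
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            digest = hashlib.sha1(block).digest()
            # Keep the block only if no block with this digest is stored yet.
            self.blocks.setdefault(digest, block)
            refs.append(digest)
        self.files[name] = refs

    def read(self, name):
        """Reconstruct the original data on demand from the metadata."""
        return b"".join(self.blocks[digest] for digest in self.files[name])

    def stored_bytes(self):
        """Bytes of block data actually held, excluding metadata."""
        return sum(len(block) for block in self.blocks.values())
```

In this sketch, writing two files that contain the same two blocks in different orders stores only two blocks, while each file can still be read back in full from its list of digests.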
A related use of deduplication is to reduce the amount of data sent over a network, such as in a data mirroring system. If the recipient of transmitted data has already stored a set of data segments, and another node of the network wants to send it a further data segment, deduplication techniques can be used to avoid sending that data segment if the recipient already has an exact copy of it. This is called network deduplication, or network acceleration, because it increases the effective bandwidth of the network.
The conventional method for identifying duplicate data segments involves using a hash function, such as SHA-1, to compute an integer, called a “fingerprint”, from each data segment, where different data is extremely unlikely to produce the same fingerprint. When one node of a network wishes to send a data segment to another node, but only if the data segment is not already present on the other node, the sending node can first send the fingerprint, and the receiving node can inform the sending node whether or not it already has a data segment with that fingerprint. Only if the fingerprint is not found on the receiving node is the data segment sent.
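The fingerprint-first exchange described above can be sketched as follows. This is a simplified, in-process model rather than a real network protocol. The names `fingerprint`, `Receiver`, and `send_segment` are hypothetical, and the "network" is just a method call.

```python
import hashlib


def fingerprint(segment: bytes) -> int:
    """Compute a 160-bit integer fingerprint of a data segment using SHA-1."""
    return int.from_bytes(hashlib.sha1(segment).digest(), "big")


class Receiver:
    """Hypothetical receiving node that indexes stored segments by fingerprint."""

    def __init__(self):
        self.segments = {}  # fingerprint -> stored data segment

    def has(self, fp: int) -> bool:
        return fp in self.segments

    def store(self, segment: bytes) -> None:
        self.segments[fingerprint(segment)] = segment


def send_segment(receiver: Receiver, segment: bytes) -> int:
    """Send the fingerprint first; send the data only if it is not found.

    Returns the number of payload bytes actually transmitted.
    """
    fp = fingerprint(segment)
    if receiver.has(fp):
        # The receiver reports a match, so the segment itself is not sent.
        return 0
    receiver.store(segment)
    return len(segment)
```

Note that the sketch treats a fingerprint match as proof that the data is identical, which is exactly the assumption the conventional method relies on.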
There are two problems with the use of a hash value as a data fingerprint. First, while it is very unlikely, it is possible for two different data segments to produce the same hash value. If that occurs, data corruption can result. Further, the larger the amount of data managed by a given system in a given period of time, the greater the likelihood that two different data segments actually will produce the same hash value. In a very large-scale storage system, therefore, this very small likelihood can increase to an unacceptably high value.
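How the likelihood of a collision grows with the number of data segments can be estimated with the standard birthday-bound approximation. The sketch below assumes hash outputs are uniformly distributed over all possible values, which is an idealization. The function name `collision_probability` and the segment counts in the usage note are illustrative assumptions, not figures from any particular system.

```python
import math


def collision_probability(num_segments: int, hash_bits: int) -> float:
    """Approximate probability that at least two of num_segments distinct
    segments share a hash value, assuming uniformly distributed hashes.

    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / 2^(bits+1)).
    """
    exponent = num_segments * (num_segments - 1) / 2 ** (hash_bits + 1)
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-exponent)
```

Under this approximation, the probability grows roughly with the square of the number of segments: `collision_probability(2**50, 160)` is about a million times larger than `collision_probability(2**40, 160)`, although both remain very small in absolute terms for a 160-bit hash.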
Second, hash values generated by conventional hash algorithms can be quite lengthy, e.g., 160 bits or more (SHA-1 produces 160 bits). Consequently, computing and comparing hash values can be computationally intensive, consuming a significant amount of processor resources. Likewise, a significant amount of storage space can be required to store the hash values in a given storage server or node.
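The storage cost of keeping one hash value per block can be estimated with simple arithmetic. The sketch below counts only the raw hash values and ignores any index structure around them. The function name `fingerprint_index_bytes` and the example figures (1 PiB of data in 4 KiB blocks) are assumptions chosen for illustration.

```python
def fingerprint_index_bytes(data_bytes: int, block_size: int, hash_bits: int) -> int:
    """Bytes needed to hold one hash value per block of stored data."""
    num_blocks = -(-data_bytes // block_size)  # ceiling division
    hash_bytes = -(-hash_bits // 8)            # round up to whole bytes
    return num_blocks * hash_bytes


# Example under the stated assumptions: 1 PiB of data, 4 KiB blocks, 160-bit hashes.
index_size = fingerprint_index_bytes(2 ** 50, 4096, 160)
print(index_size / 2 ** 40, "TiB of raw hash values")
```

With these assumed figures, the hash values alone occupy 5 TiB, since 2^38 blocks each require 20 bytes.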