A network storage server is a processing system that is used to store and retrieve data on behalf of one or more hosts on a network. A storage server operates on behalf of one or more hosts to store and manage data in a set of mass storage devices, such as magnetic or optical storage-based disks or tapes. Some storage servers are designed to service file-level requests from hosts, as is commonly the case with file servers used in a network attached storage (NAS) environment. Other storage servers are designed to service block-level requests from hosts, as with storage servers used in a storage area network (SAN) environment. Still other storage servers are capable of servicing both file-level requests and block-level requests, as is the case with certain storage servers made by NetApp, Inc. of Sunnyvale, Calif.
One common use of storage servers is data mirroring. Mirroring is a technique for backing up data, where a given data set at a source is replicated exactly at a destination, which is often geographically remote from the source. The replica data set created at the destination is called a “mirror” of the original data set. Note that in a large-scale storage system, such as an enterprise storage network, it is common for large amounts of data, such as certain data blocks, to be duplicated and stored in multiple places in the storage system. Sometimes this duplication is intentional, as in the case of mirroring, but often it is an incidental result of normal operation of the system. Such incidental data duplication is generally undesirable because storing the same data in multiple places consumes extra storage space, which is a limited resource.
One form of long-term archival storage is the storage of data on magnetic tape media. Noted disadvantages of magnetic tape media are the slow data access rate and the added requirements for managing a large number of physical tapes. In response to these disadvantages, several storage system vendors provide virtual tape library (VTL) systems, in the form of network storage servers, that emulate tape storage devices using, for example, disk drives. In typical VTL environments, the storage system performs a complete backup operation, i.e., a mirror, of the storage system's file system (or other data store) to the VTL system. Multiple complete backups of such a data set may occur over time, thereby resulting in undesirable duplicate data and inefficient utilization of storage space on the VTL system.
Consequently, in many large-scale storage systems, storage servers have the ability to “deduplicate” data, which is the ability to identify and remove duplicate data in a data set. Many deduplication techniques involve identifying anchors within the data set to be deduplicated. As used herein, an “anchor” is a location within a data set in a region of interest for potential data deduplication. Some techniques utilize a rolling hash to identify anchors within the data set. Typically, such techniques are computationally expensive and thus contribute latency to the deduplication process. Latency in the deduplication process has negative consequences including, for example, difficulty in performing, or an outright inability to perform, deduplication on particularly large data sets.
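As an illustration of how a rolling hash can be used to locate anchors, the following is a minimal Python sketch under stated assumptions, not the technique of any particular storage server. The function name find_anchors, the window size, the polynomial base and modulus, and the mask-based anchor condition are all hypothetical choices made for this example.

```python
from typing import List

BASE = 257              # assumed polynomial base (illustrative)
MOD = (1 << 31) - 1     # assumed modulus (illustrative)


def find_anchors(data: bytes, window: int = 16, mask: int = 0x3F) -> List[int]:
    """Return the end offsets of every window whose hash satisfies the
    (assumed) anchor condition: hash & mask == 0."""
    if len(data) < window:
        return []

    # BASE^(window-1) mod MOD, used to remove the byte leaving the window.
    top = pow(BASE, window - 1, MOD)

    # Hash of the first window, computed directly.
    h = 0
    for b in data[:window]:
        h = (h * BASE + b) % MOD

    anchors = []
    if h & mask == 0:
        anchors.append(window)

    # Slide the window one byte at a time, updating the hash in O(1):
    # drop the outgoing byte, shift, then add the incoming byte.
    for i in range(window, len(data)):
        h = (h - data[i - window] * top) % MOD
        h = (h * BASE + data[i]) % MOD
        if h & mask == 0:
            anchors.append(i + 1)

    return anchors
```

Because each hash value depends only on the bytes inside the current window, inserting data ahead of a region shifts that region's anchors by the length of the insertion rather than changing which content is selected. Even with the constant-time update, the hash is evaluated once per input byte, which is consistent with the computational cost noted above.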
These negative consequences arise when, for example, a data set is being received over a network by a storage server. In one case, the data set is too large to fit on the storage server prior to deduplication, but small enough to fit on the storage server after deduplication. In such a case, the storage server cannot store the entire data set locally for the purpose of deduplicating it. Instead, the storage server must deduplicate the data set “live,” i.e., during ingest of the data set at a rate determined by the network's bandwidth, so that only the non-duplicate portion of the data set is actually stored at the storage server. However, as stated above, deduplication techniques are computationally expensive. Thus, the storage server typically lacks the computational resources to deduplicate the data set as fast as it arrives over the network. In the worst case, this leads to a failed backup operation. One possible way to deal with this negative consequence is to reduce the sender's data rate, but this leads to network underutilization and an increase in the total amount of time required for backing up the data set.
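The trade-off described above can be made concrete with a small calculation. The following Python sketch is illustrative only: the function ingest_plan, its parameters, and the numbers in the usage example are hypothetical and do not come from any measured system. It assumes the sender is throttled to the slower of the network rate and the deduplication rate.

```python
def ingest_plan(data_size, capacity, dedup_ratio, network_rate, dedup_rate):
    """Summarize the live-deduplication scenario.

    data_size, capacity: bytes
    dedup_ratio: raw size divided by stored size after deduplication
    network_rate, dedup_rate: bytes per second
    """
    stored_after = data_size / dedup_ratio

    # Live deduplication is required when the raw data set does not fit
    # on the server but its deduplicated form does.
    must_dedup_live = data_size > capacity and stored_after <= capacity

    # Assumption: the sender is slowed to whatever rate the server can
    # actually deduplicate.
    effective_rate = min(network_rate, dedup_rate)

    return {
        "must_dedup_live": must_dedup_live,
        "sender_throttled": dedup_rate < network_rate,
        "network_utilization": effective_rate / network_rate,
        "transfer_seconds": data_size / effective_rate,
    }


# Hypothetical figures: a 10 TB data set, 4 TB of free capacity, a 5:1
# deduplication ratio, a 1 GB/s network, and 250 MB/s of deduplication
# throughput.
plan = ingest_plan(
    data_size=10e12,
    capacity=4e12,
    dedup_ratio=5,
    network_rate=1e9,
    dedup_rate=250e6,
)
```

With these assumed figures, the sender must be throttled to one quarter of the network rate, so the transfer takes four times as long as it would at full network speed. This is the underutilization and added backup time described above.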