A network storage controller is a processing system that stores and retrieves data on behalf of one or more hosts on a network, managing that data in a set of mass storage devices, such as magnetic or optical storage-based disks or tapes. Some storage controllers are designed to service file-level requests from hosts, as is commonly the case with file servers used in a network attached storage (NAS) environment. Other storage controllers are designed to service block-level requests from hosts, as with storage controllers used in a storage area network (SAN) environment. Still other storage controllers are capable of servicing both file-level requests and block-level requests, as is the case with certain storage controllers made by NetApp, Inc. of Sunnyvale, Calif.
In a large-scale storage system, such as an enterprise storage network, it is common for certain items of data, such as certain data blocks, to be stored in multiple places in the storage system, sometimes as an incidental result of normal operation of the system and other times due to intentional copying of data. For example, duplication of data blocks may occur when two or more files have some data in common or where a given set of data occurs at multiple places within a given file. Duplication can also occur if the storage system backs up data by creating and maintaining multiple persistent point-in-time images, or “snapshots”, of stored data over a period of time. Data duplication generally is not desirable, since the storage of the same data in multiple places consumes extra storage space, which is a limited resource.
Consequently, in many large-scale storage systems, storage controllers have the ability to “deduplicate” data, which is the ability to identify and remove duplicate data blocks. In one known approach to deduplication, any extra (duplicate) copies of a given data block are deleted (or, more precisely, marked as free), and any references (e.g., pointers) to those duplicate blocks are modified to refer to the one remaining instance of that data block. A result of this process is that a given data block may end up being shared by two or more files (or other types of logical data containers).
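The block-sharing approach described above can be illustrated with a minimal sketch. It is not the implementation of any particular storage controller; the function name `deduplicate`, the in-memory dictionaries standing in for the block store and the per-file pointer lists, and the use of SHA-256 fingerprints to find candidate duplicates are all assumptions made for illustration.

```python
import hashlib


def deduplicate(blocks, files):
    """Share identical blocks between files (illustrative sketch only).

    blocks: dict mapping block id -> bytes (hypothetical block store)
    files:  dict mapping file name -> list of block ids (hypothetical pointers)
    Returns the set of block ids that can be marked as free.
    """
    canonical = {}  # fingerprint -> id of the one instance that is kept
    remap = {}      # duplicate block id -> id of the surviving instance

    for block_id in sorted(blocks):
        data = blocks[block_id]
        digest = hashlib.sha256(data).digest()
        keeper = canonical.get(digest)
        # Compare the bytes as well, so that a fingerprint collision can
        # never cause two different blocks to be merged.
        if keeper is not None and blocks[keeper] == data:
            remap[block_id] = keeper
        else:
            canonical.setdefault(digest, block_id)

    # Redirect every reference to a duplicate toward the remaining instance.
    for pointers in files.values():
        for i, block_id in enumerate(pointers):
            pointers[i] = remap.get(block_id, block_id)

    # The duplicates are not erased here, only reported as free.
    return set(remap)


blocks = {0: b"AAAA", 1: b"BBBB", 2: b"AAAA", 3: b"CCCC", 4: b"BBBB"}
files = {"f1": [0, 1], "f2": [2, 3, 4]}
freed = deduplicate(blocks, files)
```

In this example, blocks 2 and 4 are reported as free, and file "f2" ends up referring to blocks 0 and 1, which it now shares with file "f1".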
A problem with at least one known deduplication technique is that it does not scale well. That technique relies on a sorting algorithm to detect duplicate data blocks. Because the amount of memory available for sorting is bounded, the time required for sorting (and, therefore, the time required to detect and eliminate duplicate data blocks) increases disproportionately (e.g., geometrically) with the size of the data set to be deduplicated.
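One way that sorting with bounded memory can be used to detect duplicates is sketched below. This is a generic external-merge-sort illustration, not a reproduction of the known technique referred to above. The function name `find_duplicates_by_sorting` and the parameter `max_records_in_memory` are hypothetical, the sorted runs are held in Python lists where a real system would presumably write them to disk, and the byte-by-byte verification of matching fingerprints is omitted for brevity.

```python
import hashlib
import heapq
from itertools import groupby


def find_duplicates_by_sorting(blocks, max_records_in_memory):
    """Group block ids by fingerprint using bounded-size sorted runs.

    blocks: dict mapping block id -> bytes (hypothetical block store)
    Returns a dict mapping the lowest id of each duplicated block to the
    list of ids of its duplicates.
    """
    if max_records_in_memory < 1:
        raise ValueError("max_records_in_memory must be at least 1")

    # Phase 1: sort (fingerprint, block id) records in runs that each fit
    # within the memory bound.
    runs, run = [], []
    for block_id, data in blocks.items():
        run.append((hashlib.sha256(data).digest(), block_id))
        if len(run) == max_records_in_memory:
            runs.append(sorted(run))
            run = []
    if run:
        runs.append(sorted(run))

    # Phase 2: merge the sorted runs; equal fingerprints become adjacent.
    merged = heapq.merge(*runs)

    # Phase 3: one linear scan over the merged stream finds the duplicates.
    duplicates = {}
    for _, group in groupby(merged, key=lambda record: record[0]):
        ids = [block_id for _, block_id in group]
        if len(ids) > 1:
            duplicates[ids[0]] = ids[1:]
    return duplicates


blocks = {0: b"AAAA", 1: b"BBBB", 2: b"AAAA", 3: b"CCCC", 4: b"BBBB"}
dups = find_duplicates_by_sorting(blocks, max_records_in_memory=2)
```

With a fixed memory bound, the number of runs grows with the size of the data set, so more of the work shifts to the merge phase as the data set grows.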
As a result, for very large data sets, deduplication can be extremely time-consuming. Because deduplication consumes processor and memory resources, it can adversely impact the performance of the storage system for purposes of servicing normal user requests. Therefore, it is desirable to spend as little time on deduplication as possible.