Many enterprises include network storage systems, such as Network Attached Storage (NAS) and Storage Area Networks (SANs), which are connected to client computing systems, whereby clients can access data managed by the storage systems. From the user's (e.g., client's) point of view, the network storage system may include one or more storage objects (storage volumes), often referred to as logical or virtual volumes. Such network storage systems may store very large amounts of duplicate data, and therefore it may be desirable in some instances to perform deduplication in order to use available storage space more efficiently. To the extent that data can be deduplicated in a network storage system, the removal of the duplicate data may in some cases provide significant storage space savings, thereby potentially saving money.
Some conventional techniques for network storage implement file systems corresponding to respective virtual volumes that provide a hierarchical organization of lower-level storage containers (e.g., files) logically organized within a virtual volume and employ pointers to point to the underlying data, where the underlying data is arranged in data blocks. A given file may point to multiple blocks, and a block may be associated with multiple files. Furthermore, a given file may include data that is duplicated in another file. For instance, a storage volume may include multiple email inboxes, each inbox including a particular email attachment. In most scenarios it would be undesirable to store multiple copies of the email attachment because doing so would be wasteful of storage resources. Some conventional deduplication operations avoid saving multiple copies of a piece of data by keeping only a single copy of the data and replacing the duplicate copies with pointers to the single copy. Therefore, multiple files are associated with the same data, but duplicate copies of the data are avoided.
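The pointer-based sharing described above can be illustrated with a minimal sketch. The `BlockStore` class, its method names, and the inbox contents are hypothetical and are used only for illustration; they are not drawn from any particular file system. The sketch assumes that a file is simply a list of block identifiers and that a duplicate block has already been identified by some other means.

```python
class BlockStore:
    """Toy block store (hypothetical) in which files hold pointers, not data."""

    def __init__(self):
        self.blocks = {}    # block ID -> block contents (bytes)
        self.refcount = {}  # block ID -> number of file pointers to the block
        self.files = {}     # file name -> list of block IDs (the pointers)
        self._next_id = 0

    def write_file(self, name, chunks):
        """Store each chunk in a new block and record pointers for the file."""
        ids = []
        for data in chunks:
            block_id = self._next_id
            self._next_id += 1
            self.blocks[block_id] = data
            self.refcount[block_id] = 1
            ids.append(block_id)
        self.files[name] = ids

    def share_block(self, keep_id, dup_id):
        """Repoint every reference to dup_id at keep_id, then free dup_id."""
        if keep_id == dup_id:
            return
        if self.blocks[keep_id] != self.blocks[dup_id]:
            raise ValueError("blocks differ; refusing to share")
        for ids in self.files.values():
            for i, block_id in enumerate(ids):
                if block_id == dup_id:
                    ids[i] = keep_id
                    self.refcount[keep_id] += 1
        del self.blocks[dup_id]
        del self.refcount[dup_id]

    def read_file(self, name):
        """Follow the file's pointers and reassemble its data."""
        return b"".join(self.blocks[b] for b in self.files[name])


# Two inboxes that each contain the same attachment.
store = BlockStore()
store.write_file("inbox_a", [b"message A", b"ATTACHMENT"])
store.write_file("inbox_b", [b"message B", b"ATTACHMENT"])
store.share_block(keep_id=1, dup_id=3)  # block 3 duplicates block 1
```

After `share_block` runs in this sketch, both inboxes still read back their original contents, but only one copy of the attachment remains stored and its reference count is two.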
Deduplication operations may use a significant amount of processing resources. In one example, a conventional deduplication process begins on a volume that has not yet been deduplicated. The conventional deduplication process includes reading the data blocks from storage (usually a hard disk), creating fingerprints for each of the data blocks (e.g., a fingerprint can be a small piece of data indicative of the data in a block), and comparing the fingerprints to determine which of the blocks may be duplicates. Duplicate data is then replaced by pointers, as described above. Generally, however, this process may use a noticeable amount of processing power, which may manifest itself as reduced performance from the user's perspective. Furthermore, reading a large number of data blocks from disk may take a relatively long time.
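The read, fingerprint, and compare steps of such a process can be sketched as follows. This is an illustrative sketch only: the use of SHA-256 as the fingerprint, the function names, and the in-memory dictionary standing in for blocks read from disk are assumptions made here and are not specified above. Because a fingerprint is only indicative of a block's contents, the sketch also compares candidate blocks byte for byte before treating them as duplicates.

```python
import hashlib
from collections import defaultdict


def fingerprint(block: bytes) -> bytes:
    """Return a small value indicative of the block's contents (assumed: SHA-256)."""
    return hashlib.sha256(block).digest()


def find_duplicates(blocks):
    """Map each duplicate block ID to the ID of the copy that is kept.

    `blocks` is a dict of block ID -> bytes, standing in for blocks read
    from storage. Every block is fingerprinted, which is where much of
    the processing cost of this approach arises.
    """
    by_fp = defaultdict(list)
    for block_id in sorted(blocks):
        by_fp[fingerprint(blocks[block_id])].append(block_id)

    remap = {}
    for ids in by_fp.values():
        keep = ids[0]
        for other in ids[1:]:
            # Matching fingerprints only suggest a duplicate; verify the bytes.
            if blocks[other] == blocks[keep]:
                remap[other] = keep
    return remap


volume = {0: b"alpha", 1: b"beta", 2: b"alpha", 3: b"alpha"}
remap = find_duplicates(volume)
```

In this example, blocks 2 and 3 are reported as duplicates of block 0, and a later step could replace them with pointers to block 0.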
Another conventional deduplication process generates fingerprints of the data blocks as the data blocks are saved or are transferred from one volume to another (e.g., in a backup operation). In contrast to the example above, this approach may avoid reading an entire volume in a single operation solely to generate fingerprints. However, merely comparing the fingerprints to each other may still use a noticeable amount of processing resources.
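A minimal sketch of fingerprinting at write time is given below. The `InlineDedupVolume` class, the SHA-256 fingerprint, and the in-memory index are hypothetical choices made for illustration. The `lookups` counter is included only to make visible that a fingerprint comparison is still performed for every block written, even though no volume-wide read takes place.

```python
import hashlib


class InlineDedupVolume:
    """Toy volume (hypothetical) that fingerprints each block as it is written."""

    def __init__(self):
        self.blocks = []   # block ID is the list index
        self.index = {}    # fingerprint -> block ID of a stored copy
        self.lookups = 0   # number of fingerprint comparisons performed

    def write_block(self, data: bytes) -> int:
        """Store the block if it is new; return the block ID to point at."""
        fp = hashlib.sha256(data).digest()
        self.lookups += 1
        existing = self.index.get(fp)
        if existing is not None and self.blocks[existing] == data:
            return existing  # duplicate: reuse the stored copy
        self.blocks.append(data)
        block_id = len(self.blocks) - 1
        self.index.setdefault(fp, block_id)
        return block_id


vol = InlineDedupVolume()
pointers = [vol.write_block(d) for d in (b"x", b"y", b"x")]
```

Here the third write returns a pointer to the block stored by the first write, so only two blocks are kept, yet three fingerprint lookups were performed.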
In short, deduplication processes may result in a perceived lack of performance from the user's point of view because processing resources allocated to the deduplication processes are not available for the concurrent storage and retrieval operations that are more visible to the user. Assuming that a network storage system has a limited amount of processing resources to devote to the various operations that it performs, it would be desirable to perform deduplication efficiently so as to achieve the greatest amount of deduplication for the least amount of processing resources.