A network storage controller is a processing system that is used to store and retrieve data on behalf of one or more hosts on a network. A storage server is a storage controller that operates on behalf of one or more clients to store and manage data in a set of mass storage devices, such as magnetic or optical storage-based disks or tapes. Some storage servers are designed to service file-level requests from clients, as is commonly the case with file servers used in a network attached storage (NAS) environment. Other storage servers are designed to service block-level requests from clients, as with storage servers used in a storage area network (SAN) environment or virtual tape library (VTL) environment. Still other storage servers are capable of servicing both file-level requests and block-level requests, as is the case with certain storage servers made by NetApp, Inc. of Sunnyvale, Calif.
In a large-scale storage system, such as an enterprise storage network, it is common for certain items of data, such as certain data blocks, to be stored in multiple places in the storage system, sometimes as an incidental result of normal operation of the system and other times due to intentional copying of data. For example, duplication of data blocks may occur when two or more logical containers of data (e.g., files, virtual tapes, etc.) have some data in common or when a given set of data occurs at multiple places within a given logical container of data. As an illustration, consider a scenario where a user creates a Microsoft PowerPoint® document and stores the document at a first location of the storage system. The user sends the document to another user, who then makes minor edits to one small section of the document. The second user then stores the edited document in a different location of the storage system. Here, there are two data objects (i.e., the two PowerPoint® documents) that operate as logical containers of data. Each data object logically references a set of physical data blocks that stores the data corresponding to the respective document. Since the two documents are substantially similar, there will be a substantial number of identical data blocks (i.e., duplicate data blocks) referenced by the two data objects. Duplication can also occur if the storage system backs up data by creating and maintaining multiple persistent point-in-time images, or “snapshots,” of stored data over a period of time. Data duplication generally is not desirable, since the storage of the same data in multiple places consumes extra storage space, which is a limited resource.
Consequently, in many large-scale storage systems, storage controllers have the ability to “deduplicate” data, which is the ability to identify and remove duplicate data blocks. In one known approach to deduplication, data blocks referenced by two data objects (i.e., an original data object and a new data object) are compared and duplicate data blocks referenced by one of the data objects (e.g., the new data object) are deleted. Any references (e.g., pointers) of the new data object to those duplicate (deleted) data blocks are modified to refer to the remaining instances of those data blocks (i.e., the data blocks referenced by the original data object). A result of this process is that a given data block may end up being shared by the two data objects.
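The known approach described above can be illustrated with a minimal Python sketch. It is not taken from any particular storage controller: the function name `deduplicate`, the dictionary-based block store, and the use of SHA-256 fingerprints to find candidate matches are all assumptions made for illustration. The description above states only that data blocks are compared, that duplicates are deleted, and that references are re-pointed to the remaining instances.

```python
import hashlib


def deduplicate(store, original, new):
    """Share identical blocks between two data objects (illustrative only).

    store:    dict mapping block id -> bytes (hypothetical physical blocks)
    original: list of block ids referenced by the original data object
    new:      list of block ids referenced by the new data object
    Returns the number of duplicate blocks deleted from the store.
    """
    # Index the original object's blocks by content fingerprint (assumed SHA-256).
    by_digest = {}
    for block_id in original:
        digest = hashlib.sha256(store[block_id]).digest()
        by_digest.setdefault(digest, block_id)

    deleted = 0
    for i, block_id in enumerate(new):
        digest = hashlib.sha256(store[block_id]).digest()
        keeper = by_digest.get(digest)
        # Confirm a fingerprint match byte for byte before sharing the block.
        if keeper is not None and keeper != block_id and store[keeper] == store[block_id]:
            new[i] = keeper  # re-point the reference to the remaining instance
            # Delete the duplicate only once nothing references it any more.
            if block_id not in new and block_id not in original:
                del store[block_id]
                deleted += 1
    return deleted


# Blocks 1 and 3 hold identical data, so block 3 is deleted and shared via block 1.
store = {1: b"AAAA", 2: b"BBBB", 3: b"AAAA", 4: b"CCCC"}
original, new = [1, 2], [3, 4]
deleted = deduplicate(store, original, new)
print(deleted, new, sorted(store))  # 1 [1, 4] [1, 2, 4]
```

After the call, block 1 is referenced by both data objects, which is the sharing outcome noted at the end of the preceding paragraph.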
Further, in known deduplication approaches, the new data object (i.e., the data object referencing deduplicated data blocks) stores a one-to-one mapping of each deleted data block to a corresponding data block referenced by the original data object. Such one-to-one mapping data is generated and stored for each individual deduplicated data block. Typically, such mapping information is stored in metadata associated with the data object.
Now consider the above example of the PowerPoint® documents. Each time a user creates a new copy of the document (i.e., a new data object) to make minor edits, new data blocks, many of which are identical to data blocks of the original document, are created and stored within the storage server. Accordingly, a substantial number of data blocks referenced by these new copies (i.e., new data objects) will be identical to data blocks of the original copy of the document. Even after the duplicate data blocks referenced by the new data object are deleted, each new data object will still hold a substantial amount of “mapping metadata” to store mapping information for each deduplicated data block, consequently increasing the storage footprint of the corresponding data objects.
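The growth of per-block mapping metadata can be made concrete with a short Python sketch. Everything in it is hypothetical: the `DataObject` class, the `record_mappings` helper, the block counts, and the assumed size of 16 bytes per mapping entry are chosen only to show that the overhead scales with the number of deduplicated blocks.

```python
from dataclasses import dataclass, field

ENTRY_BYTES = 16  # assumption: two 8-byte block ids per mapping entry


@dataclass
class DataObject:
    name: str
    block_refs: list                             # block ids the object references
    mapping: dict = field(default_factory=dict)  # deleted block id -> remaining block id


def record_mappings(obj, redirects):
    """Store one mapping entry per deduplicated block; return the metadata size in bytes."""
    for deleted_id, remaining_id in redirects:
        obj.mapping[deleted_id] = remaining_id
    return len(obj.mapping) * ENTRY_BYTES


# A copy of 10,000 blocks, 9,990 of which duplicate blocks of the original.
copy = DataObject("copy", list(range(10_000, 20_000)))
redirects = [(10_000 + i, i) for i in range(9_990)]
overhead = record_mappings(copy, redirects)
print(overhead)  # 159840
```

Under these assumptions, a nearly identical copy carries one mapping entry for almost every block it references, so each additional copy adds a comparable amount of metadata.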
Additionally, in at least one known deduplication technique, duplicate data identification is performed only at the level of the physical data blocks. In other words, even when two data objects (e.g., files) are exactly or substantially identical (i.e., all or a substantial number of data blocks referenced by the data objects are duplicates), known deduplication techniques still compare each underlying data block referenced by the two data objects to identify duplicate data blocks. Such deduplication techniques do not have mechanisms for recognizing duplicate data at the object level (without comparing the underlying physical data blocks). Identification of duplicate data at the data block level, especially between identical or substantially similar data objects, unnecessarily burdens computational resources of the storage system.