Various forms of network-based storage systems exist today. These forms include network attached storage (NAS), storage area networks (SANs), and others. Network-based storage systems are commonly used for a variety of purposes, such as providing multiple users with access to shared data, backing up critical data (e.g., by data mirroring), etc.
A network-based storage system typically includes at least one storage server, which is a processing system configured to store and retrieve data on behalf of one or more client processing systems (clients). In the context of NAS, a storage server may be a file server, which is sometimes called a “filer”. A filer operates on behalf of one or more clients to store and manage shared files. The files may be stored in a storage system that includes one or more arrays of mass storage devices, such as magnetic or optical disks or tapes, by using a data storage scheme such as Redundant Array of Inexpensive Disks (RAID). Additionally, the mass storage devices in each array may be organized into one or more separate RAID groups. In a SAN context, a storage server provides clients with block-level access to stored data, rather than file-level access. Some storage servers are capable of providing clients with both file-level access and block-level access, such as certain storage servers made by NetApp, Inc. (NetApp®) of Sunnyvale, Calif.
Storage servers may implement a deduplication algorithm. Deduplication eliminates redundant copies of data stored within the data storage. Deduplication is accomplished in several ways, including hierarchical deduplication, in-line deduplication, and background deduplication. Hierarchical deduplication includes deriving one file from another, usually by one file starting off as a copy of another, but zero or nearly zero bytes of data are actually copied or moved. Instead, the two files share common blocks of data storage. An example is a snapshot, where a snapshot is made of a file system, such that the snapshot and the active file system are equal at the time the snapshot is taken and share the same data storage, and thus are effectively copies that involve zero or near zero movement of data. As the source file system changes, the number of shared blocks of data storage decreases. A variation of this is a writable snapshot (also referred to as a clone), which is taken of a file system. In this variation, as the source and cloned file systems each change, there are fewer shared blocks. In-line deduplication includes a storage access protocol initiator (e.g., an NFS client) creating content via write operations, while the target of the storage access protocol checks whether the content being written is duplicated somewhere else on the target's storage. If so, the data is not written. Instead, the logical content (e.g., metadata, pointer, etc.) refers to the duplicate. Background deduplication includes a background task (e.g., on a storage access protocol target) scanning for duplicate blocks, freeing all but one of the duplicates, and mapping corresponding pointers (or other logical content) from the now free blocks to the remaining duplicate.
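As an illustration only, the in-line deduplication check described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the method of any particular storage server: the class name DedupStore, its methods, and the use of a SHA-256 fingerprint to detect duplicate content are all hypothetical choices made for the example. A production system would typically also handle fingerprint collisions and free blocks that are no longer referenced.

```python
import hashlib


class DedupStore:
    """Toy block store that deduplicates on write (hypothetical example)."""

    def __init__(self):
        self.blocks = {}    # fingerprint -> block data (one stored copy each)
        self.pointers = {}  # logical address -> fingerprint

    def write(self, address, data):
        """Record data at address; return True only if a new block was stored."""
        # Assumption: a SHA-256 digest is used to recognize duplicate content.
        fingerprint = hashlib.sha256(data).hexdigest()
        is_new = fingerprint not in self.blocks
        if is_new:
            self.blocks[fingerprint] = data
        # Either way, the logical address refers to the single stored copy.
        self.pointers[address] = fingerprint
        return is_new

    def read(self, address):
        """Follow the pointer for address to the shared block."""
        return self.blocks[self.pointers[address]]


store = DedupStore()
first = store.write(0, b"same payload")   # content is new, so it is stored
second = store.write(1, b"same payload")  # duplicate, so only a pointer is added
```

In this sketch the second write stores no data; both logical addresses resolve to the same block, which mirrors the statement that the logical content refers to the duplicate.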
Additionally, clients may implement a hypervisor software layer. A hypervisor software layer, also referred to as a virtual machine monitor, allows the client processing system to run multiple virtual machines (e.g., different operating systems, different instances of the same operating system, or other software implementations that appear as “different machines” within a single computer). Deduplication, in its various forms, is of particular interest when a client implements a hypervisor software layer because multiple virtual machines often use the same data (e.g., to run the same program) and the hypervisor software layer allows the virtual machines to utilize a single copy of the common page, file, or other unit of data. As a result, deduplication is able to reduce required storage capacity because, for the most part, only the unique data is stored. For example, a system containing 100 virtual machines might contain 100 instances of the same one megabyte (MB) file. If all 100 instances are saved, approximately 100 MB of storage space is used. With data deduplication, only one instance of the file is actually stored and each subsequent instance is just referenced back to the one saved copy. In this example, a 100 MB storage demand could be reduced to only 1 MB (for the data). Indexing of the data, however, is still retained. For example, a smaller amount of memory (when compared to storing multiple copies of the data) is used to store metadata for each instance.
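The arithmetic of the 100 virtual machine example can be worked through in a short Python sketch. The 64 bytes of metadata per instance is a hypothetical figure chosen only to show that indexing overhead remains after deduplication; the source does not specify a metadata size. Treating 1 MB as 1024 * 1024 bytes is likewise an assumption.

```python
MB = 1024 * 1024            # assumption: 1 MB taken as 1024 * 1024 bytes
instances = 100             # one copy of the file per virtual machine
file_size = 1 * MB
metadata_per_instance = 64  # hypothetical per-instance index entry, in bytes

# Without deduplication, every instance is stored in full.
without_dedup = instances * file_size

# With deduplication, one copy of the data is stored, plus metadata
# that lets each instance refer back to that copy.
with_dedup = file_size + instances * metadata_per_instance

savings_ratio = without_dedup / with_dedup
print(f"without dedup: {without_dedup} bytes")
print(f"with dedup:    {with_dedup} bytes")
print(f"reduction:     about {savings_ratio:.1f}x")
```

Under these assumptions the stored data shrinks from 100 MB to 1 MB, and the retained metadata adds only a few kilobytes on top of the single copy.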
Accordingly, in a network-based storage system, data that is managed by a storage server and shared by multiple clients (multiple client machines and/or virtual machines within one or more client machines) may benefit from deduplication. Due to the large amount of data managed and stored by a storage server, clients may be unaware of data redundancies within the storage system that may have been eliminated by deduplication. As a result, a client may send an input/output (I/O) request to the server to retrieve a page at a particular virtual address that contains data that is a duplicate of data already sent to and stored within the client. The client may not be aware that the requested page has been deduplicated by the server or that the client may be currently storing a redundant copy of the data because the redundant/deduplicated data is associated with a different virtual address. In response to such an I/O request, the server sends the redundant/deduplicated data to the client, consuming communication channel resources such as available bandwidth between the server and clients. The client may then store (e.g., in a cache) a copy of the redundant/deduplicated data, consuming storage resources within the client.
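The redundant transfer described above can be modeled with a small Python sketch. Everything here is hypothetical and simplified for illustration: the 4 KiB page size, the block identifier "blk-7", the two virtual addresses, and the dictionaries standing in for the server's block map and the client's cache. The sketch only counts bytes to show the effect; it does not model a real storage access protocol.

```python
PAGE = b"x" * 4096  # hypothetical 4 KiB page of data

# Server side: two virtual addresses share one deduplicated block.
server_blocks = {"blk-7": PAGE}
server_map = {0x1000: "blk-7", 0x2000: "blk-7"}

# Client side: a cache keyed by virtual address, unaware of deduplication.
client_cache = {}
stats = {"bytes_on_wire": 0}


def client_read(virtual_address):
    """Fetch a page by virtual address, caching it under that address."""
    if virtual_address not in client_cache:
        data = server_blocks[server_map[virtual_address]]
        stats["bytes_on_wire"] += len(data)   # bandwidth consumed by the reply
        client_cache[virtual_address] = data  # modeled as a separate cached copy
    return client_cache[virtual_address]


client_read(0x1000)
client_read(0x2000)  # same content, different virtual address

cached_bytes = sum(len(page) for page in client_cache.values())
```

Because the cache is keyed only by virtual address, the second request misses even though identical data is already held, so the page crosses the channel twice and occupies two cache entries.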