A network storage controller is a processing system that is used to store and retrieve data on behalf of one or more hosts on a network. A storage controller operates on behalf of one or more hosts to store and manage data in a set of mass storage devices, such as magnetic or optical storage-based disks or tapes. Some storage controllers are designed to service file-level requests from hosts, as is commonly the case with file servers used in a network attached storage (NAS) environment. Other storage controllers are designed to service block-level requests from hosts, as with storage controllers used in a storage area network (SAN) environment. Still other storage controllers are capable of servicing both file-level requests and block-level requests, as is the case with certain storage controllers made by NetApp, Inc. of Sunnyvale, Calif.
One common use of storage controllers is data mirroring. Mirroring is a technique for backing up data, where a given data set at a source is replicated exactly at a destination, which is often geographically remote from the source. The replica data set created at the destination is called a “mirror” of the original data set. Typically mirroring involves the use of at least two storage controllers, e.g., one at the source and another at the destination, which communicate with each other through a computer network or other type of data interconnect to create the mirror.
Mirroring can be done at a physical block level or at a logical block level. To understand the difference, consider that each data block in a given set of data, such as a file, can be represented by both a physical block, pointed to by a corresponding physical block pointer, and a logical block, pointed to by a corresponding logical block pointer. These two types of blocks are actually the same data block. However, the physical block pointer indicates the actual physical location of the data block on a storage medium, whereas the logical block pointer indicates the logical position of the data block within the data set (e.g., a file) relative to other data blocks.
When mirroring is done at the physical block level, the mirroring process creates a mirror that has the identical structure of physical block pointers as the original data set. When mirroring is done at the logical block level, the mirror has the identical structure of logical block pointers as the original data set but may (and typically does) have a different structure of physical block pointers than the original data set. These two different types of mirroring have different implications and consequences under certain conditions, as explained below.
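The distinction between the two types of mirroring can be illustrated with a minimal sketch. It assumes, purely for illustration, that a storage medium is modeled as a dictionary keyed by physical block number and that a file is a list whose index is the logical block position and whose value is a physical block pointer. The names `physical_mirror`, `logical_mirror`, and `read_file` are hypothetical and do not correspond to any particular storage controller.

```python
# Hypothetical model: disk maps physical block number -> block contents;
# a file is a list of physical block pointers indexed by logical position.

def physical_mirror(src_disk, src_file):
    """Replicate the file so that every block keeps its physical address."""
    dst_disk = {pbn: src_disk[pbn] for pbn in src_file}
    dst_file = list(src_file)  # identical physical block pointers
    return dst_disk, dst_file

def logical_mirror(src_disk, src_file, first_free=0):
    """Replicate the file in logical order; the destination chooses its
    own physical addresses (here, simply the next free block number)."""
    dst_disk, dst_file = {}, []
    next_free = first_free
    for pbn in src_file:  # walk the blocks in logical order
        dst_disk[next_free] = src_disk[pbn]
        dst_file.append(next_free)
        next_free += 1
    return dst_disk, dst_file

def read_file(disk, file_ptrs):
    """Return the file contents in logical block order."""
    return [disk[pbn] for pbn in file_ptrs]

src_disk = {17: b"A", 4: b"B", 92: b"C"}
src_file = [17, 4, 92]  # logical blocks 0, 1, 2

p_disk, p_file = physical_mirror(src_disk, src_file)
l_disk, l_file = logical_mirror(src_disk, src_file)
```

In this sketch both mirrors return the same contents when read in logical order, but only the physical mirror preserves the source's physical block pointers.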
Before considering this further, note that in a large-scale storage system, such as an enterprise storage network, it is common for large amounts of data, such as certain data blocks, to be duplicated and stored in multiple places in the storage system. Sometimes this duplication is intentional, but often it is an incidental result of normal operation of the system. As such, a given block of data can be part of two or more different files. Data duplication generally is not desirable from the standpoint that storage of the same data in multiple places consumes extra storage space, which is a limited resource.
Consequently, in many large-scale storage systems, storage controllers have the ability to “deduplicate” data, which is the ability to identify and remove duplicate data blocks. In one known approach to deduplication, any extra (duplicate) copies of a given data block are deleted (or, more precisely, marked as free), and any references (e.g., pointers) to those duplicate blocks are modified to refer to the one remaining instance of that data block. A result of this process is that a given data block may end up being shared by two or more files (or other types of logical data containers).
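The deduplication approach described above can be sketched as follows. This is an illustration under stated assumptions, not the method of any particular storage controller: blocks are assumed to be compared by a SHA-256 digest of their contents (a real system might also compare the bytes), and the dictionary-based disk and file model and the name `deduplicate` are hypothetical.

```python
import hashlib

def deduplicate(disk, files):
    """Keep one physical block per distinct content and redirect pointers.

    disk:  dict of physical block number -> bytes
    files: dict of file name -> list of physical block pointers
    Returns (new_disk, new_files, freed_block_numbers).
    """
    canonical = {}  # content digest -> physical block number that is kept
    remap = {}      # every physical block number -> surviving block number
    for pbn in sorted(disk):
        digest = hashlib.sha256(disk[pbn]).digest()
        remap[pbn] = canonical.setdefault(digest, pbn)

    # Redirect every reference to the one remaining instance of the block.
    new_files = {name: [remap[p] for p in ptrs] for name, ptrs in files.items()}
    # Duplicate copies are marked as free.
    freed = {p for p, keep in remap.items() if p != keep}
    new_disk = {p: data for p, data in disk.items() if p not in freed}
    return new_disk, new_files, freed

disk = {1: b"x", 2: b"y", 3: b"x"}   # blocks 1 and 3 hold the same data
files = {"f1": [1, 2], "f2": [3]}
new_disk, new_files, freed = deduplicate(disk, files)
```

After the call, both files in this example point at physical block 1, which is therefore shared between them.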
Deduplication is typically done at the physical block level, not at the logical block level. As a result, two different logical data blocks in two different files may correspond to (share) the same physical data block. The sharing of physical data blocks due to deduplication can cause inefficiencies, however, if deduplication is employed with logical mirroring.
In logical mirroring, a mirroring application at the source from time to time identifies logical data blocks that have been modified and sends those modified logical data blocks to the destination as part of a mirror update process. However, the mirroring application reads logical data blocks, not physical data blocks, and is therefore unaware of the effects of deduplication at the source. Consequently, two logical data blocks that have been modified will be sent by the mirroring application to the destination even if they correspond to (share) the same physical data block. The same data is therefore sent more than once over the connection from the source to the destination during a mirror update, which consumes bandwidth unnecessarily. Furthermore, if the destination does not also perform deduplication before committing the update to storage, the duplicate blocks will be written to storage media at the destination, resulting in unnecessary use of storage space at the destination. Moreover, while deduplication can be performed at the destination, doing so undesirably consumes processing resources.
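The redundant transmission can be made concrete with a small sketch. It assumes a hypothetical source in which `disk` maps physical block numbers to contents, `files` maps file names to lists of physical block pointers, and `modified` lists the (file name, logical block index) pairs changed since the last update. The function name `logical_update` is illustrative only.

```python
def logical_update(disk, files, modified):
    """Build a mirror update by reading each modified logical block.

    The update is assembled purely from logical positions, so the fact
    that two logical blocks share one physical block is never consulted.
    """
    return [(name, idx, disk[files[name][idx]]) for name, idx in modified]

# After deduplication, a.txt block 0 and b.txt block 0 share physical block 10.
disk = {10: b"shared", 11: b"unique"}
files = {"a.txt": [10, 11], "b.txt": [10]}
modified = [("a.txt", 0), ("a.txt", 1), ("b.txt", 0)]

update = logical_update(disk, files, modified)
payloads_sent = len(update)
distinct_physical = len({files[name][idx] for name, idx in modified})
```

In this example three payloads are placed on the connection although only two distinct physical blocks were involved, so the shared block is read and transmitted twice.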
One known approach to this problem in logical mirroring is to place a self-contained device between the source and the destination to perform deduplication. This device identifies any duplicate data blocks that are being sent over the connection from the source to the destination and essentially filters them out, so that they do not reach the destination. One drawback of this approach, however, is that while duplicate data blocks do not reach the destination, they are still read from storage and transmitted by the source onto the connection between the source and the destination. This is because the logical mirroring application at the source still reads only logical data blocks; consequently, any physical data blocks that are shared by two or more logical data blocks will still be read by the mirroring application and transmitted onto the connection to the destination. This results in unnecessary read activity at the source, which consumes processing resources and can reduce performance.
Also, in the above-mentioned approach the self-contained device has to analyze the content of each and every data block that the source sends over the connection to the destination in order to identify the duplicate blocks. The device cannot leverage any duplication information that may already be present at the source; consequently, it consumes additional CPU time and power to rediscover that information.
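The cost of such an intermediate device can be sketched as follows. The sketch assumes, for illustration only, that the device detects duplicates by computing a SHA-256 digest of every payload and that it forwards a digest-only reference in place of a repeated payload; the source text does not specify either detail, and the name `inline_filter` is hypothetical.

```python
import hashlib

def inline_filter(stream):
    """Filter duplicate payloads out of a mirror update stream.

    stream: iterable of (file name, logical block index, payload bytes)
    Returns (forwarded, blocks_hashed). Each forwarded item is
    (name, index, payload or None, digest); None marks a filtered duplicate.
    """
    seen = set()
    forwarded = []
    blocks_hashed = 0
    for name, idx, payload in stream:
        # Every block must be inspected, because the device has no access
        # to the sharing information already held at the source.
        digest = hashlib.sha256(payload).digest()
        blocks_hashed += 1
        if digest in seen:
            forwarded.append((name, idx, None, digest))
        else:
            seen.add(digest)
            forwarded.append((name, idx, payload, digest))
    return forwarded, blocks_hashed

# The source has already read and transmitted all three payloads.
stream = [("a.txt", 0, b"shared"), ("a.txt", 1, b"unique"), ("b.txt", 0, b"shared")]
forwarded, blocks_hashed = inline_filter(stream)
payloads_forwarded = sum(1 for item in forwarded if item[2] is not None)
```

Here the device forwards only two payloads, but it still hashes all three blocks, and the source still reads and transmits all three.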