A network storage system is a processing system that is used to store and retrieve data on behalf of one or more hosts on a network. A storage system operates on behalf of one or more hosts to store and manage data in a set of mass storage devices, such as magnetic or optical storage-based disks or tapes. Some storage systems are designed to service file-level requests from hosts, as is commonly the case with file servers used in a network attached storage (NAS) environment. Other storage systems are designed to service block-level requests from hosts, as with storage systems used in a storage area network (SAN) environment. Still other storage systems are capable of servicing both file-level requests and block-level requests, as is the case with certain storage servers made by NetApp, Inc. of Sunnyvale, Calif.
One common use of storage systems is data replication. Data replication is a technique for backing up data, where a given data set at a source is replicated at a destination, which is often geographically remote from the source. The replica data set created at the destination is called a “mirror” of the original data set. Typically replication involves the use of at least two storage systems, e.g., one at the source and another at the destination, which communicate with each other through a computer network or other type of data interconnect.
Replication of data can be done at a physical block level or at a logical block level. To understand the difference, consider that each data block in a given set of data, such as a file, can be represented by both a physical block, pointed to by a corresponding physical block pointer, and a logical block, pointed to by a corresponding logical block pointer. These two blocks are actually the same data block. However, the physical block pointer indicates the actual physical location of the data block on a storage medium, whereas the logical block pointer indicates the logical position of the data block within the data set (e.g., a file) relative to other data blocks. When replication is performed at the physical block level, the replication process creates a replica at the destination storage system that has the identical structure of physical block pointers as the original data set at the source storage system. When replication is done at the logical block level, the replica at the destination storage system has the identical structure of logical block pointers as the original data set at the source storage system, but may (and typically does) have a different structure of physical block pointers than the original data set at the source storage system.
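The distinction between the two replication levels can be illustrated with a minimal sketch. The `Volume` model, the function names, and the block numbers below are hypothetical and chosen only for illustration; they do not describe any particular storage system.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    disk: dict      # physical block number -> data (hypothetical model)
    file_map: list  # logical block index -> physical block number


def read_file(vol: Volume) -> list:
    """Read the file's data in logical order."""
    return [vol.disk[pbn] for pbn in vol.file_map]


def replicate_physical(src: Volume) -> Volume:
    """Physical-level replica: every block keeps its physical block number."""
    return Volume(dict(src.disk), list(src.file_map))


def replicate_logical(src: Volume, dst_free_blocks: list) -> Volume:
    """Logical-level replica: logical order is preserved, but the
    destination places each block wherever it has free space."""
    free = iter(dst_free_blocks)
    disk, file_map = {}, []
    for pbn in src.file_map:
        new_pbn = next(free)           # destination chooses the location
        disk[new_pbn] = src.disk[pbn]
        file_map.append(new_pbn)
    return Volume(disk, file_map)


src = Volume({7: "a", 3: "b", 9: "c"}, [7, 3, 9])
phys = replicate_physical(src)
logi = replicate_logical(src, [100, 101, 102])
```

In this sketch both replicas read back the same data in the same logical order, but only the physical-level replica reuses the source's physical block numbers.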
Conventional replication systems have various limitations. Replication at the physical block level has the limitation that it requires that the destination storage system have the identical disk topology (or disk geometry) as the source storage system. For example, some (not all) differences in the Redundant Array of Inexpensive Disks (RAID) configurations between a source storage system and a destination storage system would prevent replication between them at the physical block level. Replication at the logical block level overcomes this limitation, but still requires that the destination storage system have the identical format for directories and other meta-data as the source storage system. Conventional systems that perform replication at the logical entry level have limitations of their own. Typically, the file system of the source storage system is analyzed to determine changes that have occurred to the file system, and then those changes are transferred to the destination storage system in a particular order. This typically includes “walking” the directory trees at the source storage system to determine the changes to various file system objects within each directory tree, as well as identifying each changed file system object's location within the directory tree structure. The changes are then sent to the destination storage system in a certain order (e.g., directories before subdirectories, and subdirectories before files, etc.) so that the directory tree structure of the source storage system is preserved at the destination storage system. Updates to directories of the source file system must be received and processed at the destination storage system before updates to the files in each of those directories can be received and processed.
If updates to data in files are received before the updates to the directories in which the files are stored, then the files are essentially orphaned, because the destination server lacks the information needed to determine in which directory the file updates are to be stored. That is, updates to the data in a file cannot be processed before the directory referencing the file exists on the destination storage system.
The source storage system first performs a search through all the directories in the source storage system to figure out which directories have been updated, and then performs a second search within each directory to figure out which files have been updated in those directories. Moreover, additional searches are performed for file systems that have nested or hierarchical directory structures, such that higher-level directories are searched before lower-level directories (e.g., subdirectories), and so on. This analysis requires the source storage system to walk its way down from the top to the bottom of each of the directory trees of the source storage system before any updates to the file system in the source storage system can be transferred to the destination storage system. Then, the updates are transferred to the destination storage system in order so that the destination storage system can properly process the updates to generate the replica file system in the destination storage system. This can take a significant amount of time for large file systems and can impact performance in replication operations at the logical entry level.
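The ordering constraint described above can be sketched as a top-down (pre-order) walk. This is an illustrative sketch only: the nested-dictionary tree, the `changed` set, and the function name `ordered_updates` are assumptions, not part of any conventional system's interface.

```python
def ordered_updates(tree: dict, changed: set, path: str = ""):
    """Yield changed paths so that every directory is emitted before
    anything it contains. Note that the walk visits every node in the
    tree, even when only a few objects have changed."""
    for name, child in sorted(tree.items()):
        full = f"{path}/{name}"
        if full in changed:
            yield full
        if isinstance(child, dict):  # a dict models a directory
            yield from ordered_updates(child, changed, full)


# Hypothetical directory tree; None models a regular file.
tree = {"a": {"b": {"f.txt": None}, "g.txt": None}, "h.txt": None}
changed = {"/a/b/f.txt", "/a", "/a/b"}
updates = list(ordered_updates(tree, changed))
```

Because a parent is always yielded before its children, the destination never receives a file update for a directory it has not yet created. The cost is that the whole tree must be traversed before the ordered list is complete.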
Known technology in the area of file system replication includes the Andrew File System (AFS), which provided for the creation of replicas of a volume of data based on a point-in-time copy of the source volume called a “clone,” and also provided for incrementally updating the target replica volume by identifying changes between two clones of a particular volume, and applying those changes to a corresponding clone of the target replica. Clones were created by copying entire inode files describing the file to the replica and incrementing a reference count of a block addressing tree associated with each file. The reference count indicated that the block addressing tree was referenced from an additional file system. In the AFS system, a file system was transferred incrementally or in full. A file system was transferred incrementally by selecting files modified since the previous replication operation, and a file system was transferred in full by selecting all files in a volume in the order in which they appeared in the inode files. Entire files and directories were transferred between servers, because the clone granularity was at the level of entire files and no block sharing occurred within a file's block addressing tree. The directory contents were transmitted in a logical format containing integers in a standard byte ordering. In addition, AFS replication could create and manage a target replica with a different type of file system than the source file system.
Another known technology in this area is the DCE/DFS file system, called “Episode,” which extended the work done in AFS by adding support for block-level replication. The Episode file system created what are called “snapshots,” which are well-known in storage systems and used for, among other things, storage management and facilitating replication operations. A snapshot is a persistent image (usually read-only) of a file system or other data container at a point in time. The Episode file system created snapshots by copying an entire inode file for a volume of data to the target replica and setting a bit on each top-level pointer of each inode in the inode file indicating that all of the data under this block pointer (associated with either direct or indirect blocks) should be copied before being modified by further write data (that is, should be treated as “copy on write” data).
All updates to indirect blocks and data blocks were made by writing the new data to previously free, newly allocated disk blocks. When generating differences between two snapshots, Episode replication determined differences by iterating over the inodes in the two file systems using an efficient ordering rather than requiring the processing of directories before processing their child files, sub-directories, etc. For each file that had the same generation number in both snapshots (indicating that the file was not deleted between the two snapshots being taken), the replication engine compared the corresponding pointers in the file's block addressing trees in the two snapshots. If the pointers to a data block differed, then that data block was required to be included in the replication propagation. If two pointers were identical, whether direct or indirect, then the replication engine knew that no data anywhere in that block addressing sub-tree had changed between the two replicas, and that no data from that sub-tree needed to be copied.
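The pointer comparison described above can be sketched as a recursive walk over two block addressing trees. This is a simplified model, not Episode's actual implementation: the `Block` type, the function name `changed_data_blocks`, and the block addresses are hypothetical, and the generation-number check on each file is assumed to have already passed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    ptr: int              # disk address of this block (hypothetical model)
    children: tuple = ()  # empty for a data block, non-empty for an indirect block


def changed_data_blocks(old, new: Block) -> list:
    """Return the addresses of data blocks in `new` that must be sent.

    Relies on copy-on-write: a modified block is always written to a new
    address, so an identical pointer implies an unchanged sub-tree."""
    if old is not None and old.ptr == new.ptr:
        return []             # same pointer: skip the whole sub-tree
    if not new.children:
        return [new.ptr]      # a new or changed data block
    old_children = old.children if old is not None else ()
    out = []
    for i, child in enumerate(new.children):
        old_child = old_children[i] if i < len(old_children) else None
        out.extend(changed_data_blocks(old_child, child))
    return out


# Two snapshots of one file; only the last data block was rewritten (4 -> 5),
# which also forced new addresses for its indirect block and the root.
old = Block(100, (Block(10, (Block(1), Block(2))), Block(11, (Block(3), Block(4)))))
new = Block(101, (Block(10, (Block(1), Block(2))), Block(12, (Block(3), Block(5)))))
to_send = changed_data_blocks(old, new)
```

In this example the sub-tree under address 10 is never descended into, since its pointer is identical in both snapshots, and only the data block at address 5 is selected for propagation.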
Note that each directory block was sufficiently self-contained such that a logical description of the changed subset of a directory could be generated from one or more individual changed directory blocks, and that logical description was passed to the target server where directory entries based on this information were created or deleted. The changed subset had to include information on all of the directory entries that changed. In at least certain cases, the changed subset also included descriptions of other directory entries that were unchanged between the two snapshots but happened to reside in the same disk block as changed directory entries.
Finally, the snapshot and replication algorithms of Spinnaker Networks' SpinFS file system worked very similarly to those of DCE/DFS Episode. A significant difference, however, was that the SpinFS replication engine simply treated directories as files from the point of view of replica propagation, updating entire blocks of the target directory from the contents of the source directory.