A networked data storage system can be used for a variety of purposes, such as providing multiple users access to shared data, or facilitating backups or data mirroring. Such data is hereinafter referred to as user data because it is accessed and/or manipulated by users of the networked data storage system. A networked storage system may include a number of storage servers. A storage server may provide services related to accessing and organizing user data on mass storage devices, such as disks. Some storage servers are commonly referred to as filers or file servers, as these storage servers provide clients with file-level access to user data. Some of these filers further provide clients with sub-file level and/or block-level access to user data. An example of such a storage server is any of the filer products made by Network Appliance, Inc. in Sunnyvale, Calif. The storage server may be implemented with a special-purpose computer or a general-purpose computer programmed in a particular way. Depending on the application, various networked data storage systems may include different numbers of storage servers.
In addition to user data, the networked data storage systems have metadata. In general, metadata is created by operating systems of the storage servers in the networked data storage systems for organizing and keeping track of user data. A portion of the metadata is used to create persistent point-in-time images (PPIs) of user data. One example of a PPI is NetApp Snapshot™ provided by the filer products made by Network Appliance, Inc. in Sunnyvale, Calif.
To facilitate disaster recovery, a first storage server may replicate both user data and metadata (collectively referred to as data in the current document) in a first volume into a second volume, where the second volume becomes a mirror image of the first volume. A volume is a logical data set which is an abstraction of physical storage, combining one or more physical mass storage devices (e.g., disks) or parts thereof into a single logical storage object, and which is managed as a single administrative unit, such as a single file system. The relationship between the first and the second volumes is referred to as a mirroring relationship because the second volume is a mirror image of the first volume. Note that the second volume may be managed by the first storage server or a second storage server. In the following discussion, it is assumed that a second storage server manages the second volume. However, one should appreciate that the concept described herein is applicable to situations in which the first storage server also manages the second volume.
When a PPI is created on the first volume, its metadata is created and replicated onto the second volume. The PPI could be considered “copied” when all the user data and metadata associated with it are replicated over to the second storage server. Then the second storage server makes the PPI available to clients accessing the second storage server.
Conventionally, the first storage server waits for the second storage server to make the PPI available to clients. Thus, the first storage server depends on the second storage server making the PPI available to clients before the first storage server can process other requests that come after the PPI write. The above approach is further illustrated by the flow diagram in FIG. 1A.
Referring to FIG. 1A, the blocks on the left hand side of a dotted line 350 are performed by a source storage server and the blocks on the right hand side of the dotted line 350 are performed by a destination storage server. A PPI operation to replicate a PPI starts in block 340. The source storage server writes the PPI metadata to the destination storage server in block 342. In response to the write, the destination storage server copies the PPI metadata in block 344. After copying the PPI metadata, the destination storage server makes the PPI available to clients at the destination storage server in block 346. At this point, the metadata of both the source and the destination storage servers' volumes are in sync. Finally, an acknowledgement 349 is sent to the source storage server, which now considers the PPI metadata write to be completed in block 348. At this point, all volumes' metadata is in sync and writes that come after the metadata write are allowed. The source storage server also considers the PPI operation done at this point.
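The ordering of blocks 340 through 348 can be illustrated with a minimal sketch. The class names, method names, and in-memory logs below are hypothetical and chosen only for illustration; the point shown is that, under the conventional approach, the acknowledgement 349 is returned only after the destination storage server has made the PPI available to clients, so the source storage server stays blocked until then.

```python
from dataclasses import dataclass, field


@dataclass
class DestinationServer:
    """Hypothetical stand-in for the destination storage server of FIG. 1A."""
    ppi_metadata: dict = field(default_factory=dict)
    visible_ppis: set = field(default_factory=set)
    log: list = field(default_factory=list)

    def handle_ppi_write(self, name, metadata):
        # Block 344: copy the PPI metadata.
        self.ppi_metadata[name] = dict(metadata)
        self.log.append("344: metadata copied")
        # Block 346: make the PPI available to clients.
        self.visible_ppis.add(name)
        self.log.append("346: PPI available to clients")
        # Acknowledgement 349 is returned only after block 346.
        self.log.append("349: acknowledgement sent")
        return True


@dataclass
class SourceServer:
    """Hypothetical stand-in for the source storage server of FIG. 1A."""
    destination: DestinationServer
    log: list = field(default_factory=list)

    def replicate_ppi(self, name, metadata):
        self.log.append("340: PPI operation starts")
        self.log.append("342: metadata written to destination")
        # The source blocks here until the destination acknowledges.
        acked = self.destination.handle_ppi_write(name, metadata)
        if acked:
            self.log.append("348: metadata write complete; later writes allowed")
        return acked
```

In this sketch the synchronous call models the wait: nothing after block 342 runs on the source until the destination has finished block 346.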
However, the above approach leads to both performance issues and potential deadlock situations. Performance suffers because the first storage server holds up write operations that come in after the PPI has been written to the second storage server until the second storage server has made the PPI available to clients. A deadlock could occur when the storage servers have dependencies among themselves. To explain these dependencies, the concept of consistency points (CPs) is described below.
A CP is a predefined event that occurs regularly in a storage server. The CP involves a mechanism by which batches of client writes are committed to a volume's storage devices. A CP occurs when all the user data and metadata of the volume are committed in a transaction-style operation, i.e., either all of the user data and metadata are committed or none of it is. To do so, the storage server first writes the user data and metadata to the volume's storage devices without making any of it active. The storage server then makes the written user data and metadata active by writing a superblock to the volume's storage devices. A superblock is a block of metadata which describes the overall structure of the storage system within the volume. For example, the superblock may contain references to the metadata of PPIs, information about the system that is made available to clients, the name of the storage system, the size of the system, etc. Information in the superblock is used to keep track of the state of the storage system during operation.
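The two-step, all-or-nothing nature of a CP can be sketched as follows. This is a simplified in-memory model under stated assumptions, not an actual file system implementation: the Volume class, its fields, and the fail_before_superblock flag are hypothetical and exist only to show that newly written blocks do not become active until the superblock is written.

```python
class Volume:
    """Hypothetical in-memory model of a volume's storage devices."""

    def __init__(self):
        self.blocks = {}  # block id -> contents (written, not necessarily active)
        self.superblock = {"cp": 0, "active": {}}  # describes the active state
        self._next_id = 0

    def _write_block(self, contents):
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = contents
        return block_id

    def consistency_point(self, pending, fail_before_superblock=False):
        # Step 1: write user data and metadata to new blocks.
        # Nothing written here is active yet.
        staged = dict(self.superblock["active"])
        for name, contents in pending.items():
            staged[name] = self._write_block(contents)
        if fail_before_superblock:
            raise RuntimeError("simulated failure before the superblock write")
        # Step 2: one superblock write makes everything active at once.
        self.superblock = {"cp": self.superblock["cp"] + 1, "active": staged}

    def read(self, name):
        # Reads go through the superblock, so only active blocks are visible.
        return self.blocks[self.superblock["active"][name]]
```

If the simulated failure occurs before step 2, the old superblock still describes the previous consistent state, so none of the newly written blocks are visible.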
Creating a PPI involves adding more volume metadata (a.k.a. PPI metadata) to that already written to the volume. When a CP is done, all the data and metadata, including the PPI metadata, are put down on the volume's storage devices (e.g., a disk), and the superblock is written out. Note that conventionally, the PPI metadata may be in place, but the PPI is not considered created until the superblock is written.
In some conventional system configurations, such as a bi-directional configuration or a circular configuration, the CPs of two storage servers may depend on each other such that each side waits for the other to transition to its next CP before it can move on to its own next CP. Specifically, a PPI operation on a source storage server results in a CP on the source storage server. The PPI operation further sends user data and metadata to a destination storage server. The destination storage server would buffer the user data and metadata in memory and write the user data and metadata to storage devices (e.g., disks) corresponding to a mirror volume in the background. Since it is a PPI operation, the destination storage server needs to flush all of the user data and metadata to the storage devices before the destination storage server can make the PPI available to clients. Conventionally, the user data and metadata are written to the storage devices indirectly, through a file-system CP operation. In other words, a CP has to be triggered on the destination storage server and thus, the CP on the source storage server is linked to the CP on the destination storage server. When each storage server is both a source and a destination, each CP waits on the other, and a deadlock between the two storage servers results. A conventional exemplary bi-directional configuration is shown in FIG. 1B.
Referring to FIG. 1B, the system 300 includes two storage servers 310 and 320, coupled to each other via a networked connection 305. The storage server 310 manages volumes A 311 and B′ 312. Likewise, the storage server 320 manages volumes B 321 and A′ 322. The data in volume A 311 is replicated onto volume A′ 322 and the data in volume B 321 is replicated onto volume B′ 312, as represented by the dashed arrows 306A and 306B, respectively. In other words, volume A′ 322 is a mirror image of volume A 311 and volume B′ 312 is a mirror image of volume B 321. Thus, the storage servers 310 and 320 are in a bi-directional mirroring configuration. The storage servers 310 and 320 are further coupled to client machines, which may access data in the volumes A 311, B′ 312, B 321, and A′ 322 via the storage servers 310 and 320, respectively.
When the storage server 310 writes the metadata of a PPI of volume A 311 to the storage server 320 in order to replicate the PPI to volume A′ 322, the storage server 320, according to the conventional practice, does not send a confirmation or acknowledgement for the write to the storage server 310 until the metadata has been copied onto the storage server 320 and the storage server 320 has made the PPI available to clients. If the storage server 320 simultaneously writes the metadata of a second PPI of volume B 321 to the storage server 310, the storage server 310 likewise would not send a confirmation for this write to the storage server 320 until the metadata has been copied onto the storage server 310 and the storage server 310 has made the PPI available to clients. However, neither the storage server 310 nor the storage server 320 can complete the process of making a PPI available to clients, because each has an outstanding write operation for which it has not yet received a confirmation. As such, a deadlock results between the storage servers 310 and 320.
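The bi-directional scenario above can be reproduced with a small simulation. The Server class and the run function are hypothetical, and the rule encoded in step (a server cannot make a received PPI available while one of its own PPI writes is unacknowledged) is a simplified reading of the conventional behavior described above, not an actual storage server implementation.

```python
class Server:
    """Hypothetical model of a storage server under the conventional scheme."""

    def __init__(self, name):
        self.name = name
        self.outstanding = 0  # own PPI writes not yet acknowledged
        self.incoming = []    # (sender, ppi) pairs copied but not yet available
        self.visible = []     # PPIs made available to clients

    def send_ppi(self, peer, ppi):
        self.outstanding += 1
        peer.incoming.append((self, ppi))

    def step(self):
        """Try to make one received PPI available; return True on progress."""
        # Assumed conventional rule: no PPI is made available while this
        # server still waits for an acknowledgement of its own write.
        if not self.incoming or self.outstanding > 0:
            return False
        sender, ppi = self.incoming.pop(0)
        self.visible.append(ppi)
        sender.outstanding -= 1  # the acknowledgement is sent only now
        return True


def run(servers, max_rounds=10):
    """Return True if every PPI became available, False if progress stalled."""
    for _ in range(max_rounds):
        progress = [s.step() for s in servers]
        if not any(progress):
            break
    return all(not s.incoming for s in servers)
```

With a single mirroring direction, run completes. When both servers call send_ppi on each other before either calls step, neither can make progress, which corresponds to the deadlock between the storage servers 310 and 320.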
Note that a similar problem exists with multiple conventional storage servers having a circular configuration. For example, three conventional storage servers A, B and C may have a potential deadlock between them when storage server A mirrors one of its volumes to another volume managed by storage server B, storage server B mirrors one of its volumes to another volume managed by storage server C, and storage server C mirrors one of its volumes to another volume managed by storage server A.
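Both the bi-directional and the circular configurations amount to a cycle in the graph of mirroring relationships. As an illustration only, the following sketch finds such a cycle with a depth-first search; the function name and the dictionary representation of mirroring relationships are assumptions made for this example.

```python
def find_mirroring_cycle(mirrors):
    """Return a list of servers forming a cycle, or None if there is none.

    `mirrors` maps each source server to the servers that manage its
    mirror volumes (hypothetical representation).
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in mirrors.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path: a cycle is closed.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for server in mirrors:
        if color.get(server, WHITE) == WHITE:
            cycle = visit(server)
            if cycle:
                return cycle
    return None
```

For example, find_mirroring_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}) reports the circular configuration described above, while a simple chain from A to B to C yields None.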