Frequently, businesses, institutions, and the like use data protection management systems to protect their data from accidental loss and/or corruption. Simply stated, a data protection management system replicates information from protected volumes into a storage pool. If and when it is needed, the replicated information in the storage pool can be retrieved.
FIG. 1 is a block diagram illustrating an exemplary data protection management environment 100. Shown in FIG. 1 are three separate servers, file server 102, SQL server 104, and Exchange server 106. Each of these servers 102-106 is illustrated as being connected to a storage device, storage devices 108-112 respectively. These storage devices may be protected in whole, or in part, by the data protection server 116. The protected content, i.e., that content on a server which is designated for protection by the data protection management system, is replicated by the data protection server 116 into a storage pool 118 associated with the data protection server 116. For example, assuming that content in volume 114 on the storage device 108 connected to file server 102 is to be protected, this volume is (at some point) replicated into the storage pool 118 by the data protection server 116, such as shown by volume 114′ on storage device 112.
As those skilled in the art will appreciate, the storage pool 118 typically comprises a number of large storage devices, such as storage devices 120-128. The storage devices 120-128 are typically slower and cheaper than those used by or connected to the various protected servers, such as storage devices 108-110. The storage pool 118 can use slower devices since they are not relied upon for immediate storage purposes, nor needed in normal operation of the protected server, such as file server 102. Instead, they are used for replicating and restoring files, and as such, higher latency times can be tolerated.
Assuming the contents of protected volume 114 on the storage device 108 are lost, corrupted, or otherwise needed from another location, a process directs the data protection server 116 to retrieve the replicated volume 114′ from the storage pool 118 and return it to the process, either to store it back onto the storage device 108 or to use it in some other manner.
With many data protection management systems 100, such as Microsoft Corporation's System Center Data Protection Manager, data protection occurs in two stages. The first stage involves simply copying/replicating the protected content (i.e., volumes, files, storage devices, etc.) from the protected server, such as file server 102, to the data protection server's storage pool 118. Once the protected content is in the storage pool 118, the second stage involves capturing modifications to the protected content and making those changes to the replicated content in the storage pool. Capturing the modifications to the protected content is described below in regard to FIG. 2.
FIG. 2 is a block diagram illustrating various components installed on a server, such as file server 102, for protecting content associated with the server in conjunction with a data protection server 116. In particular, components installed on the file server 102 include a data protection agent 202 and a file system filter 204. The file system filter 204 interacts with the operating system to detect modifications to the protected content on the file server 102. In short, the file system filter 204 hooks into the operating system, typically at the kernel level, such that it acts as an extension of the operating system that detects when modifications are made to protected content. As those skilled in the art will appreciate, perhaps the most common use of file system filters is in anti-virus applications, which scan particular file types for corruption, malware, etc.
The data protection agent 202 is the user mode counterpart of the file system filter 204 on the file server 102. The data protection agent 202 is in communication with the data protection server 116 in handling requests for the initial replicated content, log files (a collection of change records described below), and restoration requests. In many cases, the data protection agent 202 serves as the link between the file system filter 204 and its change records on the one hand, and the data protection server 116 on the other.
With regard to the file server 102, as modifications are made to protected content, the file system filter 204 captures these modifications and records each modification as a change record in a records cache, such as records caches 206 or 208. Typically, a file server 102 will include multiple records caches that are usually retained in random access memory.
With regard to the change records, it should be appreciated that each change record represents a single modification action only, not the entirety of a modified file. For example, if a file is modified by overwriting a particular range with updated data, only the action to be taken (i.e., write), the file identifier, the range, and the updated data are written to a change record. For each type of modification to the protected content (creation, deletion, etc.), the action-specific information needed to capture the essence of the modification is recorded. As those skilled in the art will appreciate, by subsequently applying the change records to the replicated content, the replicated content is brought “up-to-date” with the modified protected content on the file server system.
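By way of illustration only, a change record and its application to replicated content might be sketched as follows. The names ChangeRecord and apply_record, the field layout, and the use of an in-memory dictionary to stand in for the replicated content are all assumptions made for this example; they are not drawn from any actual data protection product.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ChangeRecord:
    """One modification action (hypothetical layout), not a whole file."""
    action: str          # "create", "write", or "delete"
    file_id: str         # identifier of the affected file
    offset: int = 0      # start of the modified range (writes only)
    data: bytes = b""    # updated data for that range (writes only)


def apply_record(replica: Dict[str, bytearray], rec: ChangeRecord) -> None:
    """Apply a single change record to the replicated content."""
    if rec.action == "create":
        replica[rec.file_id] = bytearray()
    elif rec.action == "write":
        buf = replica[rec.file_id]
        end = rec.offset + len(rec.data)
        if len(buf) < end:
            # Grow the replica file with zero bytes if the write extends it.
            buf.extend(bytes(end - len(buf)))
        buf[rec.offset:end] = rec.data
    elif rec.action == "delete":
        del replica[rec.file_id]
    else:
        raise ValueError(f"unknown action: {rec.action}")


# Only the changed range travels in the record, not the whole file.
replica = {"report.txt": bytearray(b"hello world")}
apply_record(replica, ChangeRecord("write", "report.txt", 6, b"there"))
print(bytes(replica["report.txt"]))  # b'hello there'
```

In this sketch the record for the overwrite carries five bytes of data plus its range, regardless of how large the file itself may be.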
As indicated above, records caches are typically random access memory areas and are of limited size. Thus, as a records cache fills, the change records in the cache are transferred to a log file 212 in a special area 210 on the protected volume that is not protected in the typical manner by the data protection management system. The change records in the records caches 206 and 208 are also transferred to the log file 212 on external directives to “flush” their contents (change records) to the log file.
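The flushing behavior described above may be sketched, under simplifying assumptions, as follows. The class RecordsCache, its capacity parameter, and the use of a plain list to stand in for the log file 212 are hypothetical and serve only to illustrate the two flush triggers: a full cache and an external directive.

```python
from typing import List


class RecordsCache:
    """Bounded in-memory cache of change records (hypothetical sketch)."""

    def __init__(self, capacity: int, log_file: List[str]) -> None:
        self.capacity = capacity
        self.log_file = log_file       # stands in for the on-disk log file
        self.records: List[str] = []

    def add(self, record: str) -> None:
        """Cache a record, flushing automatically once the cache is full."""
        self.records.append(record)
        if len(self.records) >= self.capacity:
            self.flush()

    def flush(self) -> None:
        """Move all cached records to the log file, preserving their order."""
        self.log_file.extend(self.records)
        self.records.clear()


log_file: List[str] = []
cache = RecordsCache(capacity=2, log_file=log_file)
cache.add("write f1")
cache.add("write f2")    # cache is now full, so both records move to the log
cache.add("delete f1")   # remains in memory until the next flush
cache.flush()            # external directive to flush
print(log_file)          # ['write f1', 'write f2', 'delete f1']
```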
On a periodic basis, the data protection server 116 requests the log file from the protected file server 102, via the data protection agent 202. In order to properly field the request, the data protection agent 202 will typically direct the file system filter 204 to first flush any change records cached in the records caches to the log file 212. Thereafter, the contents (change records) of the log file 212 are transferred to the data protection server 116, and the data protection server applies the change records from the log file to the replicated content in the storage pool 118, thereby bringing the replicated content up to date with the protected content.
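One pass of this periodic exchange may be sketched as follows. The class Agent, the function synchronize, and the write-only record format are assumptions made for brevity. Emptying the log once it has been transferred is likewise an assumption of this sketch, as the description above does not say what becomes of the log file after transfer.

```python
from typing import Dict, List, Tuple

# (file_id, offset, data): a write-only change record, for brevity.
Record = Tuple[str, int, bytes]


class Agent:
    """Stands in for the agent and filter on the protected server (sketch)."""

    def __init__(self) -> None:
        self.cache: List[Record] = []   # in-memory records cache
        self.log: List[Record] = []     # log file on the protected volume

    def record_write(self, file_id: str, offset: int, data: bytes) -> None:
        self.cache.append((file_id, offset, data))

    def fetch_log(self) -> List[Record]:
        """Flush the cache to the log first, then hand over the log."""
        self.log.extend(self.cache)
        self.cache.clear()
        shipped, self.log = self.log, []   # assumption: log is emptied
        return shipped


def synchronize(agent: Agent, replica: Dict[str, bytearray]) -> int:
    """Request the log and apply its records, in order, to the replica."""
    records = agent.fetch_log()
    for file_id, offset, data in records:
        buf = replica.setdefault(file_id, bytearray())
        end = offset + len(data)
        if len(buf) < end:
            buf.extend(bytes(end - len(buf)))
        buf[offset:end] = data
    return len(records)


agent = Agent()
replica = {"f": bytearray(b"old data")}
agent.record_write("f", 0, b"new")
applied = synchronize(agent, replica)
print(applied, bytes(replica["f"]))  # 1 b'new data'
```

Flushing before the transfer matters in this sketch: without it, the most recent records would remain in the cache and the replica would lag behind the protected content.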
At least one problem with the data protection model described above arises when a protected server, such as file server 102, is actually a clustered file server, or cluster for short. As appreciated by those skilled in the art, a cluster is a group of independent computers that operate collectively and appear to a client (a user or another computer) as if they were a single computer system. Clusters are designed to improve capacity and ensure reliability in the case of a failure. For example, when one of the nodes in the cluster fails, the operations carried out by that node can be shifted over to another cluster node. Unfortunately, this “failover” is also the source of difficulties with regard to data protection management.
FIGS. 3A and 3B are block diagrams illustrating suggested ways in which a data protection server can interact with a clustered server, and the problems related thereto. With regard to FIG. 3A, this block diagram illustrates the data protection server 116 operating with a cluster 302, treating the cluster as a single file server with protected content. The cluster 302 is shown as including three cluster nodes, nodes 306-310, but this is for illustration purposes only, and should not be construed as limiting upon the present invention.
As shown in FIG. 3A, when treating the cluster 302 as a single file server, only one data protection agent 318 and one file system filter 312 have been deployed onto the cluster, and arbitrarily they were placed on node 306.
As those skilled in the art will appreciate, in a clustered environment, even though all cluster nodes are potentially able to communicate with a particular volume 304, only one cluster node, such as cluster node 308, can communicate with the volume at any one time. All other connections between the cluster's nodes and the volume are potential, not actual, connections (as illustrated by the dotted connecting lines). As a consequence of the cluster arrangement, any reads, writes, creations, deletions, etc., that affect the content on the volume 304 are directed to the one cluster node 308 that is in current, actual communication with the volume.
In this light, one problem with treating the cluster 302 as a single entity, and one that is quite evident with regard to data protection management, is that only one data protection agent 318 and one file system filter 312 are deployed on the cluster, and they may or may not reside on the cluster node 308 that is in actual communication with the volume 304. Thus, modifications directed to the volume 304 may or may not be recorded by the file system filter 312, and the ability of the data protection server 116 to update the replicated content would be lost. Of course, even if the data protection agent 318 and file system filter 312 were initially installed on the same cluster node that had actual communication with the cluster volume 304, the nature of cluster technology is that under any number of conditions, e.g., node failure, reallocation of processes, etc., the cluster node with the actual connection may change. As such, even if the data protection agent 318 and file system filter 312 are installed on the cluster node in actual communication with the protected cluster volume 304, the data protection system could not be trusted to provide reliable data protection.
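The exposure described above can be illustrated with a small simulation. The class Cluster, the node names, and the captured and missed lists are hypothetical constructs used only to show how a single filter placed on one node stops seeing modifications once another node takes over the volume.

```python
from typing import List


class Cluster:
    """Toy cluster with one file system filter on a fixed node (sketch)."""

    def __init__(self, active: str, filter_node: str) -> None:
        self.active = active             # node in actual communication
        self.filter_node = filter_node   # node where the filter is installed
        self.captured: List[str] = []    # modifications the filter recorded
        self.missed: List[str] = []      # modifications nobody recorded

    def write(self, change: str) -> None:
        # All I/O for the volume is routed through the active node, so only
        # a filter installed on that node can observe the modification.
        if self.active == self.filter_node:
            self.captured.append(change)
        else:
            self.missed.append(change)

    def failover(self, new_active: str) -> None:
        self.active = new_active


cluster = Cluster(active="node306", filter_node="node306")
cluster.write("change 1")      # filter is on the active node: captured
cluster.failover("node308")    # e.g. node failure or reallocation
cluster.write("change 2")      # filter is no longer on the active node
print(cluster.captured, cluster.missed)  # ['change 1'] ['change 2']
```

After the failover in this sketch, the replicated content can never receive "change 2", because no change record for it exists anywhere.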
On the other hand, as illustrated in FIG. 3B, the data protection management system could alternatively distribute data protection agents, such as data protection agents 318-322, and file system filters, such as file system filters 312-316, on each cluster node 306-310. This means, of course, that the data protection server 116 must be cluster-aware, and as such, the data protection server must communicate with all data protection agents 318-322 to obtain the change records/log file for the protected content. Of course, each file system filter 312-316 may have change records stored in one or more records caches 206-208, depending on when a failover or transfer of duties occurred among the various cluster nodes with regard to actual communication with the cluster volume 304 (assuming it is the protected content). At best, this means substantial extra work for the data protection server 116 in resolving the sequence in which the various change records occurred. More likely, however, this means that upon failover or transfer in the cluster 302, the sequence of change records recorded by the various file system filters 312-316 in the records caches becomes hopelessly obscured, to the point that any attempt by the data protection server 116 to apply the modifications outlined by the change records to the replicated content could only result in corrupting the replicated content.
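The ordering problem may be illustrated as follows. The function apply_writes and the two per-node record lists are assumptions made for this example. Each node holds the records it captured while it was the active node, and nothing in the records themselves tells the server which node's records came first.

```python
from typing import Iterable, Tuple

Write = Tuple[int, bytes]  # (offset, data): a write-only change record


def apply_writes(initial: bytes, writes: Iterable[Write]) -> bytes:
    """Apply writes in the given order and return the resulting content."""
    buf = bytearray(initial)
    for offset, data in writes:
        end = offset + len(data)
        if len(buf) < end:
            buf.extend(bytes(end - len(buf)))
        buf[offset:end] = data
    return bytes(buf)


# Records captured on two nodes, before and after a failover (hypothetical).
node_306_records = [(0, b"AAAA")]   # written first, while node 306 was active
node_308_records = [(0, b"BB")]     # written second, after the failover

# True order on the protected volume: node 306's write, then node 308's.
protected = apply_writes(b"....", node_306_records + node_308_records)

# The server happens to collect node 308's records first.
replica = apply_writes(b"....", node_308_records + node_306_records)

print(protected)  # b'BBAA'
print(replica)    # b'AAAA'
```

Because overlapping writes do not commute, the replica in this sketch silently diverges from the protected content as soon as the records are applied in the wrong order.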
It is for the reasons described above that some data protection management systems simply exclude clusters from their protection.