Clusters of servers or nodes are frequently used to deliver network services. In that regard, the clusters manage resources that provide services. Sometimes it is necessary or desirable to back up (or take a snapshot of) a resource in a cluster. When performing a backup or creating a snapshot of data, it is desirable to have a consistent data set. This requires that the processes which manage the data flush their caches, buffers, and queues to persistent storage, so that the stored data is consistent. Moreover, while the snapshot is being created or the backup is running, the process or service should not process requests or operations that could dirty the data set and make it inconsistent. Therefore, the resources used by the service or process are briefly frozen to suspend operation and reach a consistent data state. The resources are then thawed to resume operation after the snapshot or backup is completed.
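By way of illustration only, the freeze, snapshot, and thaw cycle described above may be sketched as follows. The `Resource` class, its in-memory `buffer` and `store`, and the `frozen` and `snapshot` helpers are hypothetical names introduced for this sketch and do not correspond to any particular cluster manager. The sketch assumes that a frozen resource rejects new writes and that flushing moves buffered data to a list standing in for persistent storage.

```python
from contextlib import contextmanager


class Resource:
    """Hypothetical resource with a write buffer and a persistent store."""

    def __init__(self, name):
        self.name = name
        self.buffer = []   # data not yet persisted
        self.store = []    # stand-in for persistent storage
        self.frozen = False

    def write(self, item):
        # A frozen resource must not accept operations that dirty the data set.
        if self.frozen:
            raise RuntimeError(f"{self.name} is frozen")
        self.buffer.append(item)

    def flush(self):
        # Move buffered data to the persistent store.
        self.store.extend(self.buffer)
        self.buffer.clear()


@contextmanager
def frozen(resource):
    """Suspend writes, flush to storage, and always thaw on exit."""
    resource.frozen = True      # stop accepting new writes first
    resource.flush()            # then reach a consistent on-disk state
    try:
        yield resource
    finally:
        resource.frozen = False  # thaw even if the snapshot failed


def snapshot(resource):
    """Return a copy of the persistent store taken while frozen."""
    with frozen(resource) as r:
        return list(r.store)
```

In this sketch the thaw is placed in a `finally` clause so that the resource resumes operation whether or not the snapshot succeeds; a real system would likely pair this with the recovery handling discussed below.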
The foregoing situation is complicated by the fact that certain resources running on a node may have dependencies that affect the order in which the resources must be frozen and thawed. The need to properly sequence the freeze (suspend) and thaw (resume) of the resources creates complexity. The situation is further complicated by the fact that, in the event of a failure during freeze, backup, or thaw, it is desirable to perform node recovery. For example, the node may be shut down, cleaned up, and restarted. Additionally, the resources may also need to be shut down and restarted. However, if those resources have dependencies, then an orderly recovery process requires properly sequencing the stopping and starting of the resource on the failed node and of the other resources on which it depends. Therefore, a backup or snapshot of resources is further complicated by the need to recover from a failure during the freezing, backup, or thawing of a resource.
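As a non-limiting sketch of the sequencing problem described above, the freeze and thaw order may be derived from a dependency graph by topological sorting. The sketch assumes, purely for illustration, that a resource is frozen before the resources it depends on and that thawing proceeds in the reverse order; the text above does not prescribe a particular direction. The resource names and the `freeze`, `backup`, and `thaw` callbacks are hypothetical. The sketch uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def start_order(depends_on):
    """Order in which dependencies come before the resources that need them."""
    return list(TopologicalSorter(depends_on).static_order())


def freeze_order(depends_on):
    """Assumed freeze order: dependents first, their dependencies last."""
    return list(reversed(start_order(depends_on)))


def freeze_backup_thaw(depends_on, freeze, backup, thaw):
    """Freeze in order, back up, then thaw whatever was frozen in reverse."""
    frozen = []
    try:
        for name in freeze_order(depends_on):
            freeze(name)
            frozen.append(name)  # recorded only after a successful freeze
        backup()
    finally:
        # On success or failure, thaw only the resources actually frozen.
        for name in reversed(frozen):
            thaw(name)


# Hypothetical example: a service depends on a file system,
# which in turn depends on a volume.
DEPENDS_ON = {
    "service": {"filesystem"},
    "filesystem": {"volume"},
    "volume": set(),
}
```

With `DEPENDS_ON` as shown, the sketch freezes `service`, then `filesystem`, then `volume`, and thaws them in the opposite order. If a freeze fails partway, only the resources already frozen are thawed and the exception propagates to the caller, which could then trigger node recovery using the same ordering to stop and restart resources.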
The foregoing situation is still further complicated by the fact that a resource may be distributed across multiple nodes. For example, in a clustered file system, the file system is distributed across multiple nodes. For an orderly backup of the clustered file system, it is desirable to coordinate the backup of each instance of the file system. In particular, it is desirable to coordinate the freeze of the clustered file system so that each instance of the file system is simultaneously frozen. Thereafter, it is desirable to coordinate the backup and thaw of the clustered file system so that each instance is simultaneously backed up and simultaneously thawed. Accordingly, it is generally desirable to coordinate the backup of a resource that is present on multiple nodes so that the backup is orderly.
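The coordination described above may be illustrated, under simplifying assumptions, by a barrier between phases: no instance is backed up until every instance is frozen, and no instance is thawed until every instance is backed up. In the following sketch, threads within a single process stand in for cluster nodes, and the `freeze`, `backup`, and `thaw` callbacks are hypothetical. A real cluster would require a distributed coordination mechanism, as well as timeout and failure handling, which are omitted here.

```python
import threading


def coordinated_backup(node_names, freeze, backup, thaw):
    """Run freeze, backup, and thaw on every node in lockstep phases."""
    count = len(node_names)
    all_frozen = threading.Barrier(count)     # released once every node is frozen
    all_backed_up = threading.Barrier(count)  # released once every backup is done

    def run(node):
        freeze(node)
        all_frozen.wait()      # do not back up until all instances are frozen
        backup(node)
        all_backed_up.wait()   # do not thaw until all instances are backed up
        thaw(node)

    threads = [threading.Thread(target=run, args=(n,)) for n in node_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because each phase ends at a barrier, the instances move through freeze, backup, and thaw together, which approximates the simultaneous behavior described above. If a callback raises in one thread, the remaining threads in this sketch would wait indefinitely at the barrier; a practical design would abort the barrier and initiate recovery instead.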