Practical file systems used for mission-critical computing need protection against catastrophic failures, such as system crashes and data loss. Such systems also need protection from unintended corruption of data and from user or application errors. A common practice to protect data from permanent loss is to make backup copies of the data periodically. For example, users may copy the data to another file system where the data can be accessed online (typically in a read-only manner) or to backup storage devices such as tapes, where the data stays offline until mounted again. These two approaches are not mutually exclusive.
A simple approach to making backups of a file system is to copy the entire contents of the file system to tapes (or other non-volatile mass storage devices) every time a backup is performed. To recover the file system to its state at the time of a particular backup, that backup is simply restored in full to the target file system. However, due to a characteristic typical of most file systems, this approach is not the most desirable. A typical file system does not change rapidly over time. Between two consecutive backups, only a very small percentage (e.g., 5%) of the data may have changed. Most of the data is identical between any two consecutive backups.
The aforementioned characteristic raises two problems. First, on a second backup, a significant amount of tape space tends to be allocated to data that already exists in the first backup. This decreases the efficiency of tape space and raises costs. Second, a significant amount of time tends to be spent saving the same data again in the second backup.
Based on these considerations, the concept of an incremental backup has been introduced. An incremental backup is based on and follows a full backup. After a full backup, the subsequent incremental backup will capture an image of only the data that has changed since the full backup. This process can continue indefinitely, wherein there are an unlimited number of subsequent incremental backups, each based on the previous incremental backup. On recovery, the full backup is first restored. Then the changes represented by the first incremental backup are distributed to the appropriate locations on the file system to bring the file system up to the state at the time the incremental backup was done. This process is iterated for each of the incremental backups.
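The restore sequence described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the file system is modeled as an in-memory dictionary mapping a path to its content, and the `Incremental` class, its `changed` and `removed` fields, and the `restore` function are hypothetical names chosen for this example, not part of any actual backup product.

```python
from dataclasses import dataclass, field


@dataclass
class Incremental:
    """Hypothetical incremental backup: only what changed since its base."""
    changed: dict = field(default_factory=dict)  # path -> new content
    removed: set = field(default_factory=set)    # paths deleted since the base


def restore(full, incrementals):
    """Restore the full backup, then apply each incremental, oldest first."""
    fs = dict(full)                  # step 1: restore the full backup
    for inc in incrementals:         # step 2: iterate over the incrementals
        for path in inc.removed:
            fs.pop(path, None)       # drop entries removed since the base
        fs.update(inc.changed)       # write new or modified entries
    return fs


full = {"/a.txt": "v1", "/b.txt": "v1"}
inc1 = Incremental(changed={"/a.txt": "v2"})
inc2 = Incremental(changed={"/c.txt": "v1"}, removed={"/b.txt"})

restored = restore(full, [inc1, inc2])
print(restored)  # {'/a.txt': 'v2', '/c.txt': 'v1'}
```

The order of application matters: each incremental is only meaningful relative to the state produced by the backup it is based on.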
Note that a similar process can be run with differential backups. Differential backups work similarly to incremental backups, except that differential backups are not always based on the most recent incremental backup. To minimize the number of backups to restore, a differential backup may be based on an older incremental, differential or full backup, thus eliminating the need to retain the incremental and differential backups taken between the base backup and the new differential. The cost of this approach is that the differential backup, by aggregating the changes of multiple incremental backups into one backup, imposes additional system load and consumes more backup storage.
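To illustrate how the choice of base affects the number of backups that must be restored, the sketch below walks from a target backup back to its full backup. The `base_of` mapping, the backup labels, and the `restore_chain` function are assumptions made for this example only.

```python
def restore_chain(target, base_of):
    """Return the backups to restore, oldest first, to reach `target`."""
    chain = []
    current = target
    while current is not None:       # a full backup has no base
        chain.append(current)
        current = base_of[current]
    chain.reverse()
    return chain


# Hypothetical history: three chained incrementals, then one differential
# taken afterwards and based directly on the full backup.
base_of = {
    "full": None,
    "inc1": "full",
    "inc2": "inc1",
    "inc3": "inc2",
    "diff1": "full",
}

print(restore_chain("inc3", base_of))   # ['full', 'inc1', 'inc2', 'inc3']
print(restore_chain("diff1", base_of))  # ['full', 'diff1']
```

In this hypothetical history, restoring through the differential requires two backups instead of four, at the price of the differential having to carry all changes made since the full backup.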
One problem with existing recovery algorithms based on full and incremental backups is that they tend to be very slow. It is desirable to have a data recovery algorithm that will efficiently apply an incremental backup to a restored full backup, to transform a file system to its state at the point of the incremental backup. This is known as a true image recovery, in which the recovered data exactly matches the state of the data at the time of backup. This contrasts with extraction recoveries, in which data is recovered from incremental or differential backups without renaming or removing the files that were renamed or removed between the full and incremental backups.
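The difference between the two kinds of recovery can be shown with a small sketch. It assumes, purely for illustration, that the incremental backup carries both the changed files and a listing of every path that existed at backup time; the function names and the `paths_at_backup` parameter are hypothetical.

```python
def extraction_recovery(full, changed):
    """Overlay changed files on the full backup; nothing is removed."""
    fs = dict(full)
    fs.update(changed)
    return fs


def true_image_recovery(full, changed, paths_at_backup):
    """Overlay changed files, then drop paths absent at backup time."""
    fs = extraction_recovery(full, changed)
    return {path: data for path, data in fs.items() if path in paths_at_backup}


# Between the two backups, old_name.txt was renamed and tmp.txt was removed.
full = {"/old_name.txt": "data", "/tmp.txt": "scratch"}
changed = {"/new_name.txt": "data"}
paths_at_backup = {"/new_name.txt"}

extracted = extraction_recovery(full, changed)
true_image = true_image_recovery(full, changed, paths_at_backup)
print(sorted(extracted))   # ['/new_name.txt', '/old_name.txt', '/tmp.txt']
print(sorted(true_image))  # ['/new_name.txt']
```

Only the true image result matches the file system as it stood at the time of the incremental backup; the extraction result still contains the stale name and the removed file.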
Some file systems are tree-structured. An example of such a file system is the Write Anywhere File Layout (WAFL) file system used in Filer products made by Network Appliance, Inc. of Sunnyvale, Calif. A file system is commonly made up of directories and individual files. The files usually contain the valuable data, while the directories provide a hierarchy/organization of the files. Every directory contains zero or more “entries”, which can be files or sub-directories. There is commonly a root directory, under which there can be multiple sub-directories. Each sub-directory can hold its own sub-directories, and so on. This forms a tree structure organization in which files are distributed among the directories in their defined relationship. On backup and restore, not only is it desirable to recover the data in files, it is also desirable to recover the exact organization of data/files.
In some file servers, a full backup gathers all files and directories and their content from the file system and writes them out to tapes. The incremental backup that follows will only write out files and directories that are new or have been modified since the full backup. The modifications to a directory include addition, removal and renaming of files and sub-directories, as well as content update of an existing file in the directory.
In a typical tree-structured file system such as described above, for every modified directory there is a path that leads to the root directory. Therefore, to generate the full pathname on a recovery, during incremental backup all directories along that path also need to be written out to tapes. These directories might not have been modified; they are in the incremental backup only because one or more of their descendants has been modified. As a result, time is unnecessarily spent processing unmodified directories when the incremental backup is applied to the full backup during recovery.
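The effect described above can be sketched by computing which directories end up in an incremental backup for a given set of modified directories. This is an illustrative model only: it assumes absolute, POSIX-style paths, and `directories_to_write` is a hypothetical helper, not a function of any particular file server.

```python
import posixpath


def directories_to_write(modified_dirs):
    """Return every modified directory plus all ancestors up to the root."""
    out = set()
    for directory in modified_dirs:
        current = posixpath.normpath(directory)
        while True:
            out.add(current)
            parent = posixpath.dirname(current)
            if parent == current:    # reached the root directory
                break
            current = parent
    return out


modified = {"/home/alice/docs"}
written = directories_to_write(modified)
unmodified_ancestors = written - modified

print(sorted(written))               # ['/', '/home', '/home/alice', '/home/alice/docs']
print(sorted(unmodified_ancestors))  # ['/', '/home', '/home/alice']
```

Here a single modified directory causes three unmodified directories to be written out, and each of them must later be processed when the incremental backup is applied.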
Another problem with at least one known data recovery algorithm is that in the process of determining which entries to add, delete and keep, every entry in the new directory is compared to every entry in the old directory. The algorithm, therefore, performs on the order of m×n comparisons between entries in the old directory and entries in the new directory, where m is the number of entries in the old directory and n is the number of entries in the new directory. As a directory grows, the amount of time needed to process the directory during recovery grows quadratically: doubling the number of entries roughly quadruples the number of comparisons. The algorithm, therefore, does not scale with the directory size.
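The comparison count can be made concrete with a short sketch of such a pairwise comparison. It is a simplified model, assuming directory entries are plain name strings; `diff_pairwise` is a hypothetical function written for this example and is not taken from any specific prior-art implementation.

```python
def diff_pairwise(old_entries, new_entries):
    """Compare every new entry with every old entry and count comparisons."""
    comparisons = 0
    matched_old = [False] * len(old_entries)
    keep, add = [], []
    for new in new_entries:
        found = False
        for i, old in enumerate(old_entries):
            comparisons += 1         # one comparison per (new, old) pair
            if new == old:
                found = True
                matched_old[i] = True
        (keep if found else add).append(new)
    # Old entries that matched nothing in the new directory must be deleted.
    delete = [old for old, hit in zip(old_entries, matched_old) if not hit]
    return keep, add, delete, comparisons


keep, add, delete, count = diff_pairwise(["a", "b", "c"], ["b", "c", "d"])
print(keep, add, delete, count)  # ['b', 'c'] ['d'] ['a'] 9

# Doubling the directory size quadruples the number of comparisons.
for size in (100, 200):
    names = [f"f{i}" for i in range(size)]
    print(size, diff_pairwise(names, names)[3])  # 100 10000, then 200 40000
```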
What is needed, therefore, is a data recovery algorithm, based on full and incremental backups, which overcomes these and other disadvantages of the prior art.