Businesses and other entities store data objects (e.g., image files, text files, computer software, database data, directories and the like) on memory devices such as hard disks. But hard disks fail at the worst times and take all the data objects stored on them with them when they go. Thus was born the concept of creating backups of the data objects on separate recording media (e.g. magnetic tapes). For purposes of description only, the term “data objects” will be understood to mean files, it being understood that the term should not be limited thereto. The following description will be made with reference to backing up a data volume consisting of n files, it being understood that the present invention should not be limited thereto.
Backups protect against hardware failures, software failures, and user errors. Hardware failures can range from the failure of a single hard disk to the destruction of an entire data center, making some or all files of the data volume unrecoverable. Software failures are bugs or procedural errors in, for example, a server application that corrupts the contents of data files. User errors include errors such as inadvertent deletion or overwriting of files that are later required. In these cases, destroyed files generally impact the ability of a user or set of users to function.
Mirroring and replication technology can be configured to provide good protection against hardware failures. But these technologies will also write data corrupted by application errors every bit as reliably as they write correct data, and they faithfully record the file system or database metadata updates that result from a user's mistaken deletion of an important file on all mirrors or replicas. Because they are optimized to serve different purposes, mirroring and replication technologies have different goals than backup. Mirroring and replication attempt to preserve the bit-for-bit state of files as they change, while backup attempts to preserve the state of the files as of some past point-in-time at which the files of the data volume were known to be consistent. Mirrors or replicas keep the contents of all replicated devices or files identical to each other. Backup, however, does something quite different: it captures an image of the data volume at an instant in the past, so that if need be, everything that has happened to the data volume since that instant can be forgotten, and the state of operations can be restored to that instant.
Backups are typically created during late hours of the night. “Backup windows” are time intervals during which a computer is unoccupied by other tasks and therefore available for making backups of the data volume. Backup windows have been shrinking to accommodate increasing reliance on computers. With round-the-clock transaction processing (so credit cards will be honored at late night diners), the windows continue to shrink to essentially nothing.
Backup operations create backup sets (i.e., copies of one or more files of the data volume) that may be either full or incremental. A full backup set means that all of the files in the data volume are copied, regardless of how recently they have been modified or whether a previous backup set exists. An incremental backup means that only files of the data volume that have changed since some previous event (e.g., a prior full backup or incremental backup) are copied. The backup window for a full backup tends to be much larger when compared to the backup window for an incremental backup. For most applications, incremental backup is preferable at backup time since, in most cases, the number of files of the data volume that change between backups is very small compared to the number of files in the entire data volume and since the backup window is small. If backups are done daily or even more frequently, it is not uncommon for less than 1% of files to change between backups. An incremental backup in this case copies 1% of the data that a full backup would copy and uses 1% of the input/output (IO) resources. Incremental backup appears to be the preferred mode of guarding data. And so it is, until a full restore of all the files of the data volume is required. A full restore from incremental backups entails starting with the restore of the newest full backup copy, followed by restores of all newer incremental backups. That can require a lot of media handling time performed by, for example, an automated robotic handler. Thus, restore from full backups is generally simpler and more reliable than restore from combinations of full and incremental backups. For recovering from individual user errors, the situation is just the opposite. Users tend to work with one small set of files for a period of days or weeks and then work with a different set.
Accordingly, there is a high probability that a file destroyed by a user will have been used recently and therefore will be copied in one of the incremental backup operations. Since incremental backups contain a smaller fraction of data than a full backup, they can usually be searched much faster if a restore is required. The ideal from the individual user's standpoint is therefore many small incremental backups. Some backup systems offer a compromise: the ability to consolidate a baseline full backup and several incremental backups into a new, more up-to-date full backup, which becomes the baseline for further incremental backups. While costly in terms of the time needed to create them, these synthetic full backups simplify a restoration process.
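For purposes of illustration only, the full restore sequence described above (the newest full backup first, followed by every newer incremental backup in order) can be sketched as follows. The sketch assumes a hypothetical model in which each backup set is a simple mapping from a file name to file contents; the names `BackupSet` and `full_restore` are illustrative and do not come from any particular backup product.

```python
from typing import Dict, List

# Hypothetical model: a backup set maps a file name to that file's contents.
BackupSet = Dict[str, bytes]


def full_restore(full: BackupSet, incrementals: List[BackupSet]) -> BackupSet:
    """Restore the full backup, then apply each newer incremental in order.

    `incrementals` must be ordered oldest to newest so that the most
    recent version of each file is the one that survives.
    """
    volume = dict(full)          # start from the newest full backup copy
    for inc in incrementals:     # one media access per incremental set
        volume.update(inc)       # newer versions overwrite older ones
    return volume
```

Under this model, the result of `full_restore` is also what a consolidated (synthetic) full backup would contain, which is why restoring from such a consolidated set requires reading only one backup set.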
FIG. 1 illustrates in block diagram form, relevant components of a data processing system 10 which employs an exemplary backup and restore technology. FIG. 1 shows an application server 12 coupled to a data storage subsystem 14 via storage interconnect 16. Data storage subsystem 14 may include several physical storage devices. For purposes of explanation, the physical storage devices of storage subsystem 14 will take form in hard disks, it being understood that the term “physical storage device” should not be limited to hard disks. Further, for purposes of explanation, data storage subsystem 14 will take form in a disk array, it being understood that the term “data storage subsystem” should not be limited thereto. As will be more fully described below, disk array 14 contains an exemplary data volume VE of n files (file 1-file n).
FIG. 1 further includes a backup server 18 coupled to data storage subsystem 22 via storage interconnect 24. For purposes of explanation, data storage subsystem 22 will take form in a robotic tape handler having access to several magnetic tapes. Lastly, application server 12 and backup server 18 are coupled to each other via local area network (LAN) 26. LAN 26 transmits backup data from its source (e.g., disk array 14) to its target (e.g., robotic tape handler 22), or LAN 26 transmits the restoration data from its source (e.g., robotic tape handler 22) to its target (e.g., disk array 14).
FIGS. 2 and 3 illustrate relevant aspects of creating full, incremental and synthetic full backup sets of exemplary data volume VE. FIG. 2 represents disks and tapes that store data volume VE and backup sets thereof. More particularly, FIG. 2 shows data volume VE stored within a disk 30. It is noted that disk 30 may be implemented as a virtual disk or, in other words, a logical aggregation of physical hard disks within disk array 14. Backup server 18 creates a full backup data set 1 of volume VE on tape 32(1) while incremental backup sets 2-m are created on tapes 32(2)-32(m), respectively. A synthetic full backup set is created on tape 34 from files copied from some or all of the backup sets 1-m. All tapes 32(1)-32(m) and 34 are accessible by robotic tape handler 22.
FIG. 3 shows backup catalogs 36(1)-36(m) and 40. Catalogs 36(1)-36(m) and 40 are created by backup server 18 with the creation of backup sets 1-m and the synthetic full backup set, respectively. Catalogs 36(1)-36(m) and 40 identify the files copied to the backup sets 1-m and the synthetic full backup set, respectively. Additionally, catalogs 36(1)-36(m) and 40 directly or indirectly identify locations in tapes 32(1)-32(m) and 34, respectively, where backed up files can be found. All catalogs are stored in cache memory (not shown) of backup server 18. The backup sets and their respective catalogs including their uses are more fully described below.
The full backup set 1 is created by copying each file of data volume VE to tape 32(1) during a backup window. When the backup server 18 creates the full backup set 1, backup server 18 also creates catalog 36(1) listing the files copied to tape 32(1). As shown in FIG. 3, catalog 36(1) includes n entries corresponding to the n files, respectively, of volume VE. Each entry contains a file identification (file ID) and a file offset. The file ID, as its name implies, identifies a file backed up to tape 32(1), and the file offset identifies an offset from a starting address in tape 32(1) where the corresponding file can be found.
With the next scheduled backup window, backup server 18 creates incremental backup set 2 of data volume VE. More particularly, backup server 18 stores on tape 32(2), a copy of all files within data volume VE that were modified (e.g., written) since the creation of full backup set 1. There are many ways to identify files that have been modified since the creation of the full backup set 1. For example, each file of volume VE may have an associated meta data field that indicates the time when the file was last written or modified. During the incremental backup, these meta data time fields are traversed and the time stamps in them are compared to the time when the last backup was performed. If the time stamp in the meta data field is later than the time when the last backup was performed, the corresponding file is deemed modified and subject to backup.
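For purposes of illustration only, the time stamp comparison described above can be sketched as follows. The sketch assumes that the files of the data volume reside in an ordinary file system directory tree and that the last-modified time reported by the operating system serves as the meta data time field; the function name `modified_since` and its parameters are hypothetical.

```python
import os
from typing import Iterator


def modified_since(root: str, last_backup_time: float) -> Iterator[str]:
    """Yield paths under `root` whose last-modified time stamp is later
    than `last_backup_time` (seconds since the epoch)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # st_mtime stands in for the meta data time field.
            if os.stat(path).st_mtime > last_backup_time:
                yield path  # deemed modified and subject to backup
```

Every file must be examined even when few files have changed, so the cost of this traversal grows with the number of files in the volume rather than with the number of modified files.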
In addition to creating the incremental backup set 2 on tape 32(2), backup server 18 creates catalog 36(2) shown within FIG. 3. Catalog 36(2) identifies only the files contained within the incremental backup set 2. Indeed, all catalogs associated with the incremental backup sets contain only information on files contained within the respective incremental backup sets. Catalog 36(2) includes an entry for each file copied to tape 32(2). Like catalog 36(1), each entry of catalog 36(2) identifies a respective file and its corresponding offset from a starting address within tape 32(2). Using the offset and the starting address, the physical address of each file backed up to tape 32(2) can be calculated.
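For purposes of illustration only, a catalog of the kind described above can be sketched as follows. The class and method names are hypothetical. The sketch also assumes that the physical address is simply the sum of the starting address and the file offset; the description states only that the address can be calculated from these two values.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CatalogEntry:
    file_id: str   # identifies a file backed up to the tape
    offset: int    # offset from the tape's starting address


class Catalog:
    """Catalog for a single backup set stored on one tape."""

    def __init__(self, start_address: int) -> None:
        self.start_address = start_address
        self._entries: Dict[str, CatalogEntry] = {}

    def add(self, file_id: str, offset: int) -> None:
        self._entries[file_id] = CatalogEntry(file_id, offset)

    def contains(self, file_id: str) -> bool:
        return file_id in self._entries

    def physical_address(self, file_id: str) -> int:
        # Assumption: address = starting address + file offset.
        return self.start_address + self._entries[file_id].offset
```

A catalog for an incremental backup set would hold entries only for the files copied to that set, so `contains` returns false for any file that was not modified before that backup.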
Backup server 18 may create m−1 incremental backup sets of data volume VE. FIG. 2 illustrates the last incremental backup set m created by backup server 18 on tape 32(m) before creation of the synthetic full backup. FIG. 3 shows catalog 36(m) associated with incremental backup set m. The entries for catalog 36(m) are similar in format to those of catalogs 36(1)-36(m−1).
Backup server 18 can create the synthetic full backup of volume VE using one or more of the backup sets 1-m. The synthetic full backup is created by combining files residing in multiple prior backup sets into a backup set (i.e., the synthetic backup set) that contains the most recent version of each file of volume VE. Tape 34 shown in FIG. 2 is configured to store the synthetic full backup set created by backup server 18. It is noted that in an alternative embodiment, the full backup set and synthetic full backup set may be preferably created on disks (not shown) coupled to backup server 18. Disks are preferable since read access to hard disks is quicker during a restoration operation than read access to tape contained within robotic tape handler 22. For purposes of explanation, it will be presumed that backup sets are stored on magnetic tape media, it being understood that the present invention should not be limited thereto.
The contents of the catalogs 36(1)-36(m) determine which files of the backup sets 1-m are to be combined to create the synthetic full backup. Once the necessary files are identified, their location, with regard to which tapes 32(1)-32(m), must be determined by processing the catalogs 36(1)-36(m). It is noted that during the creation of the full or incremental backup sets, one or more files of data volume VE may have been deleted or added. However, for sake of description simplicity, it will be presumed that no files are added to or deleted from volume VE during the backup processes described above.
FIG. 4 illustrates relevant operational aspects of one embodiment for creating a synthetic full backup set using catalogs 36(1)-36(m) and backup sets 1-m. A file (file x) of volume VE to be backed up is identified. The backup server then sets variable y to m+1 and decrements y by 1 as shown in steps 54 and 56. Backup server 18 then begins a search for the most recent version of file x contained within backup sets 1-m. More particularly, backup server 18 accesses catalog 36(y) to determine whether file x is contained within incremental backup set y. It is noted that backup server 18 starts with catalog 36(y=m) because it corresponds to the most recently created incremental backup. If catalog 36(y) indicates that file x is contained within incremental backup set y, then the process proceeds to step 62 where backup server 18 copies file x from tape 32(y) to tape 34. The physical address in tape 32(y) of file x is calculated as a function of the file offset contained in catalog 36(y).
If, however, in step 60, catalog 36(y) indicates that file x is not contained in incremental backup set y then the process proceeds to step 70 where backup server 18 determines whether incremental backup set y is the first incremental created after full backup set 1. If it is, then file x contained in full backup set 1 is copied from tape 32(1) to tape 34 as shown in step 72. Backup server 18 can determine whether incremental backup set y is the first incremental created after full backup set 1 by comparing the current state of variable y to 2. If y equals 2, then incremental backup set y is the first incremental created after full backup set 1 and the process proceeds to step 72. If y does not equal 2, then incremental backup set y is not the first incremental created after full backup set 1, and the process proceeds to steps 56 and 60 where y is decremented and catalog 36(y) is checked for file x. Eventually, the most recent version of file x is found and copied to tape 34 in step 62 or 72.
In creating the synthetic full backup set 34, backup server 18 also creates a corresponding catalog 40 shown within FIG. 3. Like catalogs 36(1)-36(m), catalog 40 includes entries, each of which identifies a respective file and its corresponding offset from a starting address within tape 34. In step 64, backup server 18 creates entry x in catalog 40 corresponding to the file x copied in step 62 or step 72. Thereafter, steps 54-64 are repeated for the next file of the data volume VE. After all of the most recent versions of files 1-n have been copied to tape 34, the process has completed.
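For purposes of illustration only, the process of FIG. 4 described above can be sketched as follows. The sketch assumes a hypothetical in-memory model: `catalogs[y]` maps a file ID to its offset on tape 32(y), `tapes[y]` maps an offset to the file contents stored there, and tape 34 is modeled as a growing byte buffer. The function names are illustrative, and the step numbers in the comments refer to the steps named in the text.

```python
from typing import Dict, List, Tuple

CatalogMap = Dict[str, int]    # file ID -> offset on the tape
TapeMap = Dict[int, bytes]     # offset -> file contents (hypothetical model)


def find_latest(x: str, catalogs: Dict[int, CatalogMap], m: int) -> int:
    """Return the number of the backup set holding the newest version of x."""
    y = m + 1                      # step 54
    while True:
        y -= 1                     # step 56
        if x in catalogs[y]:       # step 60: is file x in backup set y?
            return y               # step 62 copies from tape 32(y)
        if y == 2:                 # step 70: first incremental after set 1?
            return 1               # step 72 copies from tape 32(1)
        if y <= 1:                 # guard not in FIG. 4: file is in no set
            raise KeyError(x)


def build_synthetic_full(
    file_ids: List[str],
    catalogs: Dict[int, CatalogMap],
    tapes: Dict[int, TapeMap],
    m: int,
) -> Tuple[bytes, CatalogMap]:
    tape34 = bytearray()
    catalog40: CatalogMap = {}
    for x in file_ids:                       # repeat for each file of the volume
        y = find_latest(x, catalogs, m)
        data = tapes[y][catalogs[y][x]]      # locate file x via its offset
        catalog40[x] = len(tape34)           # step 64: entry x in catalog 40
        tape34 += data                       # copy file x to tape 34
    return bytes(tape34), catalog40
```

In this sketch a file that has not been modified since the full backup causes every incremental catalog to be consulted before the copy from set 1 is made, so the number of catalog lookups can approach n multiplied by m−1 in the worst case.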
FIG. 4 shows that a substantial amount of processing is needed for backup server 18 to create the synthetic full backup set on tape 34. It can also be seen that a substantial amount of backup server 18 processing time may be needed to identify the location within backup sets 1-m of the most recent version of any particular file when that particular file needs to be restored to volume VE.