1. Field of the Invention
The present invention relates generally to the field of data replication and server recovery techniques for computer operating systems, and in particular, to an apparatus and method providing substantially real-time back-up of an entire server including system state information as well as data.
2. Background Art
A network is a collection of computers connected to each other by various mechanisms in order to share programs, data, and peripherals among computer users. Data on such systems are periodically copied to secondary “backup” media for numerous reasons, including computer failure or power shortage that may damage or destroy some or all of the data stored on the system.
A standard approach to backing up data is to perform “full backups” of files on the system on a periodic basis. This means copying the data stored on a given computer to a backup storage device. A backup storage device usually, but not always, supports removable high-capacity media (such as Digital Audio Tape or Streaming Tape). Between full backups, incremental backups are performed by copying only the files that have changed since the last backup (full or incremental) to a backup storage device. This reduces the amount of backup storage space required as files that have not changed will not be copied on each incremental backup. Incremental backups also provide an up-to-date backup of the files, when used in conjunction with the full backup.
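By way of illustration only, the incremental step described above can be sketched as follows. The function name `incremental_backup`, its parameters, and the use of file modification times to detect changed files are assumptions made for this example; they are not taken from any particular backup product.

```python
import os
import shutil


def incremental_backup(source_dir, backup_dir, last_backup_time):
    """Copy files modified after last_backup_time (seconds since the epoch).

    Returns the sorted relative paths of the files that were copied.
    Assumption: a file's modification time reliably indicates a change.
    """
    copied = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            src = os.path.join(root, name)
            # Skip files that have not changed since the previous backup.
            if os.path.getmtime(src) <= last_backup_time:
                continue
            rel = os.path.relpath(src, source_dir)
            dst = os.path.join(backup_dir, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)  # copies contents and timestamps
            copied.append(rel)
    return sorted(copied)
```

A full backup corresponds to calling the same function with `last_backup_time` set to 0, so that every file qualifies.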
One problem with this technique is that the data stored to the backup media is only valid at the exact time the backup is performed. Any changes made after one incremental backup, but before the next, would be lost if there were a failure on the file storage media associated with the computer. Moreover, since the backup process on a large system can take several hours or days to complete, files backed up to the beginning of a tape may have been modified by the time the backup completes.
Another disadvantage of this approach is that with most systems, all files to be copied to backup storage media must be closed before a backup can be performed, which means that all network users must log off the system during the backup process. If files remain open during the backup process, the integrity of the backup data is jeopardized. On a network with hundreds or thousands of users, this can be a time-consuming process. In organizations that require full-time operation of a computer network, this approach is unlikely to be feasible.
To address the problem of backing up open files, techniques have been developed to ensure that no changes are made to a file while it is being backed up. While a file is being copied to backup storage media, the original contents of the data to be overwritten are stored in a “pre-image cache”, which is a disk file allocated specifically for this purpose. Reads from a backup program are redirected to the pre-image cache if the requested data has been overwritten. Otherwise, the backup read is directed to the original file on disk. Related files on a disk can be “grouped”, so that changes to all files in the group are cached using the technique described above, whenever any one file in the group is being backed up. One problem with this approach is that the resulting backup is still only valid until a change is made to any one of the files on the system.
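The redirection of backup reads described above may be easier to follow with a small model. The following is a minimal sketch that assumes a file can be treated as a list of fixed blocks held in memory; the class name `PreImageSnapshot` and its methods are hypothetical and do not describe any specific product.

```python
class PreImageSnapshot:
    """Point-in-time view of a block list while live writes continue."""

    def __init__(self, blocks):
        self.blocks = blocks   # live data, modified in place by writers
        self.pre_image = {}    # block index -> contents when backup began

    def write(self, index, data):
        # Save the original contents only the first time a block is
        # overwritten, so the cache always holds the backup-time value.
        if index not in self.pre_image:
            self.pre_image[index] = self.blocks[index]
        self.blocks[index] = data

    def backup_read(self, index):
        # Overwritten blocks are served from the pre-image cache;
        # untouched blocks are read from the live data.
        return self.pre_image.get(index, self.blocks[index])
```

In this model the backup program sees the file exactly as it was when the backup started, even though ordinary writers continue to change the live blocks.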
More recently, several approaches have been developed to back up the data on a computer system in real-time, meaning the data is backed up whenever it is changed. In such known methods, a full backup of the primary storage media is made to a backup media, then incremental backups of changed data are made whenever a change is made to the primary storage media. Since changes are written immediately to the backup media, the backup media always has an updated copy of the data on the primary media. A second hard disk (or other non-volatile storage media) that is comparable in size and configuration is required for this method.
One such approach is to perform “disk mirroring.” In this approach, a full backup of a disk is made to a second disk attached to the same central processing unit. Whenever changes are made to the first disk, they are mirrored on the second disk. This approach provides a “hot-backup” of the first disk, meaning that if a failure occurs on the first disk, processing can be switched to the second with little or no interruption of service. A disadvantage of this approach is that a separate hard disk is required for each disk to be backed up, doubling the disk requirements for a system. The secondary disk must be at least as large as the primary disk, and the disks must be configured with identical volume mapping. Any extra space on the secondary disk is unavailable. Also, in many cases errors that render the primary disk inoperable affect the mirrored disk as well.
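As a rough illustration of disk mirroring, the sketch below models each disk as an in-memory list of blocks. The class name `MirroredDisk` and the `primary_failed` flag are assumptions made for the example; real mirroring is performed by a disk controller or the operating system.

```python
class MirroredDisk:
    """Two equally sized block lists that receive every write."""

    def __init__(self, num_blocks):
        self.primary = [b""] * num_blocks
        self.secondary = [b""] * num_blocks
        self.primary_failed = False

    def write(self, index, data):
        # Every write goes to both disks, so the secondary is a hot backup.
        if not self.primary_failed:
            self.primary[index] = data
        self.secondary[index] = data

    def read(self, index):
        # Reads switch to the secondary once the primary has failed.
        disk = self.secondary if self.primary_failed else self.primary
        return disk[index]
```

The sketch also shows why mirroring does not protect against bad data: an erroneous write is copied to the secondary just as faithfully as a correct one.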
Some existing systems have the capability to mirror transactions across a network. All disk I/O and memory operations are forwarded from a file server to a target server, where they are performed in parallel on each server. This includes reads as well as writes. If a failure occurs on the source server, operation can be shifted to the target server. Both the source and target servers must be running the same software in this backup configuration, and a proprietary high-speed link is recommended to connect the two servers. A disadvantage of this approach is that since all operations are mirrored to both servers, errors on the primary server are often mirrored to the secondary server. Local storage on both the source and target servers must be similarly configured.
This network mirroring capability can be used to provide a mechanism to quickly switch from the source server to the target server in the event of a failure. Communication between the source and target servers is typically accomplished via a dedicated, proprietary interface. While the source and target servers do not have to be identical, identical partitions are required on the local file system of each server.
Most disaster recovery procedures require that a periodic backup of the system be stored “off-site”, at a location other than where the network is being operated. This protects the backup data in the event of a fire or other natural disaster at the primary operating location, in which all data and computing facilities are destroyed. Baseline and incremental techniques can be used to perform such a backup to removable media, as described above. A disadvantage of the “mirroring” approaches to real-time backup is that the target server or disk cannot be backed up reliably while mirroring is being performed. If a file is open on the target server or disk, as a result of a mirroring operation, it cannot be backed up to a separate backup storage device. The result of this limitation is that all users have to be logged off the system before such a backup can take place.
Standard file replication techniques are unable to back up system state information, because the system state is normally stored in locked, open files, which prevents mirroring. Thus, one approach is to use operating system specific mechanisms to monitor specific system files and functions. For example, the Windows® API may be used to monitor changes to the registry and replicate those changes to a backup server. The APIs that Microsoft provides for this type of monitoring in Windows® simply notify a program that a key has changed. It is then necessary to either (a) replicate all of the values and subkeys under that key or (b) maintain a separate copy of all the keys, subkeys and values and determine what specifically has changed. Thus, this approach results either in additional processing that could significantly burden the CPU or in the transmission of unnecessary data. It also fails to capture additional system state changes outside of the registry.
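The cost of option (b) can be seen in a simple model. The sketch below assumes the registry is represented as nested dictionaries, in which a dictionary is a key or subkey and any other object is a value, and assumes a given name is either a subkey or a value in both copies. The function name `diff_keys` is hypothetical, and no actual Windows® API is called.

```python
def diff_keys(old, new, path=""):
    """Compare a stored copy (old) with the current tree (new).

    Returns a list of (path, old_value, new_value) tuples for every value
    that was added, removed, or changed. A missing value is reported as None.
    """
    changes = []
    for name in sorted(set(old) | set(new)):
        child = f"{path}\\{name}" if path else name
        a, b = old.get(name), new.get(name)
        if isinstance(a, dict) or isinstance(b, dict):
            # Recurse into subkeys; a subkey absent on one side is
            # treated as empty on that side.
            changes += diff_keys(a if isinstance(a, dict) else {},
                                 b if isinstance(b, dict) else {},
                                 child)
        elif a != b:
            changes.append((child, a, b))
    return changes
```

Because the notification identifies only the key that changed, the comparison must walk every value and subkey beneath that key, which is the additional processing referred to above.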
The foregoing approaches introduce some degree of fault-tolerance to the computer system, since a failure on the primary storage media or computer can be tolerated by switching to the secondary storage media or computer. A disadvantage common to all of these techniques is that there is a one-to-one relationship between the primary and secondary storage media, thereby doubling the hardware resources required to implement mirroring. Even if only a small number of data files on a server are considered critical enough to require real-time replication, a separate, equivalent copy of the server or hard disk is still necessary. If critical files exist on several computers throughout the network, mirroring mechanisms must be maintained at each computer. None of these approaches provides a method for mirroring between multiple computers. Further, none of these approaches allows for saving the entire state of a computer, including both configuration information and data. As a result, if a server failure requires that the server be rebuilt (i.e., reconfigured from a bare installation), that rebuild must still be done manually. Even if some configurations are stored and can be pushed onto a bare server, additional manual configuration will nearly always be required in order to restore the server to its state prior to failure. Furthermore, current approaches do not account for situations where different hardware is present on the server to which the failover is being directed.
Thus, there remains a need in the art for improvements in real-time back-up of a server.