Every organization desires an ability to quickly restore its business operations from a disaster that causes complete computer system failure and to minimize the disruption to its business in case of a system failure. Such needs have become particularly urgent after the attacks on the World Trade Center in New York City, New York on Sep. 11, 2001, during which many computer systems were destroyed and were out of service for extended periods of time. Many financial transactions were also lost as a result of this disaster. U.S. financial markets were closed for four (4) days until limited computer systems and business operations could be restored. Clearly, the information technology industry is in need of better methods for disaster recovery.
A. Operating System
Every computer requires system software to control and manage its operation. System software and data, often known collectively as the operating system (“OS”), are essential and basic software components. It is common practice in a large enterprise environment to save the operating system and data on a separate disk, known as the OS disk, system disk, or root disk. The OS disk is the most vulnerable and critical component of an enterprise-class computer system.
Many operating systems are compliant with a software industry standard often known as the application binary interface (“ABI”). ABI-compliant operating system products are usually distributed as software modules. Each software module can be installed or removed with standard system management utilities. For each business machine, the OS is often uniquely customized. As a result, the actual system data of each business system is unique with respect to its business operation, its hardware components, and whatever software packages have been installed. If two computers are identical, their OS disk images must be identical. Therefore, restoring system software and data is always the first and the key step of any computer recovery procedure.
B. Recovery Methods
In the prior art, recovery methods parallel the conventional ways in which the OS was originally installed. Typically, OS installation is initially performed within the original computer itself. Usually, the computer temporarily starts up from an OS installation tool, either locally or remotely, and then copies the OS software from the installation media to its own permanent disk. Similarly, recovery processes in all prior methods are implemented within the replacement computer. The replacement computer copies the system backup image from the backup source, either locally from its own backup device or remotely from shared backup devices on the server, to its own permanent OS disk.
OS installation tools typically utilize CD-ROM, DVD, tape, or floppy disks that are often directly attached to the computer in which they are to be installed. OS installation tools are often provided by computer or OS manufacturers. Many manufacturers also provide enterprise software tools for a large scale system installation. Examples of this type of solution include Network Installation Management™ (NIM) from IBM for their AIX™, Ignite-UX™ from Hewlett-Packard for their HP-UX™, JumpStart™ from Sun Microsystems for their Solaris™, and KickStart™ for RedHat™ Linux™. OS installation tools have been widely used in system recovery.
Conventionally, a system recovery is done by (a) starting up the replacement computer from an OS installation tool, (b) reinstalling the OS, (c) updating the OS when necessary to match the original system, (d) reinstalling the backup software, and (e) then restoring files that have been customized and modified. This reinstall-then-restore approach is often error-prone and time-consuming. In practice, a direct restore approach is more desirable.
(1) Local Methods and Drawbacks
Computer manufacturers usually provide a variety of backup and recovery utilities associated with their operating systems. These native backup and recovery utilities provide simple and basic tools that allow people having skills in the field to perform backup and recovery operations. If a computer is damaged and its replacement is equipped with a backup device, the replacement computer can simply be rebuilt from the backup image in its directly attached backup device. Methods of this type are often known as local methods. A local method is a simple recovery method widely used by those skilled in the field, and it works well for a small environment. However, local methods pose various problems for a large corporate data center. Local methods poorly utilize backup hardware resources because they require that every computer be equipped with its own backup device. Furthermore, because they require a large number of individual backup devices, they are very difficult to manage. Examples of native backup and recovery utilities include “dd”, “dump”, and “restore” among variants of UNIX operating systems, and “XCOPY” and “NTBACKUP” among variants of Microsoft Windows operating systems. Examples of proprietary products of this type include “mksysb” of the IBM AIX operating system, “make_recovery” of the Hewlett-Packard HP-UX operating system, “btcreate” of the Tru64 operating system, and Symantec Norton Ghost for Windows operating systems.
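By way of illustration only, the core of a local method, namely copying a backup image from a directly attached device onto the replacement OS disk, might be sketched as follows. The function name restore_image, its parameters, and the checksum step are hypothetical and are not taken from any of the utilities named above; the sketch merely mirrors the block-by-block copy that a tool such as “dd” performs.

```python
import hashlib


def restore_image(image_path, target_path, block_size=1024 * 1024):
    """Copy a backup image to a target block by block.

    Hypothetical helper: image_path stands in for the locally attached
    backup device and target_path for the replacement OS disk.
    Returns (bytes_copied, sha256_hex_of_the_data_written).
    """
    digest = hashlib.sha256()
    copied = 0
    with open(image_path, "rb") as src, open(target_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:  # end of the backup image
                break
            dst.write(block)
            digest.update(block)
            copied += len(block)
    return copied, digest.hexdigest()
```

The sketch also makes the drawback visible: both paths must be reachable from the same machine, so every protected computer needs its own backup device.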
(2) Improved Networked Methods
Networked methods overcome many of the drawbacks of local methods. Networked methods allow backup hardware to be shared and also allow the backup and recovery operation to be centrally managed.
Networked methods are client-server solutions. A recovery server centrally controls and manages backup and restore operations for the other computers to be protected. Networked methods utilize computer networks and networking protocols, typically the TCP/IP protocol suite. TCP stands for Transmission Control Protocol, and IP for Internet Protocol. TCP/IP is typically implemented in each computer to allow data communication among computers in a computer network.
In the case of a failed OS disk, special preparation is required for a recovery process by networked methods. Because the recovery image is located in the centralized backup storage on the server computer, a fresh replacement computer needs to start up not only with a basic operating system, but also with a proper network connection and configuration and with the networked recovery utility. Furthermore, the replacement computer requires proper security permissions to communicate with the server computers. The OS disk can then be rebuilt by copying the recovery image from the server over a network.
It is not uncommon for a large data center to be equipped with thousands of business computers. Each system is typically equipped with several gigabytes of system software and data. If only a few computers need to be recovered, recovery speed may not be an issue, since it is usually determined by the bandwidth of the network. In a mass recovery operation, however, thousands of recovery client computers flood the server with data requests and place more CPU load on it than it can handle. In that case, TCP/IP becomes a serious performance bottleneck for networked methods, in addition to the problem of limited network bandwidth. This is because networked solutions rely on TCP/IP to transfer data from the server to its recovery clients, and TCP/IP was not designed to handle such a large amount of data transfer efficiently.
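The scale of the problem can be illustrated with a rough calculation. The following sketch assumes, purely for illustration, that every client pulls one full image through a single shared server link; the function name mass_recovery_hours and the example figures (2,000 clients, 4 gigabytes per image, a 1 gigabit-per-second link) are assumptions and are not measurements. The result is a lower bound that ignores the protocol and server CPU overhead discussed above.

```python
def mass_recovery_hours(num_clients, image_gb, server_link_gbps, efficiency=1.0):
    """Lower bound, in hours, to send one image per client over a shared link.

    All parameters are illustrative assumptions:
      num_clients      -- number of computers being recovered
      image_gb         -- size of each recovery image, in decimal gigabytes
      server_link_gbps -- capacity of the server's network link, in Gbit/s
      efficiency       -- fraction of the link actually usable (0 < e <= 1)
    """
    total_bits = num_clients * image_gb * 8e9          # gigabytes to bits
    usable_bits_per_second = server_link_gbps * 1e9 * efficiency
    return total_bits / usable_bits_per_second / 3600  # seconds to hours


if __name__ == "__main__":
    # Assumed figures: a handful of machines versus a mass recovery.
    few = mass_recovery_hours(5, 4, 1.0)
    many = mass_recovery_hours(2000, 4, 1.0)
    print(f"5 clients: {few:.2f} h; 2000 clients: {many:.1f} h")
```

Under these assumed figures, recovering five computers takes a few minutes, whereas recovering 2,000 takes roughly 18 hours even at full link utilization, which already falls short of a same-day recovery goal once real-world overhead is added.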
Examples of networked products for system recovery include NetWorker Recovery Manager™ from EMC Corp., Tivoli Disaster Recovery Manager™ and Network Installation Management™ from IBM, Bare Metal Restore™ from Veritas Software, and Ghost Corporate Edition™ from Symantec. Examples of networked backup recovery methods also include U.S. Pat. Nos. 5,212,772; 5,133,065; and 5,799,147.
As noted above, local methods all suffer from poor manageability and inefficient use of hardware. Networked methods overcome these drawbacks by utilizing TCP/IP and computer networks. However, TCP/IP and computer networks become serious limiting factors that prevent an effective mass recovery operation in the case of a major disaster. Especially since Sep. 11, 2001, same-day recovery and no data loss have become business requirements. There is an urgent need for more effective and quicker methods for computer system disaster recovery.