Every organization wants to be able to restore its business operations quickly after a disaster that causes complete computer system failure, and to minimize the disruption to its business when a system fails. This need became particularly urgent after the attacks on the World Trade Center in New York City, N.Y. on Sep. 11, 2001, in which many computer systems were destroyed or left out of service for extended periods of time. As a result of this disaster, many financial transactions were lost. US financial markets were closed for four (4) days until limited computer systems and business operations could be restored. Clearly, the information technology industry is in need of better methods for disaster recovery.
A. Operating System
Every computer requires system software to control and manage its operation. System software and data, often collectively known as the operating system (“OS”), are the essential elements of any system. The OS manages system hardware and software resources and provides various services to non-system software, also known as application software. The OS can generally be separated from the non-system software components. It is common practice in a large enterprise environment to store the operating system and its data on a separate storage device, often known as the OS device, system device, OS disk, system disk, root disk, or operating system environment. The OS device is typically implemented as a physical disk or a logical disk. The system device is the most vulnerable and critical component of an enterprise-class computer system.
Modern operating systems are usually implemented with a layered software design. The innermost portion of an operating system is a core that manages basic system operation together with all hardware and other software resources. It controls hardware devices and peripherals through special device files. This layer of the OS is hardware dependent and shields the rest of the software components from the underlying hardware. The OS core provides basic system functions to all other parts of the system. This hardware-dependent layer is often referred to as the kernel. The software components that control and manage hardware devices are known as device drivers. Traditionally, OS cores were designed and implemented as a monolithic program that consequently exists on an OS disk as a single binary file. Modern operating system cores, on the other hand, are often modular and dynamic in design and are divided into sets of linked modular programs. These OS core programs are usually segregated from other software in separate file directories on the OS disk.
Many modern operating systems are modular in design and are usually distributed as software modules. Each software module is a collection of files and directories that together provide a specific software function. For example, a device driver for a specific hardware device consists of a set of program files arranged within a specific file directory structure, and provides the functions needed to control and manage that device. Software modules that conform to a software industry standard known as the application binary interface (ABI) can be installed or removed with standard system management utilities. The OS of each business machine is often uniquely customized. As a result, the actual system software and data of each business system are unique with respect to its business operation, actual hardware components, and installed software packages.
If a system is destroyed, restoring the system software and data is always the first and key step of any computer recovery procedure. A system recovery procedure generally relies on backup images of the OS disk or on a replicated system device. An OS backup image consists of the complete operating system files and data, from which an identical system device can be rebuilt. If the hardware of the replacement machine is the same as that of the original system, reproducing the original system device is generally sufficient for a system recovery.
In practice, however, identical replacement hardware is often unavailable, especially at the time of a disaster. If the replacement hardware differs from the original system, a replacement machine started from a replicated system device will generally run into problems, because the originally installed hardware-dependent software components are not compatible with the new hardware. In particular, new hardware models require different device driver programs. Consequently, the hardware-dependent system files and data on the rebuilt system device need to be modified or replaced to suit the new hardware. These files and data are generally very small, typically less than one percent of the entire operating system disk image. The relevant code and data can be generated or derived using the OS installation tools provided by OS manufacturers. As long as the replacement hardware is supported by the OS, the replacement system can be installed with the same release level of the OS as the original. The resulting working OS image contains all working system files and data, including those specific to the replacement hardware. Updating the hardware-dependent system files and data on the rebuilt system device can then be accomplished with standard system administration utilities, including standard file management and software management tools.
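The adaptation step described above can be sketched as a directory overlay. This is a minimal sketch under stated assumptions, not the procedure of any particular product: it assumes that the restored image and a reference installation of the same OS release on the replacement hardware are both available as mounted directory trees, and that the hardware-dependent files live under a known list of subdirectories. The function name `overlay_hardware_files` and the directory names passed to it are hypothetical.

```python
import filecmp
import shutil
from pathlib import Path


def overlay_hardware_files(reference_root, restored_root, hw_dirs):
    """Copy hardware-dependent files from a reference install into a restored image.

    reference_root: tree installed on the replacement hardware (assumed to exist).
    restored_root:  tree rebuilt from the original machine's backup image.
    hw_dirs:        relative directories assumed to hold hardware-dependent files.
    Returns the sorted relative paths that were added or replaced.
    """
    reference_root = Path(reference_root)
    restored_root = Path(restored_root)
    changed = []
    for hw_dir in hw_dirs:
        src_dir = reference_root / hw_dir
        if not src_dir.is_dir():
            continue  # the reference install has nothing under this directory
        for src in src_dir.rglob("*"):
            if not src.is_file():
                continue
            rel = src.relative_to(reference_root)
            dst = restored_root / rel
            # Skip files that are already byte-for-byte identical.
            if dst.is_file() and filecmp.cmp(src, dst, shallow=False):
                continue
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            changed.append(rel.as_posix())
    return sorted(changed)
```

Files outside the listed directories, and files that exist only in the restored image, are left untouched, which reflects the point that only a small hardware-dependent fraction of the image needs to change.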
When a computer is first powered up, the normal facilities provided by the OS are not available. During the startup process, a few small startup programs must be executed first to load the core of the OS into system memory. Procedures for loading the OS core from the system device vary among vendors and hardware platforms. Many systems, for example, locate and load the OS core in two steps. The primary startup program is usually kept in nonvolatile memory, while the secondary startup programs are typically located in the first few blocks of the operating system disk. These startup programs are not OS programs and are usually located outside the OS file system. The secondary startup programs are nevertheless an essential part of an operating system disk.
B. Recovery Methods
In the prior art, recovery methods parallel the conventional ways in which the OS was originally installed. Typically, an OS installation is performed within the original computer itself. The computer temporarily starts from installation media, either local or remote. It first loads the installation operating system kernel and executes it in system memory. The installation kernel contains a limited subset of the system utilities provided by the normal operating system. It enables the system being installed to copy the OS software from the installation media to its own permanent disk. Similarly, the recovery processes of all prior methods are carried out within the replacement computer. The replacement computer temporarily starts from a recovery tool, typically installation media, and copies the system backup image from the backup source to its own permanent OS disk. The backup source is either local, on the computer's own backup device, or remote, on shared backup devices attached to the backup/recovery server.
(1) Local Methods and Drawbacks
Computer manufacturers usually provide a variety of backup and recovery utilities with their operating systems. These native utilities give skilled technicians simple, basic tools for performing backup and recovery operations. If a computer is damaged and its replacement is equipped with a backup device, the replacement computer can simply be rebuilt from the backup image in its directly attached backup device. Methods of this type are often known as local methods. A local method is simple, is widely used by those skilled in the field, and works well in a small environment. However, local methods pose various problems for a large corporate data center. They make poor use of backup hardware resources because they require every computer to be equipped with its own backup device. Furthermore, because they require a large number of individual backup devices, they are very difficult to manage. Examples of native backup and recovery utilities include “dd”, “dump”, and “restore” among variants of UNIX operating systems, and “XCOPY” and “NTBACKUP” among variants of Microsoft Windows operating systems. Examples of proprietary products of this type include “mksysb” of the IBM AIX operating system, “make_recovery” and “btcreate” of the HP HP-UX operating system and the Tru64 operating system, respectively, and Symantec Norton Ghost for Windows operating systems.
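At its simplest, a local method of the kind described above amounts to copying a disk image block by block between the system device and a directly attached backup device. The following is a minimal sketch in the spirit of the “dd” utility, not a reimplementation of it. The function name `copy_image` and the 1 MiB default block size are illustrative assumptions, and ordinary files stand in for the devices.

```python
import hashlib


def copy_image(src_path, dst_path, block_size=1024 * 1024):
    """Copy src_path to dst_path in fixed-size blocks.

    Returns (bytes_copied, sha256_hex) so the copy can be verified later.
    """
    digest = hashlib.sha256()
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break  # end of the source image
            dst.write(block)
            digest.update(block)
            copied += len(block)
    return copied, digest.hexdigest()
```

Returning a checksum alongside the byte count is a design choice of this sketch: it lets a technician confirm that the image on the backup device matches the source before relying on it for recovery.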
(2) Improved Networked Methods
Networked methods overcome many drawbacks of local methods. Networked methods allow the backup hardware to be shared and also allow the backup and recovery operation to be centrally managed.
Networked methods are client-server solutions. A recovery server centrally controls and manages backup and restore operations for the other computers to be protected. Networked methods rely on computer networks and networking protocols, typically the TCP/IP protocol suite. TCP stands for Transmission Control Protocol, and IP for Internet Protocol. TCP/IP is typically implemented in each computer to allow data communication among the computers in a network.
If an OS disk has failed, a networked recovery requires special preparation. Because the recovery image is located in centralized backup storage on the server computer, a fresh replacement computer needs to start not only with a basic operating system, but also with a proper network connection and configuration and with the networked recovery utility. Furthermore, the replacement computer must be granted the proper security permission to communicate with the server computers. The OS disk can then be rebuilt by copying the recovery image from the server over the network.
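The prerequisites listed above can be summarized as a preflight checklist. The sketch below is purely illustrative: the class `RecoveryClientState`, its field names, and the function `missing_prerequisites` are hypothetical and do not correspond to any named product.

```python
from dataclasses import dataclass


@dataclass
class RecoveryClientState:
    """Hypothetical snapshot of a replacement computer before a networked recovery."""
    basic_os_running: bool
    network_configured: bool
    recovery_utility_present: bool
    server_permission_granted: bool


def missing_prerequisites(state):
    """Return the names of the prerequisites that are not yet satisfied."""
    checks = [
        ("basic operating system", state.basic_os_running),
        ("network connection and configuration", state.network_configured),
        ("networked recovery utility", state.recovery_utility_present),
        ("security permission on the server", state.server_permission_granted),
    ]
    return [name for name, ok in checks if not ok]
```

An empty list means the replacement computer is ready to copy the recovery image from the server; any entry in the list identifies preparation work that a local method would not have required.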
It is not uncommon for a large data center to be equipped with thousands of business computers, each typically holding several gigabytes of system software and data. If only a few computers need to be recovered, recovery speed may not be an issue, since it is usually determined by the bandwidth of the network. In a mass recovery operation, however, thousands of recovery client computers flood the server with data requests and place more CPU load on it than it can handle. In that case TCP/IP becomes a serious performance bottleneck for networked methods, in addition to the problem of limited network bandwidth, because networked solutions rely on TCP/IP to transfer data from the server to its recovery clients. TCP/IP was not designed to handle data transfers of this size efficiently.
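The scale of the problem can be made concrete with a back-of-envelope estimate. The sketch below is illustrative only: the client count, image size, link speed, and the `efficiency` factor (the fraction of raw bandwidth assumed to remain usable after protocol and server overhead) are hypothetical inputs, not measurements from any real data center.

```python
def mass_recovery_hours(num_clients, image_gb, link_gbps, efficiency=1.0):
    """Estimate the hours needed to send one OS image to every client over a shared link.

    image_gb is in gigabytes (10**9 bytes); link_gbps is in gigabits per second.
    efficiency (0 < efficiency <= 1) scales the raw link rate to an assumed usable rate.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    total_gigabits = num_clients * image_gb * 8  # 8 bits per byte
    seconds = total_gigabits / (link_gbps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical scenario: 2,000 clients, 5 GB each, one shared 1 Gb/s server link.
    print(f"{mass_recovery_hours(2000, 5, 1.0):.1f} hours at full line rate")
    print(f"{mass_recovery_hours(2000, 5, 1.0, efficiency=0.5):.1f} hours at 50% efficiency")
```

Under these assumed figures the transfer alone takes roughly 22 hours even at full line rate, and about twice that if only half of the bandwidth is usable, so a same-day recovery target leaves little margin.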
Examples of networked products for system recovery include NetWorker Recovery Manager™ from EMC Corp., Tivoli Disaster Recovery Manager™ from IBM, Bare Metal Restore™ from Veritas Software, and Ghost Corporate Edition™ from Symantec. Examples of networked backup and recovery methods also include U.S. Pat. Nos. 5,212,772; 5,133,065; and 5,799,147.
As noted above, local methods suffer from poor manageability and inefficient use of hardware. Networked methods overcome these drawbacks by using TCP/IP and the computer network. However, TCP/IP and the computer network themselves become serious limiting factors that prevent effective mass recovery in the case of a major disaster. Since Sep. 11, 2001 in particular, same-day recovery with no data loss has become a business requirement. There is therefore an urgent need for more effective and faster methods of computer system disaster recovery.