Server computer systems are generally powerful computer systems configured for functioning in a network environment that is shared by multiple users. Servers come in all sizes from x86-based PCs to IBM mainframes. A server may have a keyboard, monitor and mouse directly attached, or one keyboard, monitor and mouse may connect to any number of servers (e.g., via a KVM switch). Servers may also be primarily accessed remotely, for example, through a network connection. Example server implementations include application servers, database servers, mail servers, transaction servers, web servers, and the like.
A server operating system, or operating environment, comprises a primary mechanism through which servers implement their functionality. An operating system typically refers to a “master control program” that runs the computer system. The operating system is typically the first program loaded when the computer is turned on. In general, the main portion of the operating system, the “kernel,” resides in memory at all times. The operating system sets the standards for all application programs that run in the server. The applications “talk to” the operating system for all user interface and file management operations. Also called an “executive” or “supervisor,” a server operating system ensures multitasking functionality, whereby multiple programs are executed within the computer system at the same time. The number of programs that can be effectively multitasked depends on the type of multitasking performed (preemptive vs. cooperative), CPU speed, and memory and disk capacity.
Generally, many programs can be run simultaneously in the computer because of the differences between I/O and processing speed. For example, while one program is waiting for input, instructions in another can be executed. During the milliseconds one program waits for data to be read from a disk, millions of instructions in another program can be executed. In interactive programs, thousands of instructions can be executed between each keystroke on the keyboard. In large computers, multiple I/O channels also allow for simultaneous I/O operations to take place. Multiple streams of data are being read and written at the exact same time. In the days of mainframes only, multitasking was called “multiprogramming,” and multitasking meant “multithreading.” The primary operating systems in use are the many versions of Windows (95, 98, NT, ME, 2000, XP), the many versions of UNIX (Solaris, Linux, etc.), the Macintosh OS, IBM mainframe OS/390 and the AS/400's OS/400.
A problem exists in that all server operating system environments are vulnerable to failures that can cause data loss or corruption. Commercial server operating system environments (e.g., Solaris, UNIX, IRIX, etc.) typically provide a mechanism by which data may be restored in the event of a failure (e.g., during hardware/software upgrades, maintenance, etc.). Most commercial server operating system environments restore data (e.g., by means of ghost management software) by utilizing the most recent backup copy of the operating system environment in conjunction with a transaction log. In order to accomplish data restoration using a traditional rollback mechanism, the ghost management software typically utilizes backups of a full image of the operating system environment. A transaction log is a file that records all changes to user and system data since the last full image database backup. The transaction log captures the state of the operating system environment before and after changes are made.
Many prior art operating system environment reconstruction techniques restore the server operating system environment by using the most recent full image backup, or ghost. The server operating system environment is then “rolled forward” to a point in time, just prior to the time of the failure, by reapplying every transaction from each transaction log backup file saved since the last full image server operating system environment backup. This procedure effectively restores the server operating system environment to the state in which it existed just prior to the server operating system environment failure. One drawback of a roll-forward server operating system environment reconstruction is that all transactions that were started, and possibly completed, after the time of the failure are lost.
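The restore and roll forward procedure described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the operating system environment is modeled as a mapping from file paths to contents and that each transaction log entry carries a timestamp together with the before and after state of one path. The names `LogEntry` and `restore_and_roll_forward` are hypothetical and do not correspond to any particular product.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LogEntry:
    """One hypothetical transaction log record."""
    timestamp: float
    path: str
    before: Optional[str]  # state before the change (None = path did not exist)
    after: Optional[str]   # state after the change (None = path was deleted)


def restore_and_roll_forward(
    full_image: Dict[str, str],
    log: List[LogEntry],
    failure_time: float,
) -> Dict[str, str]:
    """Rebuild the environment from the last full image backup, then
    reapply every logged change made strictly before failure_time."""
    env = dict(full_image)  # start from a copy of the full image ("ghost")
    for entry in sorted(log, key=lambda e: e.timestamp):
        if entry.timestamp >= failure_time:
            break  # changes at or after the failure are not reapplied
        if entry.after is None:
            env.pop(entry.path, None)  # the change was a deletion
        else:
            env[entry.path] = entry.after
    return env


image = {"/etc/hosts": "v1", "/etc/motd": "hello"}
log = [
    LogEntry(1.0, "/etc/hosts", "v1", "v2"),
    LogEntry(2.0, "/etc/motd", "hello", None),
    LogEntry(5.0, "/etc/hosts", "v2", "v3"),  # logged after the failure
]
restored = restore_and_roll_forward(image, log, failure_time=3.0)
```

In this sketch the entry logged at time 5.0 is not reapplied, which mirrors the drawback noted above: changes made after the time of the failure are not recovered.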
These traditional backup and recovery techniques are designed to protect data from hardware and media failures, such as disk crashes. In a typical server operating system environment, however, a more likely cause of data corruption is a user or application error. An incorrectly timed user program context switch or an application software error may inappropriately delete or modify data in the server operating system environment. Traditional restore and roll forward mechanisms are sub-optimal and inefficient for recovering data from such user or application errors.
Additionally, many instances of data corruption are caused by incorrectly or partially applied operating system updates or changes. An incorrectly devised software upgrade to one or more files of the operating system may result in software that inappropriately deletes or modifies data in the server operating system environment. Traditional restore and roll forward mechanisms are inadequate for ensuring that upgrades are properly tested and checked out and, alternatively, for rolling back to the previous version in case of problems.
Furthermore, when the operating system environment is cloned, or otherwise imaged to create the ghost, the opportunity exists for data corruption between the time the current operating system environment is cloned and the time the cloned operating environment is booted upon a restart. The server applications must remain “live” as long as possible. Preferably, upgrades and maintenance are performed on a “hot swap” basis. Data may change on the current operating system environment before the cloned environment is activated, and those changes may not be reflected in the cloned environment. Also, in those cases where the cloned environment is brought live, changes may be made to the current environment, the cloned environment, or both, resulting in an inconsistent software state that confuses any traditional restore and roll forward mechanism.
Although some server operating system environments (e.g., Microsoft Windows NT Server, Sun Solaris, etc.) provide a recovery feature in which a “restore and roll forward” may be performed, and although such features mitigate to some extent the inefficiency of the prior art restore and roll forward methods described above, they are still not sufficient. Such methods cannot function reliably under the demanding high availability requirements of many server operating system environment applications.
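The divergence problem described above can be made concrete with a short sketch. It assumes, for illustration only, that a baseline of per-file content digests was recorded at the moment of cloning and that the same kind of digest map can be computed later for both the current and the cloned environment. The function name `classify_drift` and the digest maps are hypothetical.

```python
from typing import Dict


def classify_drift(
    baseline: Dict[str, str],
    current: Dict[str, str],
    clone: Dict[str, str],
) -> Dict[str, str]:
    """Report, for each path, which environment changed since cloning.

    Each argument maps a file path to a content digest. A path missing
    from a map is treated as absent from that environment.
    """
    drift = {}
    for path in sorted(set(baseline) | set(current) | set(clone)):
        base = baseline.get(path)
        current_changed = current.get(path) != base
        clone_changed = clone.get(path) != base
        if current_changed and clone_changed:
            drift[path] = "both"     # modified on both sides since cloning
        elif current_changed:
            drift[path] = "current"  # change not reflected in the clone
        elif clone_changed:
            drift[path] = "clone"    # change made only in the clone
    return drift


baseline = {"a": "1", "b": "1", "c": "1", "d": "1"}
current = {"a": "1", "b": "2", "c": "1", "d": "3"}
clone = {"a": "1", "b": "1", "c": "9", "d": "4", "e": "new"}
drift = classify_drift(baseline, current, clone)
```

Paths reported as "current" correspond to changes that were made before the cloned environment was activated and are missing from it. Paths reported as "both" are the cases in which neither environment can simply be taken as authoritative.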
Thus a need exists for a method and apparatus for restoring a server operating system environment that mitigates the inefficiencies of traditional roll forward techniques, while simultaneously allowing recovery of any intervening changes which may be implemented during any backup and restore process.