Computer systems occasionally crash. A “system crash” is an event in which the computer quits operating the way it is supposed to operate. Common causes of system crashes include power outages, application operating errors, and computer goblins (i.e., unknown and often unexplained malfunctions that tend to plague even the best-devised systems and applications). System crashes are unpredictable and hence essentially impossible to anticipate and prevent.
A system crash is at the very least annoying, and may result in serious or irreparable damage. For standalone computers or client workstations, a local system crash typically results in loss of work product since the last save interval. The user is inconvenienced by having to reboot the computer and redo the lost work. For servers and larger computer systems, a system crash can have a devastating impact on many users, including both the company's employees and its customers.
Being unable to prevent system crashes, computer system designers attempt to limit the effect of system crashes. The field of study concerning how computers recover from system crashes is known as “recovery.” Recovery from system crashes has been the subject of much research and development.
Current database systems support fault-tolerance and high availability by recovering quickly from system failures. In general, the goal of redo recovery is to return the computer system after a crash to a previous and presumed correct state in which the computer system was operating immediately prior to the crash. Then, transactions whose continuations are impossible can be aborted.
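The idea of redo recovery described above can be illustrated with a minimal sketch. It assumes a hypothetical log of key/value updates, each tagged with an increasing log sequence number (LSN), and a stable state known to reflect all updates up to some LSN. The names `LogRecord`, `redo_recover`, and `stable_lsn` are illustrative assumptions and are not taken from any system discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical log entry: one update, tagged with a log sequence number."""
    lsn: int
    key: str
    value: str


def redo_recover(stable_state, stable_lsn, log):
    """Replay logged updates newer than stable_lsn onto a copy of stable_state.

    Returns the recovered state and the LSN of the last update applied.
    Updates at or below stable_lsn are assumed to be in stable_state already.
    """
    state = dict(stable_state)  # never mutate the caller's copy
    last = stable_lsn
    for rec in sorted(log, key=lambda r: r.lsn):
        if rec.lsn > last:
            state[rec.key] = rec.value
            last = rec.lsn
    return state, last


if __name__ == "__main__":
    log = [LogRecord(1, "a", "1"), LogRecord(2, "a", "2"), LogRecord(3, "b", "30")]
    print(redo_recover({"a": "1"}, 1, log))
```

Because records at or below the recorded LSN are skipped, running the replay a second time over the same log leaves the state unchanged, which is the property that lets recovery itself be interrupted and restarted.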
Much of the recovery research focuses on database recovery for database computer systems, such as network database servers or mainframe database systems. Imagine the problems caused when a large database system having many clients crashes in the midst of many simultaneous operations involving the retrieval, update, and storage of data records. Database system designers attempt to design database recovery techniques that minimize the amount of data lost in a system crash, minimize the amount of work needed following the crash to recover to the pre-crash operating state, and minimize the performance impact of recovery on the database system during normal operation.
While database recovery techniques are helpful for recovering data, the techniques offer no help in recovering applications that are interacting with the database at the time of failure. Currently, such applications either fail, resulting in an application outage, or, if they survive the database crash, are forced to cope with the database failure themselves. The former compromises application availability and can increase operational complexity. The latter either severely restricts application flexibility or increases its complexity.
When an application fails because of a database system crash, organizations responsible for the application need to quickly bring the application back on line. In the enterprise-computing world, time is quite literally money. Database recovery ensures that the database state is consistent. However, an application retaining state across database transactions can have consistency requirements that are not captured at the database transaction boundary. Furthermore, parts of the application state may be lost during a crash. Restoring and continuing application execution is all too frequently a very complex and time-consuming operational problem.
In some system configurations, an application can survive a database system crash. This is the case, for example, when the application executes on a client machine while the database is on a separate server. This arrangement permits the application to include logic to deal with database crashes and hence avoid an application outage. However, handling errors or exceptions is a very difficult part of getting applications right. Dealing with database system failures at the application level is tedious and error-prone, even when the application itself stays alive.
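To make the burden concrete, the following is a minimal sketch of the kind of logic an application would have to wrap around its database work in order to survive a server crash. Everything here is a hypothetical illustration: `DatabaseUnavailable`, `run_with_retry`, and the `connect` and `work` callables are assumed names, not part of any real database API.

```python
import time


class DatabaseUnavailable(Exception):
    """Hypothetical error raised when the server has crashed or the session is lost."""


def run_with_retry(connect, work, attempts=3, delay_s=0.0):
    """Run work(conn); on a database failure, reconnect and try again.

    connect: hypothetical callable returning a fresh connection.
    work:    hypothetical callable taking a connection; it must be safe to
             re-execute, since it may run more than once.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            conn = connect()  # any session state must be rebuilt on each attempt
            return work(conn)
        except DatabaseUnavailable as err:
            last_error = err
            time.sleep(delay_s)  # give the server time to recover
    raise last_error
```

Even this small wrapper pushes obligations onto the application: each unit of work must tolerate being executed again, and any state held across transactions has to be re-established by hand after reconnecting.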
There has been some work in this area. One technique exploits logging and recovery techniques to enable applications to be recoverable. See, e.g., Lomet, D. Application recovery using generalized redo recovery. Int'l. Conference on Data Engineering, Orlando, Fla. (February 1998); and Lomet, D. and Tuttle, M. Redo recovery from system crashes. VLDB Conference, Zurich, Switzerland (September 1995) 457-468. The focus of this work has been to minimize the impact of providing recovery on the normal operation of the system. In practice, this means minimizing the amount of logging and application checkpointing required. Sometimes, it means making the application an object that can be managed by the database recovery manager.
Other prior work on application fault-tolerance in distributed systems is based on some form of application “installation points” and/or “message logging”. The prior work can be categorized into the following three approaches, all of which incur high normal operation and/or recovery costs: (1) fault-tolerant process pairs, (2) distributed state tracking, and (3) persistent queues.
Another client-server system directed to application recovery is described in U.S. patent application Ser. No. 09/033,511, entitled “Client-Server Computer System With Application Recovery of Server Applications and Client Applications”, which was filed Mar. 2, 1998 in the names of David B. Lomet (an inventor in this invention) and Gerhard Weikum. This application is assigned to Microsoft Corporation.
Despite these efforts, there remains a need to improve application recovery techniques in client-server database systems. In particular, there is a need to provide application recovery at modest system implementation cost and without modification to the application itself.