Computer systems occasionally crash. A "system crash" is an event in which the computer quits operating the way it is supposed to operate. Common causes of system crashes include power outage, application operating error, and computer goblins (i.e., unknown and often unexplained malfunctions that tend to plague even the best-devised systems and applications). System crashes are unpredictable, and hence, essentially impossible to anticipate and prevent.
A system crash is at the very least annoying, and may result in serious or irreparable damage. For standalone computers or client workstations, a local system crash typically results in loss of work product since the last save interval. The user is inconvenienced by having to reboot the computer and redo the lost work. For servers and larger computer systems, a system crash can have a devastating impact on many users, including both company employees and customers.
Being unable to prevent system crashes, computer system designers attempt to limit the effect of system crashes. The field of study concerning how computers recover from system crashes is known as "recovery." Recovery from system crashes has been the subject of much research and development.
In general, the goal of redo recovery is to return the computer system after a crash to a previous and presumed correct state in which the computer system was operating immediately prior to the crash. Then, transactions whose continuations are impossible can be aborted. Much of the recovery research focuses on database recovery for database computer systems, such as network database servers or mainframe database systems. Imagine the problems caused when a large database system having many clients crashes in the midst of many simultaneous operations involving the retrieval, update, and storage of data records. Database system designers attempt to design database recovery techniques that minimize the amount of data lost in a system crash, minimize the amount of work needed following the crash to recover to the pre-crash operating state, and minimize the performance impact of recovery on the database system during normal operation.
FIG. 1 shows a database computer system 20 having a computing unit 22 with processing and computational capabilities 24 and a volatile main memory 26. The volatile main memory 26 is not persistent across crashes and hence is presumed to lose all of its data in the event of a crash. The computer system also has a non-volatile or stable database 28 and a stable log 30, both of which are contained on stable memory devices, e.g. magnetic disks, tapes, etc., connected to the computing unit 22. The stable database 28 and log 30 are presumed to persist across a system crash. The persistent database 28 and log 30 can be combined in the same storage, although they are illustrated separately for discussion purposes.
The volatile memory 26 stores one or more applications 32 and a resource manager 34. The resource manager 34 includes a volatile cache 36, which temporarily stores data destined for the stable database 28. The data is typically stored in the stable database and volatile cache in individual units, such as "pages." A cache manager 38 executes on the processor 24 to manage movement of data pages between the volatile cache 36 and the stable database 28. In particular, the cache manager 38 is responsible for deciding which data pages should be moved to the stable database 28 and when the data pages are moved. Data pages that are moved from the cache to the stable database are said to be "flushed" to the stable state. In other words, the cache manager 38 periodically flushes the cached state of a data page to the stable database 28 to produce a stable state of that data page which persists in the event of a crash, making recovery possible.
The resource manager 34 also has a volatile log 40 that temporarily stores log records for operations, which are to be moved into the stable log 30. A log manager 42 executes on the processor 24 to manage when the operations are moved from the volatile log 40 to the stable log 30. The transfer of an operation from the volatile log to the stable log is known as a log flush.
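The relationship between the volatile and stable structures described above can be illustrated with a minimal sketch. It is not the system of FIG. 1: the class name `ResourceManager`, its methods, and the use of in-memory dictionaries and lists as stand-ins for stable storage are all assumptions made for illustration.

```python
class ResourceManager:
    """Toy model: plain dicts and lists stand in for disk and main memory."""

    def __init__(self):
        self.stable_db = {}      # presumed to survive a crash
        self.stable_log = []     # presumed to survive a crash
        self.cache = {}          # volatile: page id -> cached value
        self.volatile_log = []   # volatile: log records not yet flushed

    def write(self, page_id, value):
        # Update the cached page and record the operation in the volatile log.
        self.cache[page_id] = value
        self.volatile_log.append((page_id, value))

    def flush_log(self):
        # Log flush: move all pending log records to the stable log.
        self.stable_log.extend(self.volatile_log)
        self.volatile_log.clear()

    def flush_page(self, page_id):
        # Page flush: write the cached state of one page to the stable database.
        self.stable_db[page_id] = self.cache[page_id]

    def crash(self):
        # A crash discards everything held in volatile memory.
        self.cache.clear()
        self.volatile_log.clear()
```

In this sketch, any write made after the last log flush is lost when `crash()` is called, while flushed pages and flushed log records remain available.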
During normal operation, an application 32 executes on the processor 24. The resource manager receives requests to perform operations on data from the application. As a result, data pages are transferred to the volatile cache 36 on demand from the stable database 28 for use by the application. During execution, the resource manager 34 reads, processes, and writes data to and from the volatile cache 36 on behalf of the application. The cache manager 38 determines, independently of the application, when the cached data state is flushed to the stable database 28.
Concurrently, the operations being performed by the resource manager on behalf of the application are being recorded in the volatile log 40. The log manager 42 determines, as guided by the cache manager and the transactional requirements imposed by the application, when the operations are posted as log records on the stable log 30. A logged operation is said to be "installed" when the versions of the pages containing the changes made by the operation have been flushed to the stable database.
When a crash occurs, the application state (i.e., address space) of any executing application 32, the data pages in volatile cache 36, and the operations in volatile log 40 all vanish. The computer system 20 invokes a recovery manager. The recovery manager begins at the last flushed state on the stable database 28 and replays the operations posted to the stable log 30 to restore the database of the computer system to the state as of the last stably logged operation just prior to the crash.
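The replay step can be sketched as follows. This is a simplified illustration, not the recovery manager itself: the function name `recover` is hypothetical, and each log record is assumed to be a (page id, new value) pair so that redoing a record is idempotent.

```python
def recover(stable_db, stable_log):
    """Return the recovered database state by replaying the stable log.

    Assumes each log record is a (page_id, new_value) pair, so that
    redoing an operation that was already flushed is harmless.
    """
    state = dict(stable_db)            # begin at the last flushed state
    for page_id, value in stable_log:  # replay in logged order
        state[page_id] = value         # redo the logged operation
    return state
```

For example, with a stable database holding `{"p1": 10}` and a stable log of `[("p1", 10), ("p2", 20), ("p1", 11)]`, replay yields the state as of the last stably logged operation. Operations that reached only the volatile log are not recovered.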
While database recovery techniques are helpful for recovering data, the database techniques offer no help in recovering an application from a system crash. Usually all active applications using the database are wiped out during a crash. Any state in an executing application is erased and cannot usually be continued across a crash.
There has been some work in designing recovery procedures that preserve applications across a system crash. One preferred approach is an application recovery system developed by David Lomet, one of the inventors of this invention. The application recovery system is described in a series of patent applications:
1. U.S. Ser. No. 08/814,808, entitled "Database Computer System With Application Recovery", filed Mar. 10, 1997;
2. U.S. Ser. No. 08/813,982, entitled "Database Computer System With Application Recovery And Dependency Handling Read Cache", filed Mar. 10, 1997;
3. U.S. Ser. No. 08/832,870, entitled "Database Computer System With Application Recovery And Dependency Handling Write Cache", filed Apr. 4, 1997; and
4. U.S. Ser. No. 08/826,610, entitled "Database Computer System With Application Recovery And Recovery Log Sequence Numbers To Optimize Recovery", filed Apr. 4, 1997.
All of these patent applications are assigned to Microsoft Corporation and are incorporated by reference. These applications are collectively referred to as the "Lomet applications" throughout this disclosure.
Another approach is to make the application "stateless." Between transactions, the application is in its initial state or a state internally derived from the initial state without reference to the persistent state of the database or to other input. If the application fails between transactions, there is nothing about the application state that cannot be re-created based on the static state of the stored form of the application. Should the transaction abort, the application is replayed, thereby re-executing the transaction as if the transaction executed somewhat later. After the transaction commits, the application returns to the initial state. Gray and Reuter describe this form of transaction processing in a book entitled, Transaction Processing: Concepts and Techniques, Morgan Kaufmann (1993), San Mateo, Calif.
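A minimal sketch of the stateless pattern follows. The names `run_stateless` and `TransactionAborted`, and the retry limit, are hypothetical and chosen only for illustration; they are not drawn from Gray and Reuter.

```python
import copy


class TransactionAborted(Exception):
    """Raised by a transaction body to signal an abort (hypothetical)."""


def run_stateless(transaction, initial_state, max_attempts=3):
    """Run a transaction so that no application state outlives it."""
    for _ in range(max_attempts):
        # Every attempt starts from a fresh copy of the initial state.
        state = copy.deepcopy(initial_state)
        try:
            # On commit the result is returned and the working state discarded.
            return transaction(state)
        except TransactionAborted:
            # On abort the application is replayed from the initial state.
            continue
    raise TransactionAborted(f"gave up after {max_attempts} attempts")
```

Because each attempt works on a copy, nothing a failed attempt did to its working state is visible to the replayed attempt.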
Another approach is to write persistent application checkpoints at every resource manager interaction. The notion here is that application states in between resource manager interactions can be re-created from the last such interaction forward. This is the technique described by Bartlett, "A NonStop Kernel," Proc. ACM Symp. on Operating System Principles (1981), pages 22-29, and by Borg et al., "A Message System Supporting Fault Tolerance," Proc. ACM Symp. on Operating System Principles (Oct. 1983), Bretton Woods, NH, pages 90-99.
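The checkpointing idea can be sketched as below. This is an illustration under stated assumptions, not the mechanism of the cited papers: the class `CheckpointedApp`, its state layout, and the dictionary used as a stand-in for stable storage are all hypothetical.

```python
import json


class CheckpointedApp:
    """Toy application that checkpoints after every resource manager interaction."""

    def __init__(self, stable_store):
        # stable_store is a dict standing in for storage that survives a crash.
        self.stable_store = stable_store
        saved = stable_store.get("checkpoint")
        # Resume from the last checkpoint if one exists, else from the initial state.
        self.state = json.loads(saved) if saved else {"total": 0, "steps": 0}

    def interact(self, read_from_resource_manager):
        # One interaction: obtain a value, update the application state,
        # then write a persistent checkpoint of that state.
        value = read_from_resource_manager()
        self.state["total"] += value
        self.state["steps"] += 1
        self.stable_store["checkpoint"] = json.dumps(self.state)
```

Constructing a new `CheckpointedApp` over the same store after a crash resumes from the state saved at the last interaction. The cost is one persistent write per interaction.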
The above application recovery techniques are all restricted to recovery local to or under the control of a single recoverable resource manager (database computer system). In the client-server context, however, these recovery techniques are difficult to apply to client-side applications that are interacting with the server.
Prior work on application fault-tolerance in distributed systems is based on some form of application "installation points" and/or "message logging". The prior work can be categorized into the following three approaches, all of which incur high normal operation and/or recovery costs.
1. Fault-tolerant Process Pairs: This approach has aimed to build fault tolerance into the operating system by providing each critical process with a hot-standby backup process, usually on a different processor. When the primary process fails, the backup process takes over and re-executes the application starting from the most recent installation point that has been generated by the primary process. Messages that would be repeated, especially output to the human user, are suppressed during the re-executed path, based on testing sequence numbers against logged messages. While this approach was a pioneering one in the early eighties, it is a heavyweight solution that can be justified only for the most mission-critical high-end applications. The reason for this is that the method requires either an application installation point or a forced message log record at every process interaction. The frequently required disk I/O greatly limits the achievable throughput of both the server and the clients. This approach is described in the above cited Bartlett and Borg papers, as well as by Borr, "Transaction Monitoring in Encompass: Reliable Distributed Transaction Processing," VLDB Conference, Cannes (1981) and by Kim, "Highly Available Systems for Database Applications," ACM Computing Surveys, Vol. 16, No. 1 (1984), pp. 71-98.
2. Distributed State Tracking: This approach is based on a model of communicating processes. Processes generate installation points only occasionally and independently of each other. In addition, messages are logged in an optimistic, non-forced manner. When a process fails, it restarts from its most recent installation point, but other processes may also be forced to restart from a former state to guarantee a causally consistent global state. Thus, this line of methods incurs recovery dependencies among the various processes that would be unacceptable for a database server. Furthermore, the eventually restored global state is not necessarily the most recent, externally observed state. This is tolerable when restarting long-running distributed "number-crunching"-style computations, an initial target of this work, but would not mask all application failures from the human user (unless more stringent, forced logging were employed). It is exactly for this reason that this algorithmically deep work has had very little impact on real systems.
A variation of message logging that eliminates recovery dependencies is pessimistic message logging. Unfortunately, this approach is very conservative and thus expensive in that it forces every log record to disk immediately. In general, most of the research in this category ignored both the necessity to minimize logging I/O costs and the importance of log truncation for fast restart. Rather, it overemphasized communication costs, which are less of an issue with modern networks.
3. Persistent Queues: The third line of solutions restricts all interactions between processes to be via persistent queues. Here, when a process sends a message to another process, the sender explicitly enqueues the message to a persistent queue. This takes place within the boundaries of a distributed transaction involving the queue and the sender; so it incurs the high forced logging I/O costs of a two-phase commit protocol. Moreover, the same protocol is used when the receiver dequeues the message. This solution has been very successful in the context of transaction-structured applications such as reservation systems (including "pseudo-conversational" applications), and is even suitable for heterogeneous platforms. However, its disk I/O costs are very high, and applications must be decomposed completely into sequences of transactions with no application state outside of the queued messages.
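The persistent queue approach can be illustrated with a small single-node sketch that uses Python's standard `sqlite3` module. It is only a stand-in: a real deployment would span the sender, the queue, and the receiver in a distributed transaction with two-phase commit, which this sketch does not attempt. The table name `queue` and the functions `open_queue`, `enqueue`, and `dequeue` are hypothetical.

```python
import sqlite3


def open_queue(path=":memory:"):
    # Pass a file path instead of ":memory:" for a queue that persists on disk.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS queue ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def enqueue(conn, body):
    # The connection context manager commits on success, rolls back on error.
    with conn:
        conn.execute("INSERT INTO queue (body) VALUES (?)", (body,))


def dequeue(conn, handler):
    """Remove the oldest message and process it in the same transaction."""
    with conn:
        row = conn.execute(
            "SELECT id, body FROM queue ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("DELETE FROM queue WHERE id = ?", (row[0],))
        handler(row[1])  # if this raises, the DELETE is rolled back
        return row[1]
```

If the handler fails partway through, the message stays in the queue and is delivered again on the next `dequeue`. Each enqueue and dequeue commits a transaction, which is the source of the forced logging cost noted above.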
Despite these efforts, there remains a need to improve application recovery techniques in client-server systems. Particularly, there is a need to attain reliable recovery, while minimizing the logging costs and enabling fast restart. The inventors have developed such application recovery techniques for client-server systems.