Computer systems occasionally crash. A “system crash” is an event in which the computer quits operating the way it is supposed to operate. Common causes of system crashes include power outage, application operating error, and computer goblins (i.e., unknown and often unexplained malfunctions that tend to plague even the best-devised systems and applications). System crashes are unpredictable, and hence, essentially impossible to anticipate and prevent.
A system crash is at the very least annoying, and may result in serious or irreparable damage. For standalone computers or client workstations, a local system crash typically results in loss of work product since the last save interval. The user is inconvenienced by having to reboot the computer and redo the lost work. For servers and larger computer systems, a system crash can have a devastating impact on many users, including both company employees and customers.
Being unable to prevent system crashes, computer system designers attempt to limit the effect of system crashes. The field of study concerning how computers recover from system crashes is known as “recovery.” Recovery from system crashes has been the subject of much research and development.
In general, the goal of redo recovery is to return the computer system after a crash to a previous and presumed correct state in which the computer system was operating immediately prior to the crash. Then, transactions whose continuations are impossible can be aborted. Much of the recovery research focuses on database recovery for database computer systems, such as network database servers or mainframe database systems. Imagine the problems caused when a large database system having many clients crashes in the midst of many simultaneous operations involving the retrieval, update, and storage of data records. Database system designers attempt to design database recovery techniques that minimize the amount of data lost in a system crash, minimize the amount of work needed following the crash to recover to the pre-crash operating state, and minimize the performance impact of recovery on the database system during normal operation.
FIG. 1 shows a database computer system 20 having a computing unit 22 with processing and computational capabilities 24 and a volatile main memory 26. The volatile main memory 26 is not persistent across crashes and hence is presumed to lose all of its data in the event of a crash. The computer system also has a non-volatile or stable database 28 and a stable log 30, both of which are contained on stable memory devices, e.g. magnetic disks, tapes, etc., connected to the computing unit 22. The stable database 28 and log 30 are presumed to persist across a system crash. The persistent database 28 and log 30 can be combined in the same storage, although they are illustrated separately for discussion purposes.
The volatile memory 26 stores one or more applications 32, which execute on the processor 24, and a resource manager 34. The resource manager 34 includes a volatile cache 36, which temporarily stores data destined for the stable database 28. The data is typically stored in the stable database and volatile cache in individual units, such as “pages.” A cache manager 38 executes on the processor 24 to manage movement of data pages between the volatile cache 36 and the stable database 28. In particular, the cache manager 38 is responsible for deciding which data pages should be moved to the stable database 28 and when the data pages are moved. Data pages which are moved from the cache to the stable database are said to be “flushed” to the stable state. In other words, the cache manager 38 periodically flushes the cached state of a data page to the stable database 28 to produce a stable state of that data page which persists in the event of a crash, making recovery possible.
The resource manager 34 also has a volatile log 40 which temporarily stores computing operations to be moved into the stable log 30. A log manager 42 executes on the processor 24 to manage when the operations are moved from the volatile log 40 to the stable log 30. The transfer of an operation from the volatile log to the stable log is known as a log flush.
During normal operation, an application 32 executes on the processor 24. The resource manager receives requests to perform operations on data from the application. As a result, data pages are transferred to the volatile cache 36 on demand from the stable database 28 for use by the application. During execution, the resource manager 34 reads, processes, and writes data to and from the volatile cache 36 on behalf of the application. The cache manager 38 determines, independently of the application, when the cached data state is flushed to the stable database 28.
Concurrently, the operations being performed by the resource manager on behalf of the application are being recorded in the volatile log 40. The log manager 42 determines, as guided by the cache manager and the transactional requirements imposed by the application, when the operations are posted as log records on the stable log 30. A logged operation is said to be “installed” when the versions of the pages containing the changes made by the operation have been flushed to the stable database.
When a crash occurs, the application state (i.e., address space) of any executing application 32, the data pages in volatile cache 36, and the operations in volatile log 40 all vanish. The computer system 20 invokes a recovery manager which begins at the last flushed state on the stable database 28 and replays the operations posted to the stable log to restore the database of the computer system to the state as of the last logged operation just prior to the crash.
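The crash and replay sequence described above can be illustrated with a minimal sketch. All names here (`System`, `write`, `flush_page`, `flush_log`, `crash`, `recover`) are hypothetical and are not taken from the described system. The sketch assumes that each logged operation simply sets a page to a value, so that replaying it more than once is harmless, and it assumes that the log is flushed before a page is flushed.

```python
from dataclasses import dataclass, field


@dataclass
class System:
    # Stable storage: presumed to persist across a crash.
    stable_db: dict = field(default_factory=dict)     # page id -> value
    stable_log: list = field(default_factory=list)    # (page id, new value) records
    # Volatile storage: presumed lost in a crash.
    cache: dict = field(default_factory=dict)         # page id -> value
    volatile_log: list = field(default_factory=list)  # records not yet posted

    def write(self, page, value):
        """Apply an operation in the cache and record it in the volatile log."""
        self.cache[page] = value
        self.volatile_log.append((page, value))

    def flush_log(self):
        """Log flush: post the volatile log records to the stable log."""
        self.stable_log.extend(self.volatile_log)
        self.volatile_log.clear()

    def flush_page(self, page):
        """Flush one cached page to the stable database.

        Assumption: the log is flushed first, so the stable database never
        contains a change that is missing from the stable log.
        """
        self.flush_log()
        self.stable_db[page] = self.cache[page]

    def crash(self):
        """A crash erases everything held in volatile memory."""
        self.cache.clear()
        self.volatile_log.clear()

    def recover(self):
        """Start from the stable database and replay the stable log in order."""
        for page, value in self.stable_log:
            self.stable_db[page] = value
```

In this sketch, an operation that was still in the volatile log at the time of the crash is lost, while every operation that reached the stable log is reflected in the recovered state, whether or not its page had been flushed.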
Explaining how to recover from a system crash requires answering some fundamental questions.
1. How can the designer be sure that recovery will succeed?
2. How can the stable state be explained in terms of what operations have been installed and what operations have not?
3. How should recovery choose the operations to redo in order to recover an explainable state?
4. How should the cache manager install operations via its flushing of database pages to the stable state in order to keep the state explainable, and hence recoverable?
The answers to these questions can be found in delicately balanced and highly interdependent decisions that a system designer makes.
One prior art approach to database recovery is to require the cache manager to flush the entire cache state periodically. The last such flushed state is identified in a “checkpoint record” that is inserted into the stable log. During recovery, a redo test is performed to determine whether a logged operation needs to be redone to help restore the system to its pre-crash state. The redo test is simply whether an operation follows the last checkpoint record on the log. If so (meaning that a later operation occurred and was posted to the stable log, but the results of the operation were not installed in the stable database), the computer system performs a redo operation using the log record.
This simple approach has a major drawback in that writing every change of the cached state out to the stable database 28 is practically infeasible. It involves a high volume of input/output (I/O) activity that consumes a disproportionate amount of processing resources and slows the system operation. It also requires atomic flushing of multiple pages, which is a troublesome complication. This was the approach used in System R, described in: Gray, McJones, et al., “The Recovery Manager of the System R Database Manager,” ACM Computing Surveys 13,2 (June, 1981) pages 223–242.
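The checkpoint-based redo test described above can be sketched as follows. The log representation (a list in which the hypothetical marker `CHECKPOINT` stands for a checkpoint record) and the function name `operations_to_redo` are assumptions made for illustration only.

```python
CHECKPOINT = "CHECKPOINT"  # hypothetical marker for a checkpoint record


def operations_to_redo(stable_log):
    """Return the logged operations that follow the last checkpoint record.

    The redo test is positional: an operation is redone if and only if it
    appears after the last checkpoint record on the stable log.
    """
    last_checkpoint = -1
    for index, record in enumerate(stable_log):
        if record == CHECKPOINT:
            last_checkpoint = index
    return stable_log[last_checkpoint + 1:]
```

If the log contains no checkpoint record, every logged operation passes the test. If the checkpoint record is the last entry, nothing needs to be redone.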
Another prior art approach to database recovery, which is more widely adopted and used in present-day database systems, involves segmenting data from the stable database into individual fixed units, such as pages. Individual pages are loaded into the volatile cache and logged resource manager operations can read and write only within the single pages, thereby modifying individual pages. The cache manager does not flush the page after every incremental change.
Each page can be flushed atomically to the stable database, and independently of any other page. Intelligently flushing a page after several updates have been made to the page produces essentially the same result as flushing each page after every update is made. That is, flushing a page necessarily includes all of the incremental changes made to that page leading up to the point when the flushing occurs.
The cache manager assigns a monotonically increasing state ID to the page each time the page is updated. During recovery, each page is treated as if it were a separate database. Resource manager operations posted to the stable log are also assigned a state ID. A redo test compares, for each page, the state ID of a stable log record with the state ID of the stable page. If the log record state ID is greater than the state ID of the stable page (meaning that one or more operations occurred later and were recorded in the stable log, but the page containing updates caused by the later operations was not yet flushed to the stable database), the computer system performs a redo operation using the last stable page and the operations posted to the stable log that have state IDs higher than the state ID of the stable page.
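The per-page redo test can be sketched as follows. The `Page` and `LogRecord` types, the `recover` function, and the choice of an "add `delta` to the page value" operation are hypothetical. The sketch also assumes that state IDs are compared only between a log record and the page that the record updates.

```python
from dataclasses import dataclass


@dataclass
class Page:
    value: int
    state_id: int  # state ID of the last update contained in this page


@dataclass
class LogRecord:
    page_id: str
    state_id: int  # state ID assigned to the logged operation
    delta: int     # hypothetical operation: add delta to the page value


def recover(stable_pages, stable_log):
    """Redo each logged operation whose state ID exceeds that of its page."""
    for record in stable_log:
        # A page that was never flushed is treated as empty, with state ID 0.
        page = stable_pages.setdefault(record.page_id, Page(value=0, state_id=0))
        if record.state_id > page.state_id:
            # The stable page does not yet contain this update: redo it.
            page.value += record.delta
            page.state_id = record.state_id
        # Otherwise the update is already in the stable page and is skipped.
    return stable_pages
```

Because the test skips any operation already reflected in the stable page, an operation that is not idempotent, such as the increment used here, is never applied twice, even if recovery itself is interrupted and run again.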
While these database recovery techniques are helpful for recovering data in the database, they offer no help in recovering an application from a system crash. Usually all active applications using the database are wiped out during a crash. Any state in an executing application is erased and usually cannot be continued across a crash.
FIG. 2 shows a prior art system architecture of the database computer system 20. The applications 32(1)–32(N) execute on the computer to perform various tasks and functions. During execution, the applications interact with the resource manager 34, with each other, and with external devices, as represented by an end user terminal 44. The application states can change as a result of application execution, interaction with the resource manager 34, interaction with each other, and interaction with the terminal 44. In conventional systems, the application states of the executing applications 32(1)–32(N) are not captured. There is no mechanism in place to track the application state as it changes, and hence, there is no way to recover an application from a crash which occurs during its execution.
When the application is simple and short, the fact that applications are not recoverable is of little consequence. For example, in financial applications like debit/credit, there may be nothing to recover that was not already captured by the state change within the stable database. But this might not always be the case. Long-running applications, which frequently characterize workflow systems, present problems. Like long transactions that are aborted, a crash-interrupted application may need to be re-scheduled manually to bring the application back online. Applications can span multiple database transactions, so that following a system crash, the system state might contain an incomplete execution of the application. Cleanly coping with partially completed executions can be very difficult. One cannot simply re-execute the entire activity because the partially completed prior execution has altered the state. Further, because some state changes may have been installed in the stable database, one cannot simply undo the entire activity because the transactions are guaranteed by the system to be persistent. The transactions might not be undoable in any event because the system state may have changed in an arbitrary way since they were executed.
Accordingly, there is a need for recovery procedures for preserving applications across a system crash. Conceptually, the entire application state (i.e., the address space) could be posted to the stable log after each operation. This would permit immediate recovery of the application because the system would know exactly, from the last log entry for the application, the entire application state just prior to the crash. Unfortunately, the address space is typically very large, and continuously logging such large entries is too expensive in terms of I/O processing resources and the large amounts of memory required to hold successive images of the application state.
There are several prior art techniques that have been proposed for application recovery. All have difficulties that restrict their usefulness. One approach is to make the application “stateless.” Between transactions, the application is in its initial state or a state internally derived from the initial state without reference to the persistent state of the database. If the application fails between transactions, there is nothing about the application state that cannot be re-created based on the static state of the stored form of the application. Should the transaction abort, the application is replayed, thereby re-executing the transaction as if the transaction executed somewhat later. After the transaction commits, the application returns to the initial state. This form of transaction processing is described by Gray and Reuter in a book entitled, Transaction Processing: Concepts and Techniques, Morgan Kaufmann (1993), San Mateo, Calif.
Another approach is to reduce the application state to some manageable size and use a recoverable resource manager to store it. The resource manager might be a database or a recoverable queue. Reducing state size can be facilitated by the use of a scripting language for the application. In this case, the script language interpreter stores the entire application state at well-chosen times so that the application can survive failures at inappropriate moments and its execution can continue from the saved point.
Another technique is to use a persistent programming language that logs updates to a persistent state. The idea is to support recoverable storage for processes. When the entire state of the application is contained in recoverable storage, the application itself can be recovered. Recoverable storage has been handled by supporting a virtual memory abstraction with updates to memory locations logged during program execution. If the entire application state is made recoverable, a very substantial amount of logging activity arises. This technique is described in the following publications: Chang and Mergen, “801 Storage: Architecture and Programming,” ACM Trans. on Computer Systems, 6, 1 (February 1988) pages 28–50; and Haskin et al., “Recovery Management in QuickSilver,” ACM Trans. on Computer Systems, 6,1 (February 1988) pages 82–108.
Another approach is to write persistent application checkpoints at every resource manager interaction. The notion here is that application states in between resource manager interactions can be re-created from the last such interaction forward. This is the technique described by Bartlett, “A NonStop Kernel,” Proc. ACM Symp. on Operating System Principles (1981) pages 22–29 and Borg et al. “A Message System Supporting Fault Tolerance,” Proc. ACM Symp. on Operating System Principles (October 1983) Bretton Woods, NH pages 90–99. The drawback with this approach is that short code sequences between interactions can mean frequent checkpointing of very large states as the state changes are not captured via operations, although paging techniques can be used to capture the differences between successive states at, perhaps, page level granularity.
The inventor has developed an improved recovery technique that breaks apart flush dependencies that require atomic flushing of more than one object simultaneously. This enables an ordered flushing sequence of first flushing a first object and then flushing a second object, rather than having to flush both the first and second objects simultaneously and atomically.