This invention relates to database computer systems and applications that execute on them. More particularly, this invention relates to methods for recovering from system crashes in a manner that ensures that the applications themselves persist across the crash.
Computer systems occasionally crash. A "system crash" is an event in which the computer quits operating the way it is supposed to operate. Common causes of system crashes include power outage, application operating error, and computer goblins (i.e., unknown and often unexplained malfunctions that tend to plague even the best-devised systems and applications). System crashes are unpredictable, and hence, essentially impossible to anticipate and prevent.
A system crash is at the very least annoying, and may result in serious or irreparable damage. For standalone computers or client workstations, a local system crash typically results in loss of work product since the last save interval. The user is inconvenienced by having to reboot the computer and redo the lost work. For servers and larger computer systems, a system crash can have a devastating impact on many users, including both company employees as well as its customers.
Being unable to prevent system crashes, computer system designers attempt to limit the effect of system crashes. The field of study concerning how computers recover from system crashes is known as "recovery." Recovery from system crashes has been the subject of much research and development.
In general, the goal of redo recovery is to return the computer system after a crash to a previous and presumed correct state in which the computer system was operating immediately prior to the crash. Then, transactions whose continuations are impossible can be aborted. Much of the recovery research focuses on database recovery for database computer systems, such as network database servers or mainframe database systems. Imagine the problems caused when a large database system having many clients crashes in the midst of many simultaneous operations involving the retrieval, update, and storage of data records. Database system designers attempt to design database recovery techniques that minimize the amount of data lost in a system crash, minimize the amount of work needed following the crash to recover to the pre-crash operating state, and minimize the performance impact of recovery on the database system during normal operation.
FIG. 1 shows a database computer system 20 having a computing unit 22 with processing and computational capabilities 24 and a volatile main memory 26. The volatile main memory 26 is not persistent across crashes and hence is presumed to lose all of its data in the event of a crash. The computer system also has a non-volatile or stable database 28 and a stable log 30, both of which are contained on stable memory devices, e.g. magnetic disks, tapes, etc., connected to the computing unit 22. The stable database 28 and log 30 are presumed to persist across a system crash. The persistent database 28 and log 30 can be combined in the same storage, although they are illustrated separately for discussion purposes.
The volatile memory 26 stores one or more applications 32, which execute on the processor 24, and a resource manager 34. The resource manager 34 includes a volatile cache 36, which temporarily stores data destined for the stable database 28. The data is typically stored in the stable database and volatile cache in individual units, such as "pages." A cache manager 38 executes on the processor 24 to manage movement of data pages between the volatile cache 36 and the stable database 28. In particular, the cache manager 38 is responsible for deciding which data pages should be moved to the stable database 28 and when the data pages are moved. Data pages which are moved from the cache to the stable database are said to be "flushed" to the stable state. In other words, the cache manager 38 periodically flushes the cached state of a data page to the stable database 28 to produce a stable state of that data page which persists in the event of a crash, making recovery possible.
The resource manager 34 also has a volatile log 40 which temporarily stores computing operations to be moved into the stable log 30. A log manager 42 executes on the processor 24 to manage when the operations are moved from the volatile log 40 to the stable log 30. The transfer of an operation from the volatile log to the stable log is known as a log flush.
During normal operation, an application 32 executes on the processor 24. The resource manager receives requests to perform operations on data from the application. As a result, data pages are transferred to the volatile cache 36 on demand from the stable database 28 for use by the application. During execution, the resource manager 34 reads, processes, and writes data to and from the volatile cache 36 on behalf of the application. The cache manager 38 determines, independently of the application, when the cached data state is flushed to the stable database 28.
Concurrently, the operations being performed by the resource manager on behalf of the application are being recorded in the volatile log 40. The log manager 42 determines, as guided by the cache manager and the transactional requirements imposed by the application, when the operations are posted as log records on the stable log 30. A logged operation is said to be "installed" when the versions of the pages containing the changes made by the operation have been flushed to the stable database.
When a crash occurs, the application state (i.e., address space) of any executing application 32, the data pages in volatile cache 36, and the operations in volatile log 40 all vanish. The computer system 20 invokes a recovery manager which begins at the last flushed state on the stable database 28 and replays the operations posted to the stable log 30 to restore the database of the computer system to the state as of the last logged operation just prior to the crash.
Explaining how to recover from a system crash requires answering some fundamental questions.
1. How can the designer be sure that recovery will succeed?
2. How can the stable state be explained in terms of what operations have been installed and what operations have not?
3. How should recovery choose the operations to redo in order to recover an explainable state?
4. How should the cache manager install operations via its flushing of database pages to the stable state in order to keep the state explainable, and hence recoverable?
The answers to these questions can be found in delicately balanced and highly interdependent decisions that a system designer makes.
One prior art approach to database recovery is to require the cache manager to flush the entire cache state periodically. The last such flushed state is identified in a "checkpoint record" that is inserted into the stable log. During recovery, a redo test is performed to determine whether a logged operation needs to be redone to help restore the system to its pre-crash state. The redo test is simply whether an operation follows the last checkpoint record on the log. If so (meaning that a later operation occurred and was posted to the stable log, but the results of the operation were not installed in the stable database), the computer system performs a redo operation using the log record.
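The checkpoint-based redo test described above can be illustrated with a minimal Python sketch. The `LogRecord` type, its `kind` and `payload` fields, and the function name `ops_to_redo` are hypothetical names introduced only for illustration; they are not taken from any described system.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    # Hypothetical record layout: kind is either "op" or "checkpoint".
    kind: str
    payload: str = ""


def ops_to_redo(stable_log):
    """Return the payloads of operations that follow the last checkpoint record."""
    last_checkpoint = -1
    for index, record in enumerate(stable_log):
        if record.kind == "checkpoint":
            last_checkpoint = index
    # Everything after the last checkpoint passes the redo test.
    return [r.payload for r in stable_log[last_checkpoint + 1:] if r.kind == "op"]


example_log = [
    LogRecord("op", "a"),
    LogRecord("checkpoint"),
    LogRecord("op", "b"),
    LogRecord("op", "c"),
]
print(ops_to_redo(example_log))
```

If the log holds no checkpoint record at all, this sketch treats every logged operation as needing redo.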
This simple approach has a major drawback in that writing every change of the cached state out to the stable database 28 is practically infeasible. It involves a high volume of input/output (I/O) activity that consumes a disproportionate amount of processing resources and slows the system operation. It also requires atomic flushing of multiple pages, which is a troublesome complication. This was the approach used in System R, described in: Gray, McJones, et al., "The Recovery Manager of the System R Database Manager," ACM Computing Surveys 13, 2 (June 1981) pages 223-242.
Another prior art approach to database recovery, which is more widely adopted and used in present-day database systems, involves segmenting data from the stable database into individual fixed units, such as pages. Individual pages are loaded into the volatile cache and logged resource manager operations can read and write only within the single pages, thereby modifying individual pages. The cache manager does not flush the page after every incremental change.
Each page can be flushed atomically to the stable database, and independently of any other page. Intelligently flushing a page after several updates have been made to the page produces essentially the same result as flushing each page after every update is made. That is, flushing a page necessarily includes all of the incremental changes made to that page leading up to the point when the flushing occurs.
The cache manager assigns a monotonically increasing state ID to the page each time the page is updated. During recovery, each page is treated as if it were a separate database. Resource manager operations posted to the stable log are also assigned a state ID. A redo test compares, for each page, the state ID of a stable log record with the state ID of the stable page. If the log record state ID is greater than the state ID of the stable page (meaning that one or more operations occurred later and were recorded in the stable log, but the page containing updates caused by the later operations was not yet flushed to the stable database), the computer system performs a redo operation using the last stable page and the operations posted to the stable log that have state IDs higher than the state ID of the stable page.
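A minimal Python sketch of the per-page redo test follows. It assumes, purely for illustration, that each page holds a single integer and that each logged operation adds a `delta` to it; `PageLogRecord` and `recover_pages` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PageLogRecord:
    page_id: str
    state_id: int
    delta: int  # hypothetical operation: add delta to the page's value


def recover_pages(stable_pages, stable_log):
    """stable_pages maps page_id -> (state_id, value) as last flushed."""
    pages = dict(stable_pages)
    for record in stable_log:
        page_state_id, value = pages[record.page_id]
        # Redo test: replay only operations newer than the page's current state.
        if record.state_id > page_state_id:
            pages[record.page_id] = (record.state_id, value + record.delta)
    return pages


stable = {"P": (2, 30), "Q": (0, 0)}
log = [
    PageLogRecord("P", 1, 10),  # already reflected in the flushed page P
    PageLogRecord("P", 2, 20),  # already reflected in the flushed page P
    PageLogRecord("P", 3, 5),   # logged but not flushed: must be redone
    PageLogRecord("Q", 1, 7),   # logged but not flushed: must be redone
]
print(recover_pages(stable, log))
```

Because each page carries its own state ID, the pages are recovered independently of one another, which is what allows each page to be treated as if it were a separate database.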
While these database recovery techniques are helpful for recovering data in the database, they offer no help in recovering an application from a system crash. Usually all active applications using the database are wiped out during a crash. Any state in an executing application is erased and cannot usually be continued across a crash.
FIG. 2 shows a prior art system architecture of the database computer system 20. The applications 32(1)-32(N) execute on the computer to perform various tasks and functions. During execution, the applications interact with the resource manager 26, with each other, and with external devices, as represented by an end user terminal 44. The application states can change as a result of application execution, interaction with the resource manager 26, interaction with each other, and interaction with the terminal 44. In conventional systems, the application states of the executing applications 32(1)-32(N) are not captured. There is no mechanism in place to track the application state as it changes, and hence, there is no way to recover an application from a crash which occurs during its execution.
When the application is simple and short, the fact that applications are not recoverable is of little consequence. For example, in financial applications like debit/credit, there may be nothing to recover that was not already captured by the state change within the stable database. But this might not always be the case. Long-running applications, which frequently characterize workflow systems, present problems. Like long transactions that are aborted, a crash-interrupted application may need to be re-scheduled manually to bring the application back online. Applications can span multiple database transactions, so that following a system crash, the system state might contain an incomplete execution of the application. Cleanly coping with partially completed executions can be very difficult. One cannot simply re-execute the entire activity because the partially completed prior execution has altered the state. Further, because some state changes may have been installed in the stable database, one cannot simply undo the entire activity because the transactions are guaranteed by the system to be persistent. The transactions might not be undoable in any event because the system state may have changed in an arbitrary way since they were executed.
Accordingly, there is a need for recovery procedures for preserving applications across a system crash. Conceptually, the entire application state (i.e., the address space) could be posted to the stable log after each operation. This would permit immediate recovery of the application because the system would know exactly, from the last log entry for the application, the entire application state just prior to the crash. Unfortunately, the address space is typically very large and continuously logging such large entries is too expensive in terms of I/O processing resources and the large amounts of memory required to hold successive images of the application state.
There are several prior art techniques that have been proposed for application recovery. All have difficulties that restrict their usefulness. One approach is to make the application "stateless." Between transactions, the application is in its initial state or a state internally derived from the initial state without reference to the persistent state of the database. If the application fails between transactions, there is nothing about the application state that cannot be recreated based on the static state of the stored form of the application. Should the transaction abort, the application is replayed, thereby re-executing the transaction as if the transaction executed somewhat later. After the transaction commits, the application returns to the initial state. This form of transaction processing is described by Gray and Reuter in a book entitled Transaction Processing: Concepts and Techniques, Morgan Kaufmann (1993), San Mateo, Calif.
Another approach is to reduce the application state to some manageable size and use a recoverable resource manager to store it. The resource manager might be a database or a recoverable queue. Reducing state size can be facilitated by the use of a scripting language for the application. In this case, the script language interpreter stores the entire application state at well-chosen times so that the application can survive failures at inappropriate moments and execution can continue from the saved point.
Another technique is to use a persistent programming language that logs updates to a persistent state. The idea is to support recoverable storage for processes. When the entire state of the application is contained in recoverable storage, the application itself can be recovered. Recoverable storage has been handled by supporting a virtual memory abstraction with updates to memory locations logged during program execution. If the entire application state is made recoverable, a very substantial amount of logging activity arises. This technique is described in the following publications: Chang and Mergen, "801 Storage: Architecture and Programming," ACM Trans. on Computer Systems, 6, 1 (February 1988) pages 28-50; and Haskin et al., "Recovery Management in QuickSilver," ACM Trans. on Computer Systems, 6, 1 (February 1988) pages 82-108.
Another approach is to write persistent application checkpoints at every resource manager interaction. The notion here is that application states in between resource manager interactions can be re-created from the last such interaction forward. This is the technique described by Bartlett, "A NonStop Kernel," Proc. ACM Symp. on Operating System Principles (1981) pages 22-29 and Borg et al., "A Message System Supporting Fault Tolerance," Proc. ACM Symp. on Operating System Principles (October 1983) Bretton Woods, N.H. pages 90-99. The drawback with this approach is that short code sequences between interactions can mean frequent checkpointing of very large states as the state changes are not captured via operations, although paging techniques can be used to capture the differences between successive states at, perhaps, page level granularity.
The inventor has developed an improved application recovery technique.
This invention concerns a database computer system and method for making applications recoverable from system crashes. The application state (i.e., address space) is treated as a single object that can be atomically flushed in a manner akin to flushing individual pages in database recovery techniques. And like the pages of the database, log records describing application state changes are posted on the stable log before application state is flushed.
To enable this monolithic treatment of the application, executions performed by the application are mapped to loggable operations which are posted to the stable log. Any modifications to the application state are accumulated and the application state is flushed from time to time to stable storage using an atomic write procedure. Flushing the application state to stable storage effectively installs the application operations logged in the stable log. Since the application state can be very large, a procedure known as "shadowing" can be used to atomically flush the entire application state. As a result, the application recovery integrates with database recovery, and substantially reduces the need for checkpointing applications, i.e. logging or flushing the entire application state. According to one implementation, a database computer system has a processing unit, a volatile main memory that does not persist across a system crash, and a stable memory that persists across a system crash. The volatile memory includes a volatile cache which maintains cached states of the application address space and data records and a volatile log which tracks the operations performed by the computer system. The stable memory includes a stable database which stores stable states of the application address space and data records and a stable log which holds a stable version of the log records that describe state changes to the stable database.
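The atomic flush of a large application state mentioned above can be approximated in Python by writing a shadow copy and then switching to it in one step. This is only a sketch of the general idea under stated assumptions: it uses a temporary file plus `os.replace` as a stand-in for shadowing, it serializes the state with `pickle`, and the function names `flush_state_atomically` and `load_state` are hypothetical.

```python
import os
import pickle
import tempfile


def flush_state_atomically(state, path):
    """Write state to a shadow file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, shadow = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force the shadow copy to stable storage
        # The old version stays intact until this single rename succeeds.
        os.replace(shadow, path)
    except BaseException:
        if os.path.exists(shadow):
            os.unlink(shadow)
        raise


def load_state(path):
    """Read back the most recently flushed state."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

A crash before the rename leaves the previous flushed state untouched, and a crash after it leaves the new state in place, so a reader never observes a partially written state.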
The database computer system has at least one application which executes from the main memory on the processing unit. A resource manager is stored in main memory and mediates all interaction between the application and the external world (e.g., user terminal, data file, another application, etc.). During execution, the internal state changes of the application are not visible to the outside world. However, each time the application interacts with the resource manager, either the application state is exposed or the application senses the external state. The resource manager tags the application states at these interaction points by assigning them state IDs. Application operations are defined that produce the transitions between these application states. These operations are immediately entered into the volatile log, and subsequently posted to the stable log.
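As a rough illustration of the interaction points described above, the following Python sketch tags each interaction with a state ID, records it in a volatile log, and later posts it to a stable log. The class name `ResourceManager`, its methods, and the tuple layout of the log records are assumptions made for this example, with in-memory lists standing in for both logs.

```python
class ResourceManager:
    """Hypothetical mediator that logs application operations at interaction points."""

    def __init__(self):
        self.next_state_id = 1
        self.volatile_log = []  # lost in a crash
        self.stable_log = []    # stands in for the persistent log

    def interact(self, app_id, description):
        """Tag the interaction with a state ID and enter it into the volatile log."""
        state_id = self.next_state_id
        self.next_state_id += 1
        self.volatile_log.append((state_id, app_id, description))
        return state_id

    def flush_log(self):
        """Post all pending volatile log records to the stable log, in order."""
        self.stable_log.extend(self.volatile_log)
        self.volatile_log.clear()


rm = ResourceManager()
first = rm.interact("app-1", "read object X")
second = rm.interact("app-1", "write object Y")
rm.flush_log()
print(rm.stable_log)
```

Only the operations between interaction points are logged here, not the application state itself, which is the property that keeps the log small.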
The application state is treated as a single object that can be atomically flushed to the stable database. In addition, the application operations often cause changes to the data pages, records, or other types of objects stored in the volatile cache. The modified objects that result from application operations are from time to time flushed to the stable database. The flushed application states and objects are assigned state IDs to identify their place in the execution sequence. Flushing the application object effectively installs all operations updating the application that are in the stable log and have earlier state IDs.
In the event of a system failure, the database computer system begins with the stable database state and replays the stable log to redo certain logged application operations. The database computer system redoes a logged application operation if its state ID is later in series than the state ID of the most recently flushed or already partially recovered application state.
Another aspect of this invention is to optimize the application read operation to avoid writing the object data read to the log record. Posting the read values to the log is helpful in one sense because the cache manager need not be concerned about the sequence in which objects are flushed. Certain object states need not be preserved by a particular flushing order because any data values obtained from an object which are needed to redo an application operation are available directly from the stable log. However, posting objects to the log often involves writing large amounts of data, and duplicating data found elsewhere on the system.
The read optimizing technique eliminates posting the read values to the log by substituting, for the read values, an identity of the location from where the values are read and posting the identity instead of the values. However, the data is now only available from the read object itself and hence, attention must be paid to the order in which objects are flushed to stable storage. If objects are flushed out of proper sequence, a particular state of an object may be irretrievably lost.
A cache manager has an object table which tracks the objects maintained in the volatile cache. The object table includes fields to track dependencies among the objects. In one implementation, the object table includes, for each object entry, a predecessor field which lists all objects that must be flushed prior to the subject object, and a successor field which lists all objects before which the subject object must be flushed. In another implementation, the object table contains, for each object entry, a node field to store dependencies in terms of their nodes in a write graph formulation.
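One way to picture the predecessor field of the object table is as a map from each object to the set of objects that must be flushed first. The Python sketch below derives a valid flushing sequence from such a map using the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later). The function name `flush_order` and the object names are hypothetical, and the sketch assumes the dependencies contain no cycles.

```python
from graphlib import TopologicalSorter


def flush_order(predecessors):
    """predecessors maps each object to the set of objects that must be flushed first."""
    # static_order() yields every object after all of its predecessors.
    return list(TopologicalSorter(predecessors).static_order())


# Hypothetical object table: X must be flushed before Y, and Y before Z.
object_table = {
    "X": set(),
    "Y": {"X"},
    "Z": {"Y"},
}
print(flush_order(object_table))
```

The successor field carries the same information in the opposite direction, so in this sketch it could be derived from the predecessor map rather than stored separately.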
Another aspect of this invention is to optimize the application write operation to avoid posting large amounts of data to the log record. Posting the values to be written is helpful in one sense because the cache manager need not be concerned about the sequence in which objects are flushed. However, the process is inefficient and costly in terms of computational resources.
The write optimization technique eliminates posting the write values to the log by substituting, for those values, an identity of the object from where the values originate and posting the identity instead of the values. While this reduces the amount of data to be logged, the write optimization technique introduces dependencies between objects, and often troubling "cycle" dependencies when the read optimization technique is also being used, which can require atomic and simultaneous flushing of multiple objects.
The cache manager tracks dependencies via the object table and is configured to recognize cycle dependencies. When a cycle dependency is realized, the cache manager initiates a blind write of one or more objects involved in the cycle to place the objects' values on the stable log. This process breaks the cycle. Thereafter, the cache manager flushes the objects according to an acyclic flushing sequence that pays attention to any predecessor objects that first require flushing.
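The cycle handling described above can be sketched in Python with `graphlib`, which raises `CycleError` when the dependencies are cyclic. The sketch makes a simplifying assumption that is not stated in the text: once an object's value has been placed on the log by a blind write, constraints requiring that object to be flushed before other objects are dropped. The function name, the `("blind_write", object, value)` record layout, and the choice of which object in the cycle to write are all hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def flush_order_breaking_cycles(predecessors, cache, stable_log):
    """Return an acyclic flushing sequence, logging blind writes to break cycles."""
    preds = {obj: set(before) for obj, before in predecessors.items()}
    while True:
        try:
            return list(TopologicalSorter(preds).static_order())
        except CycleError as err:
            cycle = err.args[1]  # nodes of the detected cycle
            victim = cycle[0]    # arbitrary choice for this sketch
            # Blind write: place the object's current value on the stable log.
            stable_log.append(("blind_write", victim, cache[victim]))
            # Assumption: the logged value removes the need to flush victim first.
            for before in preds.values():
                before.discard(victim)


deps = {"A": {"O"}, "O": {"A"}, "P": {"O"}}
values = {"A": 1, "O": 2, "P": 3}
log = []
print(flush_order_breaking_cycles(deps, values, log), log)
```

Each blind write removes at least one edge of a detected cycle, so the loop terminates; a production cache manager would presumably choose the object to write with more care than this sketch does.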
Still another aspect of this invention is to optimize the recovery procedures invoked following a system crash. During normal operation, each log record is assigned a log sequence number (LSN). The cache manager maintains a recovery log sequence number (rLSN) that identifies the first log record for an associated object at which to begin replaying the operations during recovery. The cache manager occasionally flushes an object to non-volatile memory to install the operations performed on the object. On some occasions, the flushing of one object installs operations that wrote another data object that has not yet been flushed (i.e., an object that is unexposed in the write graph, meaning that its contents are not needed for recovery). The cache manager advances the rLSN for both objects to identify subsequent log records that reflect the objects at states in which the operations that previously wrote the states are installed in the non-volatile memory.
During recovery, the recovery manager starts at the advanced rLSNs to avoid replaying operations that are rendered unnecessary by subsequent operations, thereby optimizing recovery.
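The effect of advancing an rLSN can be shown with a small Python sketch. It assumes a log of `(lsn, object, operation)` tuples and a map from each object to its rLSN; the function name `records_to_replay` and the rule that objects without an rLSN entry need no replay are assumptions of this example.

```python
def records_to_replay(stable_log, rlsn):
    """stable_log: list of (lsn, obj, op); rlsn: obj -> first LSN to replay for obj."""
    if not rlsn:
        return []
    scan_start = min(rlsn.values())  # the log scan can begin here
    replay = []
    for lsn, obj, op in stable_log:
        if lsn < scan_start:
            continue
        # Replay only records at or after the object's own rLSN.
        if obj in rlsn and lsn >= rlsn[obj]:
            replay.append((lsn, obj, op))
    return replay


log_records = [(1, "A", "a1"), (2, "B", "b1"), (3, "A", "a2"), (4, "B", "b2")]
before_advance = records_to_replay(log_records, {"A": 3, "B": 2})
after_advance = records_to_replay(log_records, {"A": 3, "B": 4})
print(len(before_advance), len(after_advance))
```

Advancing the rLSN for object B from 2 to 4 removes one record from the replay set in this example, which is the saving the optimization is after.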