1. Technical Field
The present invention relates to fault tolerant transaction-oriented data processing, and in particular to an efficient method and system for supporting recoverable resource updates within transactions such as in a transaction-oriented messaging system, file system or database system.
2. Description of the Related Art
Many business functions can be implemented by transaction processing using application-oriented computer programs. Most application-oriented programs need to access some form of computer system facilities (such as processors, databases, files, input/output devices and other application programs), which are generically known as resources. The system software which controls these resources is generically known as the resource manager. A common processing requirement is to be able to make a coordinated set of changes to one or more resources (and in particular to collections of data objects), such that either all of the changes take effect, and the resources are moved to a new consistent state, or none of them does. This property of all-or-none processing is known as "atomicity".
As pointed out by C. J. Date in "An Introduction to Data Base Systems", Vol. 1, 4th Edition, Addison-Wesley Publishing Co., 1986, Ch.18, a "transaction" is a logical unit of work referencing a sequence of associated operations that transforms a consistent state of one or more recoverable resources into another consistent state (without necessarily preserving consistency at all intermediate points). Transaction processing is the management of discrete units of work that access and update shared data.
It is known to provide fault-tolerant transaction processing systems which can maintain data consistency and transaction atomicity over system, media and transaction failures (an example of the latter being application-detected error conditions leading to the inability of a transaction to complete). To enable recovery of resources to a consistent state following a failure, it is necessary for the system to keep a record of the state of system resources at the time of the failure. This includes knowing which transactions had been completed, to enable the completed transactions to be performed again, and which transactions were in progress, to enable the operations within an uncompleted transaction to be undone. A transaction thus defines a unit of recovery as well as a unit of work.
It is frequently a processing requirement for resource updates to be made within a transaction without the delay of verifying, prior to making updates, whether the transaction can complete successfully. For atomicity and consistency in such systems, it is thus necessary to provide a backward recovery facility for when transactions fail to complete successfully--enabling all changes that have been made to resources during the partial execution to be removed.
The restoration of resources to the consistent state which existed before a transaction began is known as ROLLBACK (or synonymously as BACKOUT; a total ROLLBACK of a transaction being known as an ABORT) of the transaction. The changes are generally removed in the reverse of the chronological order in which they were originally made.
One example of a system which may employ transactional techniques is a messaging and queuing system described in the document "IBM Messaging and Queuing Series--Technical Reference" (SC33-0850-01, 1993). Messaging and queuing provides a number of facilities for high-level interprogram communication. It encompasses:
Messaging--a simple means of program-to-program communication that hides communication protocols.
Queuing--the deferred delivery of messages. This enables asynchronous communication between processes that may not be simultaneously active, or for which no data link is active. The messaging and queuing service can guarantee subsequent delivery to the target application.
Message-driven processing--the accomplishment of an application task by the flow of messages to a number of processes in a distributed system. The processes work together by accessing queued messages and generating new messages until the application task is completed.
Most applications need to access resources of one form or another, and a common requirement is to be able to make a co-ordinated set of changes to two or more resources. "Co-ordinated" means that either all of the changes made to the resources take effect, or none of them does.
Queues are no exception to this--applications need to be able to get and put messages (and possibly update other resources, such as databases), and know that either all of the operations take effect, or that none of them does. The set of operations involved in this is called a transaction or unit of work. The following example illustrates this:
In a financial application that transfers funds from one account to another at the same location, there are two basic operations that need to be carried out: the debit of one account, and the credit of the other. Normally both of the operations succeed, but if one fails, both must fail.
The failure might be for operational reasons (for example, one queue being temporarily unavailable), in which case the transaction can be presented again later. Alternatively, the failure might be because there are insufficient funds in the account to be debited, in which case a suitable response must be returned to the initiator of the transaction.
The debiting of one account and the crediting of the other constitute a unit of work.
A unit of work starts when the first recoverable resource is affected. For message queuing, this means when a message is got or put as part of a unit of work. The unit of work ends either when the application ends, or when the application declares a syncpoint (see below). If the work is ended by the application declaring a syncpoint, another unit of work can then start, so that one instance of an application can be involved with several sequential units of work.
Each get or put operation can separately participate in the current unit of work. The application chooses which operations participate by specifying the appropriate "syncpoint" or "no-syncpoint" option on MQGET, MQPUT and MQPUT1 calls. If neither option is specified, participation of the call within the current unit of work is determined by the environment.
The application ends a unit of work by declaring a syncpoint. When a syncpoint is declared, any party that has an interest in the unit of work can vote "no" and so cause the unit of work to be backed out; this has the effect of undoing all of the changes that were made as part of the unit of work. If all parties vote "yes", the unit of work is committed, and the changes that were made as part of the unit of work become permanent. Parties interested in the unit of work can be:
The application
The queue manager
Other resource managers
The application declares a syncpoint, and registers its vote, by issuing the appropriate environment-dependent call.
If a message is put as part of a unit of work, the message does not become generally available for retrieval by applications until that unit of work is committed successfully. The one exception to this is the application which put the message. It can retrieve the message from the destination queue as part of the original unit of work before that unit of work is committed; the destination queue must be a local queue for this to be possible.
If the destination queue belongs to a remote queue manager, the message is not available to be sent from the local queue manager until the unit of work is committed. This means that it is not possible to send a request message and receive the reply to that request as part of the same unit of work; the unit of work containing the request message must be committed before it is possible to receive the reply.
Any errors detected by the queue manager when the message is put are returned to the application immediately by means of the completion code and reason code parameters. Errors that can be detected in this way include:
Message too big for queue
Queue full
Put requests inhibited for queue
Failure to put the message does not affect the status of the unit of work (because that message is not part of the unit of work); the application can still commit or back out the unit of work, as required.
However, if a message that was put successfully as part of the unit of work causes an error when the application attempts to commit the unit of work, the unit of work is backed out.
If a message is retrieved as part of a unit of work, the message is removed from the queue and so is no longer available to other applications. However, the message is not discarded; it is retained by the queue manager until the unit of work is either committed or backed out.
If the unit of work is committed successfully, the queue manager then discards the message. However, if the unit of work is backed out the message is reinstated in the queue in its original position, and so becomes available once again to be browsed or retrieved by the same or another application.
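The put and get behaviour described above can be illustrated with a minimal in-memory sketch. It is not the MQI: the class `SketchQueue` and its method names are hypothetical, a single unit of work is assumed to be active at a time, and the exception that lets the putting application retrieve its own uncommitted message is omitted for brevity.

```python
from collections import deque


class SketchQueue:
    """Toy queue showing syncpoint visibility (hypothetical; not the MQI)."""

    def __init__(self):
        self.messages = deque()   # committed messages, generally available
        self.pending_puts = []    # put under syncpoint, not yet visible
        self.pending_gets = []    # got under syncpoint, retained until resolved

    def put(self, message, syncpoint=True):
        if syncpoint:
            self.pending_puts.append(message)
        else:
            self.messages.append(message)

    def get(self, syncpoint=True):
        message = self.messages.popleft()
        if syncpoint:
            # Removed from the queue but retained, not discarded.
            self.pending_gets.append(message)
        return message

    def commit(self):
        # Puts become generally available; retrieved messages are discarded.
        self.messages.extend(self.pending_puts)
        self.pending_puts.clear()
        self.pending_gets.clear()

    def backout(self):
        # Uncommitted puts vanish; retrieved messages return to the head
        # of the queue in their original relative order.
        self.pending_puts.clear()
        for message in reversed(self.pending_gets):
            self.messages.appendleft(message)
        self.pending_gets.clear()


q = SketchQueue()
q.put("m1", syncpoint=False)   # outside any unit of work
q.put("m2")                    # inside the unit of work, not yet visible
first = q.get()                # "m1" is removed but retained
q.backout()
state_after_backout = list(q.messages)
q.put("m2")
q.commit()
state_after_commit = list(q.messages)
```

After the backout only the reinstated message is on the queue; after the second unit of work commits, the message put within it becomes available as well.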
Units of work should not be confused with the property of messages known as persistence. Message persistence defines whether or not the message survives failures and restarts of the queue manager.
Units of work can be used to protect against failures of the application, or against failures of other resource managers operating within the same unit of work; in this context, "failures" can include program-declared failures as well as error situations. Message persistence, on the other hand, protects against failures of the queue manager.
Many applications that use units of work will also want to use persistent messages. However, there are some situations in which it may be beneficial to use one without the other. For example, if the application contains logic to recover after a failure of the queue manager, using units of work with nonpersistent messages gives a performance benefit in the normal, nonfailure case. This combination can also be used to ensure that a final (nonpersistent) message is not sent if the unit of work is backed out before it reaches the syncpoint.
Returning to the general discussion, to provide the information which is necessary for recovery from system and transaction failures, all actions performed on recoverable data are generally recorded in a recovery log. This log is a persistent store for variable length records, which can be written to at its end only, but can be read in any order. Typically, an UNDO log record and a REDO log record are written for each resource updating operation, the former indicating the old state of the resource and the latter indicating the new state of the resource. In the event of a failure, the progress state of a unit of work determines which records will be used for recovery: the UNDO log records will be read for transactions that were uncompleted at failure, to permit restoration of the system resources to the state that existed before the transaction began, whereas the REDO log records will be read for transactions that were completed, to return the resources to the state which existed after the transaction's updates had been made. Alternatively, the log may employ "transition logging", which requires only one log entry--the difference between the before and after states--for each resource update. Not all resources need to be made recoverable, and hence it is known to be able to define resources as either "persistent" (non-volatile) or "non-persistent" (volatile).
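As an illustration of how UNDO and REDO information is used after a failure, the following is a minimal sketch under simplifying assumptions: the log is an in-memory list, each update record carries both the old and the new value, and a transaction counts as completed if a commit record for it is present. The names `LogRecord` and `recover` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    txn: str
    kind: str            # "update" or "commit"
    key: str = ""
    old: object = None   # UNDO information: state before the update
    new: object = None   # REDO information: state after the update


def recover(log, resources):
    """Bring `resources` to a consistent state using the log."""
    committed = {r.txn for r in log if r.kind == "commit"}
    # REDO: reapply the updates of completed transactions, oldest first.
    for r in log:
        if r.kind == "update" and r.txn in committed:
            resources[r.key] = r.new
    # UNDO: remove the updates of uncompleted transactions, newest first.
    for r in reversed(log):
        if r.kind == "update" and r.txn not in committed:
            resources[r.key] = r.old
    return resources


log = [
    LogRecord("T1", "update", "a", old=100, new=70),
    LogRecord("T2", "update", "b", old=50, new=80),
    LogRecord("T1", "commit"),
]
# State found after the failure: T1's update was lost, T2's was written.
recovered = recover(log, {"a": 100, "b": 80})
```

Here T1 completed, so its update is reapplied; T2 was in progress, so its update is removed.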
In many of the known systems, log records are also read at successful completion of a transaction to determine which operations were performed within the transaction and so to determine which resource updates can now be confirmed as permanent: the resource updating application performs a COMMIT operation to confirm all updates in the successfully completed transaction.
If the contents of a resource (e.g., a database) were lost due to media failure, it would be possible to recreate the resource if all REDO log records since the resource was created were saved and available. However, to limit the amount of log information which must be read and processed on system restart following a failure (and therefore reduce the cost and improve the speed of recovery), a non-volatile copy of the resource may be made periodically (either at regular time intervals or after a predetermined amount of system activity) and saved, the log position at the time the copy is made being noted. This is known as taking a checkpoint. Then, if a failure occurs, the recovery log is processed from the noted position, the state of resources at the time of the most recent copy then serving as the initialising information from which resources are recovered. The REDO records from that point, representing all subsequent actions to the resource, are reprocessed against the saved copy of the resource.
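The checkpoint technique can be sketched as follows. This is an assumption-laden toy, not any particular system's implementation: the resource is a dictionary, the REDO log is an in-memory list, and the class `CheckpointedStore` and its methods are hypothetical names.

```python
import copy


class CheckpointedStore:
    """Toy resource with a REDO log and periodic checkpoints (hypothetical)."""

    def __init__(self):
        self.data = {}
        self.redo_log = []             # append-only (key, new_value) records
        self.checkpoint_copy = {}      # saved copy of the resource
        self.checkpoint_position = 0   # log position noted at checkpoint time

    def update(self, key, value):
        self.redo_log.append((key, value))   # log first, then apply
        self.data[key] = value

    def take_checkpoint(self):
        self.checkpoint_copy = copy.deepcopy(self.data)
        self.checkpoint_position = len(self.redo_log)

    def recover(self):
        """Rebuild the resource; return the number of records replayed."""
        restored = copy.deepcopy(self.checkpoint_copy)
        tail = self.redo_log[self.checkpoint_position:]
        for key, value in tail:
            restored[key] = value
        self.data = restored
        return len(tail)


store = CheckpointedStore()
store.update("a", 1)
store.update("b", 2)
store.take_checkpoint()
store.update("a", 3)
store.data = {}              # simulate loss of the resource contents
replayed = store.recover()
```

Only the one record written after the checkpoint has to be reprocessed, rather than all three records in the log.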
Methods to reduce the overhead of logging are described by C. Mohan et al. in "ARIES: A Transaction Recovery Method Supporting Fine Granularity Locking and Partial Rollbacks Using Write-Ahead Logging", IBM Research Report RJ6649 (Computer Science), Jan. 19, 1989. The ARIES recovery method keeps track of changes made to resources using a log. In addition to logging update activities performed during forward processing of transactions, log records are also written for resource changes performed during total or partial rollbacks of transactions, during both normal processing and restart processing. Partial rollbacks to an intermediate checkpoint within the transaction are supported.
The log records written for the backout of an operation are known as Compensating Log Records (CLR). In ARIES each CLR contains, in addition to a description of the compensating action, a pointer to that transaction's log record which precedes the one that the CLR compensates. This pointer allows determination of precisely how much of the transaction has not been undone so far. Since CLRs are available to describe what actions are available during undo, the undo action need not be the exact inverse of the action that is being compensated (i.e., logical undo is possible).
During restart following an abnormal termination of the transaction (e.g., after a system failure), the log is scanned, starting from the first record of the last complete checkpoint, up to the end of the log. During this first "analysis" pass, information about pages that were potentially more up to date in the buffers than in the permanent version of the data resource and transactions that were in progress at the time of the termination is gathered. Then updates that did not get written to nonvolatile storage before the termination are repeated for all transactions, including for those transactions that were in progress at the time of the crash.
This essentially re-establishes the state of resources as of the time of the crash, as far as the actions represented in the log as of the crash time are concerned. No logging is done of the updates redone during this REDO pass.
The next pass is the UNDO pass, during which the updates of all in-progress transactions are rolled back in reverse chronological order, in a single sweep of the log. For those transactions that were already rolling back at the time of the crash, only those actions which had not already been undone will be rolled back. This means that actions recorded in CLRs are never undone (i.e., CLRs are not compensated). This is possible because the updates of such transactions are redone during the REDO pass, and because the last CLR written for each transaction points to the next non-CLR record that is to be undone.
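The role of the CLR pointer during the UNDO pass can be shown with a small sketch. It simplifies heavily: it rolls back one transaction at a time rather than sweeping the log once for all transactions, it keeps the log in memory, and the record layout (`Record`, `prev_lsn`, `undo_next_lsn`) is an illustrative assumption rather than the layout used by ARIES itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    lsn: int
    txn: str
    kind: str                            # "update" or "clr"
    key: str
    old: object = None                   # before-image of an update
    new: object = None
    prev_lsn: Optional[int] = None       # previous record of the same txn
    undo_next_lsn: Optional[int] = None  # CLR only: next record to undo


def rollback(log, txn, resources):
    """Undo what remains of `txn`, appending one CLR per undone update."""
    by_lsn = {r.lsn: r for r in log}
    own = [r for r in log if r.txn == txn]
    if not own:
        return 0
    cursor = last_lsn = own[-1].lsn
    written = 0
    while cursor is not None:
        rec = by_lsn[cursor]
        if rec.kind == "clr":
            # Already compensated work is skipped, never undone again.
            cursor = rec.undo_next_lsn
            continue
        resources[rec.key] = rec.old
        clr = Record(lsn=log[-1].lsn + 1, txn=txn, kind="clr", key=rec.key,
                     new=rec.old, prev_lsn=last_lsn,
                     undo_next_lsn=rec.prev_lsn)
        log.append(clr)
        last_lsn = clr.lsn
        written += 1
        cursor = rec.prev_lsn
    return written


# T1 made three updates and had undone the last one before the crash.
log = [
    Record(1, "T1", "update", "x", old=1, new=2),
    Record(2, "T1", "update", "y", old=10, new=20, prev_lsn=1),
    Record(3, "T1", "update", "z", old=5, new=6, prev_lsn=2),
    Record(4, "T1", "clr", "z", new=5, prev_lsn=3, undo_next_lsn=2),
]
resources = {"x": 2, "y": 20, "z": 5}   # state after the REDO pass
clrs_written = rollback(log, "T1", resources)
```

Because the existing CLR points past the update to `z`, only the updates to `y` and `x` are undone, and two further CLRs are appended.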
ARIES does not require the forcing of modified pages to non-volatile storage during any of this processing. It is also possible to take checkpoints during recovery. No locks have to be acquired during transaction rollback, thereby preventing rolling back transactions from getting involved in deadlocks.
On some operating systems, applications are not provided with operating system privileges. In such cases selective scanning of log records and possibly also writing of logs to disk are relatively inefficient aspects of the transaction processing, and so BACKOUT operations will be inefficient if the known methods of logging and reading from the log are used to support data consistency. The same also applies to COMMIT operations in some of the known systems. However, in other systems processing is carried out with COMMIT presumed to occur at resolution of a transaction (see Mohan, Lindsay, Obermarck, "Transaction Management in the R* Distributed Database Management System", ACM Transactions on Database Systems, Vol. 11, No. 4, December 1986). Then log records need not be scanned for COMMIT processing, but only for the exceptional case of BACKOUT following a failure.
Increasingly, if efficient fault-tolerant data processing is to be supported on such operating systems, there is a need to provide recovery facilities which do not suffer from the inefficient log scanning that those operating systems impose.