In the field of transaction processing, transactions are typically short lived computations that have a well-defined beginning and end. Various protocols have been devised to ensure that all the participants in a transaction agree on how to terminate the transaction, most of them based on the so-called two-phase commit (2PC) protocol.
For instance, multiple computers and multiple processes may participate in the computation initiated when a clerk or travel agent enters an airline reservation into an airline reservation system. After all the necessary data records in the distributed airline reservation system have been created or updated and all the associated computations and input/output operations have been completed, the transaction terminates using a "commit" protocol that ensures that all the transaction's participants (i.e., the various computer processes working on the transaction) agree that the transaction has been successfully completed and that its results can be permanently stored. A similar set of events occurs when a bank teller enters a deposit or withdrawal at the teller's workstation. The duration of such transactions is typically very short: on the order of seconds, and possibly much shorter than a second.
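The agreement step described above can be illustrated with a minimal sketch of a two-phase commit, assuming an in-memory coordinator and no network or failure handling. The names Participant, Vote, and two_phase_commit are hypothetical and chosen only for illustration; they are not taken from any particular system.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Participant:
    """Hypothetical participant process; illustrative only."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "working"

    def prepare(self):
        # Phase 1: promise to commit if asked, or refuse.
        self.state = "prepared" if self.can_commit else "aborted"
        return Vote.YES if self.can_commit else Vote.NO

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if every participant committed, False if all aborted."""
    # Phase 1: the coordinator collects a vote from every participant.
    votes = [p.prepare() for p in participants]
    decision = all(v is Vote.YES for v in votes)
    # Phase 2: the coordinator tells every participant the single outcome.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision
```

In this sketch a single "no" vote aborts the transaction for every participant, which is the sense in which all participants agree on how the transaction terminates.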
This document is concerned with transactions and computations that have long durations. An example of such a computation is one which collects data from a large number of sources, and then integrates that data in some way. The data collection process involves numerous interactions with various pieces of hardware, and the duration of the computation may be extended, depending on the availability of all the required participating computers and other pieces of hardware. Another example of a long running computation might be the ongoing control process for forming various batches of parts in a steel mill. If the process of handling each batch of parts is considered to be a single computation, the duration of that computation will be dictated by the duration of the steel mill's physical processing steps.
In all transaction processing systems, for both short and long lived computations, an important consideration is recovering from system failures. It is essential in all modern transaction processing systems to be able to automatically recover from virtually any system failure once the system is brought back on line. This means that the system must store sufficient data to determine what its state was just prior to the system failure, and to re-initiate processing of all interrupted transactions with as little backtracking as possible.
In most transaction processing systems, system recovery is implemented by restarting each interrupted transaction from its beginning. Log records are stored at the beginning and end of each such transaction, enabling a system failure recovery routine to determine which transactions have been completed and which were in mid-process when a system failure occurred. This solution is not suitable for systems handling long running computations, since that recovery method would require redoing much valuable work.
An additional problem that distinguishes long running transactions from short lived transactions is that of keeping sufficient records concerning the status of each transaction. For short lived transactions, it is generally sufficient to generate and store log records (A) marking the beginning of each transaction and recording sufficient data to restart that transaction, (B) recording changes made to various data structures so that those changes can be reversed if necessary, and (C) marking the conclusion of the transaction once the results of the transaction have been permanently stored. For long running transactions, backing up the system to undo all the work performed by the transaction up to the point of a system failure will typically be much more involved and in some cases may be virtually impossible.
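The restart-based recovery scheme for short lived transactions can be sketched as follows. This is a minimal illustration, assuming the log is an in-memory list of tuples and the data store is a dictionary; the record layouts ("BEGIN", "UPDATE", "END") and the function name recover are hypothetical stand-ins for the log records (A), (B), and (C) described above.

```python
def recover(log, store):
    """Undo interrupted transactions and return the data needed to restart them.

    Assumed record layouts (illustrative only):
      ("BEGIN", txid, restart_data)   # record type (A)
      ("UPDATE", txid, key, old)      # record type (B): value before the change
      ("END", txid)                   # record type (C)
    """
    began = {}
    ended = set()
    for rec in log:
        kind, txid = rec[0], rec[1]
        if kind == "BEGIN":
            began[txid] = rec[2]
        elif kind == "END":
            ended.add(txid)

    # A transaction with a BEGIN record but no END record was in mid-process.
    interrupted = {t: d for t, d in began.items() if t not in ended}

    # Reverse the changes of interrupted transactions, newest change first.
    for rec in reversed(log):
        if rec[0] == "UPDATE" and rec[1] in interrupted:
            _, _, key, old_value = rec
            store[key] = old_value

    return interrupted
```

The caller would then restart each returned transaction from its beginning using the stored restart data. For a long running computation, that restart is exactly the step that discards valuable work.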
Another problem associated with long lived transactions concerns the use of data interlock mechanisms. In order to prevent two different transactions or computations from accessing and making inconsistent changes to a record in a database or to any other specified object, most multitasking computer systems provide interlock mechanisms that allow one transaction to have exclusive use of a specified object until the transaction either completes or explicitly releases its lock on the object. In most cases, a transaction maintains a lock on each object used by the transaction until either the transaction commits and its results are permanently stored, or the transaction aborts and any interim changes are reversed. The problem associated with long lived transactions is that locking the objects used by each transaction for a long period of time can result in system deadlock, where many transactions are unable to proceed because other long lived transactions have locks on objects needed by the blocked transactions. Clearly, the extent of the deadlock problem is related to the average number of objects used by each transaction and the average amount of overlap between transactions as to the objects used by those transactions. Nevertheless, the time duration of long lived transactions greatly increases the chances that transactions competing for resources will be delayed for significant periods of time.
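The interlock behavior described above can be sketched with a simple exclusive lock table. This is a minimal, single-threaded illustration; the class LockTable and its methods try_lock and release_all are hypothetical names, and a real system would add waiting queues and deadlock detection.

```python
class LockTable:
    """Hypothetical table of exclusive locks keyed by object identifier."""

    def __init__(self):
        self._owner = {}  # object id -> id of the transaction holding the lock

    def try_lock(self, txid, obj):
        """Grant the lock if the object is free or already held by txid."""
        holder = self._owner.get(obj)
        if holder is None or holder == txid:
            self._owner[obj] = txid
            return True
        return False  # another transaction holds the lock; caller must wait

    def release_all(self, txid):
        """Release every lock held by txid, as done on commit or abort."""
        for obj in [o for o, t in self._owner.items() if t == txid]:
            del self._owner[obj]
```

For example, if a long lived transaction locks a record and holds it until it commits, a short lived transaction calling try_lock on the same record is refused for that entire period, however brief its own work would be.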
A further problem, which arises for long lived transactions but not for short lived transactions, concerns tracking those transactions. For short lived transactions, it is generally sufficient to know that each transaction is in one of four states: in process, in process but blocked from proceeding because a required resource is not available, aborted, or completed. For long lived transactions, however, it is important to monitor the status of each transaction at a much greater level of detail.
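The difference in tracking granularity can be illustrated with a small sketch. The four coarse states come from the text above; the per-step record is only one assumed way of providing finer detail, and the names Status, LongTransaction, record_step, and summary are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    # The four coarse states that suffice for short lived transactions.
    IN_PROCESS = "in process"
    BLOCKED = "blocked"
    ABORTED = "aborted"
    COMPLETED = "completed"


@dataclass
class LongTransaction:
    """Hypothetical tracker that also records the status of each step."""

    txid: str
    status: Status = Status.IN_PROCESS
    steps: dict = field(default_factory=dict)  # step name -> Status

    def record_step(self, step, status):
        self.steps[step] = status

    def summary(self):
        done = sum(1 for s in self.steps.values() if s is Status.COMPLETED)
        return (f"{self.txid}: {self.status.value}, "
                f"{done}/{len(self.steps)} steps completed")
```

A monitor that sees only the coarse state would report "in process" for hours; the per-step record in this sketch additionally shows how far the computation has progressed.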
In summary, problems that distinguish long lived transactions from short lived transactions are recovering interrupted transactions, deadlocks caused by data interlocks, and the need to be able to track or monitor the status of transactions that are in process.