A transaction can be defined as a set of actions on a set of resources or some subset thereof, said actions including changes to those resources. The initial state of a set of resources that will be changed by a transaction is defined as being consistent, and so either implicitly or explicitly satisfies a set of consistency conditions (a.k.a. constraints or integrity rules). Each particular transaction includes one or more operations that may alter the resources (e.g., addition, subtraction, selection, exchange, or transformation). Once defined, the transaction creates a delimitable set of changes from initial conditions. Each change to the resources (short of the final change) creates an intermediate state of those resources, which often is not intended to be accessible to other transactions.
Under such an implementation, each transaction operates on the set of resources in an initial state and, after any operations performed by the transaction, leaves the set of resources in a final state. Thus a transaction may be viewed as a means of transforming a set of resources from an initial consistent state to a final consistent state (possibly, but generally not, the same as the initial state).
Transaction processing is subject to multiple difficulties. A transaction may use resources inefficiently. Transactions may fail to complete operations as designed. Errors may cause the final state to be inconsistent. Transactions may execute too slowly. Such difficulties can be handled manually if the environment is simple enough. Automated or semi-automated means (as supplied, for example, by a transaction management facility) are required in more sophisticated situations.
An environment in which transactions operate is often subject to a transaction management facility, often referred to simply as a “transaction manager.” The responsibility of a transaction manager is to ensure the initial and final states are consistent and that no harmful side effects occur in the event that concurrent transactions share resources (isolation). A transaction manager typically enforces the isolation of a specific transaction using a default concurrency control mechanism (e.g., pessimistic or optimistic). If a condition such as an error occurs before the final state is reached, it is often the responsibility of a transaction management facility to return the system to the initial state. This sort of automated transaction processing lies behind the greatest volume of financial and commercial transactions extant in modern society.
Automated transaction processing, both with and without transaction management facilities, has been designed traditionally with an unspoken assumption that errors are exceptional. The programming, both its design and coding, focuses on implementing transactions in a near-perfect world where it is permissible to simply start over and redo the work if anything goes wrong. Even if this were to model accurately the majority of automated commercial transactions, it would not reflect the entirety of any business's real world experience. In the real world, eighty percent or more of the management effort and expertise is about handling exceptions, mistakes, and imperfections. In automated transaction processing, error recovery mechanisms are usually seen as an afterthought, a final ‘check-box’ on the list of features and transactions that can be handled (if all goes perfectly).
A naïve approach to the implementation of complex automated transaction processing systems maintains that the system resulting from integrating (via transactional messaging) a set of applications that already have error recovery mechanisms will itself recover from errors. Experience and careful analysis have shown that nothing could be further from the truth. As more and more business functions are integrated, the problems of automated error recovery become increasingly important and complex. Errors can propagate just as rapidly as correct results, but the consequences can be devastating.
As more and more business functions are integrated, the problems of automated error recovery and resource management become increasingly important. It's only natural that many of the systems that a business automates first are deemed by that business to enable the execution of its core competencies, whose completion is ‘mission critical’. Automation demands the reliability we associate with transaction management if error recovery is to be robust. With each success at automating a particular business transaction, the value of connecting and integrating disparate automated transactions increases. Separate transactions, each of them simple, when connected become a complex transaction. With each integrative step, the need for acceptable error recovery becomes ever more important.
Traditional approaches to automated transaction management emphasize means to guarantee the fundamental properties of a properly defined or ‘formal’ transaction, which are atomicity, consistency, isolation, and durability. These properties are usually referred to by their acronym, ACID. Transactions, especially if complex, may share access to resources only under circumstances that do not violate these properties, although the degree to which transaction management facilities strictly enforce the isolation property is often at the discretion of the user.
It is not uncommon to refer to any group of operations on a set of resources (i.e., a unit of work) as a transaction, even if they do not completely preserve the ACID properties. In keeping with this practice, we will use the term transaction without a qualifying adjective or other modifier when referring to a unit of work of any kind, whether formal or not. We will use the qualified term pseudo-transaction when we want to refer specifically to a unit of work that does not preserve all of the ACID properties, although it may preserve some of them. Pseudo-transactions exist for a variety of reasons including the difficulty of proper transaction design and enforcement, incomplete knowledge of consistency rules, attempts to increase concurrency at the expense of decreased isolation, attempts to increase performance at the expense of atomicity, and so on.
The ACID properties lead to a very specific behavior when one or more of the elements that compose a transaction fail in a manner that cannot be transparently recovered (a so-called “unrecoverable error”): the atomicity property demands that the state of the resources involved be restored so that it is as though no changes whatsoever had been made by the transaction. Thus, an unrecoverable error always results in transitioning to the initial state (i.e., the initial state being restored), the typical process for achieving this being known as “rollback.” An alternative method of restoring the initial state is to run an “undo” or “inverse” transformation known as a compensating transaction (discussed in more detail below). This of course presumes that, for every error, it is possible first to identify the class of error, then to select the most suitable compensating transaction, and finally to implement that compensating transaction. A problem with the current approach to enforcing atomicity is that viable work is often wasted when the initial state is recovered. A second problem is that transactions dependent on a failed transaction cannot begin until the failed transaction is resubmitted and finally completes, thereby possibly resulting in excessive processing times and perhaps ultimately causing a failure to achieve the intended business purpose.
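The rollback behavior described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the resources are held in an in-memory dictionary and that the initial state is preserved by copying it; the names run_atomically, debit, credit, and InsufficientFunds are hypothetical and do not come from any particular transaction manager.

```python
import copy

class InsufficientFunds(Exception):
    """Hypothetical unrecoverable error raised part-way through a transaction."""

def run_atomically(resources, operations):
    """Apply operations in order; restore the initial state if any of them fails."""
    snapshot = copy.deepcopy(resources)      # record the initial state
    try:
        for op in operations:
            op(resources)                    # each step creates an intermediate state
        return True                          # final state reached
    except Exception:
        resources.clear()
        resources.update(snapshot)           # rollback: as though no change was made
        return False

def debit(account, amount):
    def op(res):
        if res[account] < amount:
            raise InsufficientFunds(account)
        res[account] -= amount
    return op

def credit(account, amount):
    def op(res):
        res[account] += amount
    return op

accounts = {"a": 100, "b": 0}
# The credit succeeds, the debit then fails, and the valid credit is discarded too.
ok = run_atomically(accounts, [credit("b", 150), debit("a", 150)])
```

In this sketch the successful credit is thrown away along with the failed debit, which is the wasted-work problem noted above.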
The consistency property guarantees the correctness of transactions by enforcing a set of consistency conditions on the final state of every transaction. Consistency conditions are usually computable, which means that a software test is often executed to determine whether or not a particular consistency condition is satisfied in the current state. Thus, a correctly written transaction becomes one which, when applied to resources in a first consistent state, transforms those resources into a second (possibly identical) consistent state. Intermediate states, created as the component operations of a transaction are applied to resources, may or may not satisfy a set of consistency conditions and so may or may not be a consistent state. A problem with this approach is that consistency must be either cumulative during the transaction, or else enforced at transaction completion. In most cases, transactions are assumed to be written correctly and the completion of a transaction is simply assumed to be sufficient to ensure a consistent state. This leads to a further problem: the interactions among a collection of transactions that constitute a complex transaction may not result in a consistent state unless all consistency rules are enforced automatically at transaction completion.
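As a hedged illustration of enforcing computable consistency conditions at transaction completion, the following sketch applies the operations to a working copy and publishes the result only if every condition holds. The function names and the two example conditions are assumptions made for this sketch.

```python
def add(account, delta):
    """Return an operation that adjusts one balance by delta."""
    def op(balances):
        balances[account] += delta
    return op

def commit_if_consistent(resources, operations, conditions):
    """Apply operations to a working copy; publish it only if every condition holds."""
    working = dict(resources)
    for op in operations:
        op(working)                      # intermediate states are not tested
    violated = [name for name, check in conditions.items() if not check(working)]
    if not violated:
        resources.clear()
        resources.update(working)        # the final state becomes visible
    return violated                      # empty list means the commit succeeded

# Hypothetical consistency conditions for a two-account system.
conditions = {
    "total conserved": lambda b: sum(b.values()) == 100,
    "no negative balance": lambda b: all(v >= 0 for v in b.values()),
}

balances = {"a": 100, "b": 0}
first = commit_if_consistent(balances, [add("a", -40), add("b", 40)], conditions)
second = commit_if_consistent(balances, [add("a", -100), add("b", 90)], conditions)
```

In the first call the intermediate state (after the debit but before the credit) violates the conservation rule, yet the final state satisfies it, so the transaction completes. The second call is rejected because its final state violates both conditions.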
For complex transactions that share resources, the isolation property further demands that concurrent or dependent transactions behave as though they were run in isolation (or were independent): that is, no other transaction can have seen any intermediate changes (there are no “side effects”) because these might be inconsistent. The usual approach to ensuring the isolation property is to lock any resource that is touched by the transaction, thereby ensuring that other transactions cannot modify any such resource (a share lock) and cannot access modified resources (an exclusive lock). With regard to resource management, locking is used to implement a form of dynamic scheduling. The most commonly used means for ensuring this is implementing the rule known as “two-phase locking” wherein, while a transaction is processing, locks on resources accessed by that transaction are acquired during phase one and are released only during phase two, with no overlap in these phases. Such an implementation guarantees that concurrent or dependent transactions can be interleaved while preserving the isolation property. A problem with this approach is that it necessarily increases the processing time of concurrent transactions that need to access the same resources, since once a resource is locked, it may not be modified by any other transaction until the locking transaction has completed. Another problem due to this approach is that it occasionally creates a deadly embrace or deadlock condition among a group of transactions. In the simplest case of the group consisting of only two transactions, each of the two transactions waits indefinitely for a resource locked by the other. Deadlock conditions can arise in complex ways among groups of more than two transactions.
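The two-phase locking rule, and the simplest deadlock it can produce, can be sketched as follows. This is a simplified illustration that assumes exclusive locks only and a shared in-memory lock table; the class name TwoPhaseTransaction and its methods are hypothetical.

```python
class TwoPhaseTransaction:
    """Exclusive-lock sketch: acquire in phase one, release only in phase two."""

    def __init__(self, name, lock_table):
        self.name = name
        self.lock_table = lock_table     # shared dict: resource -> holder name
        self.held = set()
        self.releasing = False           # becomes True once phase two begins

    def acquire(self, resource):
        if self.releasing:
            raise RuntimeError("two-phase rule violated: acquire after release")
        holder = self.lock_table.get(resource)
        if holder not in (None, self.name):
            return False                 # locked by another transaction: must wait
        self.lock_table[resource] = self.name
        self.held.add(resource)
        return True

    def release_all(self):
        self.releasing = True            # phase two: no further acquisitions allowed
        for resource in self.held:
            del self.lock_table[resource]
        self.held.clear()

locks = {}
t1 = TwoPhaseTransaction("T1", locks)
t2 = TwoPhaseTransaction("T2", locks)
t1.acquire("x")
t2.acquire("y")
t1_blocked = not t1.acquire("y")         # T1 waits for T2
t2_blocked = not t2.acquire("x")         # T2 waits for T1: a deadlock
```

Both transactions end up waiting on a resource held by the other, and neither can proceed until some outside agent forces one of them to release its locks.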
Other approaches to maintaining the isolation property include optimistic concurrency (such as time stamping) and lock or conflict avoidance (such as static scheduling via transaction classes or conflict graphs, nested transactions, and multi-versioning). Various caching schemes have been designed to improve concurrency by minimizing the time required to access a resource, while respecting a particular approach to enforcing the isolation property. Each of the existing approaches to enforcing isolation, and the associated techniques and implications for resource management, fails to meet the needs imposed by complex, possibly distributed, business transactions.
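Optimistic concurrency of the kind mentioned above can be sketched with a per-resource version counter standing in for a time stamp. This is an assumption-laden illustration rather than a description of any particular product; VersionedStore and its methods are hypothetical names.

```python
class VersionedStore:
    """Each resource carries a version number that acts as a logical time stamp."""

    def __init__(self):
        self.data = {}                   # key -> (value, version)

    def read(self, key):
        return self.data.get(key, (None, 0))

    def write_if_unchanged(self, key, new_value, version_read):
        """Write only if nobody else has written since version_read was observed."""
        _, current = self.read(key)
        if current != version_read:
            return False                 # conflict detected: caller must retry
        self.data[key] = (new_value, current + 1)
        return True

store = VersionedStore()
store.write_if_unchanged("stock", 10, 0)

# Two transactions read the same version without taking any lock.
value_a, version_a = store.read("stock")
value_b, version_b = store.read("stock")

a_committed = store.write_if_unchanged("stock", value_a - 1, version_a)
b_committed = store.write_if_unchanged("stock", value_b - 1, version_b)
```

No resource is ever locked, but the second writer discovers the conflict only at write time and must redo its work.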
If no error occurs, the completion of the transaction guarantees not only a consistent state, but also a durable one (the durability property) through a process known as “commit.” The step in a transaction at which a “commit” is processed is known as the commit point. The durability property is intended to guarantee that the specific result of a completed transaction can be recovered at a later time, and cannot be repudiated. Ordinarily, the durability property is interpreted as meaning that the final state of resources accessed by a transaction is, in effect, recorded in non-volatile storage before confirming the successful completion of the transaction. Usually, this is done by recording some combination of resource states, along with the operations that have been applied to the resources in question. The software that handles this recording is called a resource manager.
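One common way of recording results before confirming completion is an append-only log. The sketch below assumes a JSON-lines log file and uses the operating system's flush-to-disk request (os.fsync) before reporting success; the functions durable_commit and recover, and the log format, are hypothetical.

```python
import json
import os
import tempfile

def durable_commit(log_path, txn_id, changes):
    """Append the transaction's changes to the log and force them to storage."""
    record = json.dumps({"txn": txn_id, "changes": changes})
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(record + "\n")
        log.flush()                      # push Python's buffer to the OS
        os.fsync(log.fileno())           # ask the OS to write through to the device
    return True                          # only now is completion confirmed

def recover(log_path):
    """Rebuild the resource state by replaying every logged transaction in order."""
    state = {}
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            state.update(json.loads(line)["changes"])
    return state

log_path = os.path.join(tempfile.mkdtemp(), "txn.log")
durable_commit(log_path, 1, {"a": 60})
durable_commit(log_path, 2, {"a": 50, "b": 40})
recovered = recover(log_path)
```

Replaying the log after a restart reproduces the state left by the last confirmed transaction.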
A variant of the commit point, in which a user (possibly via program code) asserts to the transaction manager that they wish to make the then current state recoverable and may subsequently wish to roll back work to that known state, is known as a savepoint. Because savepoints are arbitrarily defined, they need not represent a consistent state. Furthermore, the system will return to a specific savepoint only at the explicit request of the user. Typically, savepoints are not durable. Savepoints cannot be asserted automatically by the system except in the most rudimentary fashion as, for example, after every operation or periodically based on elapsed time or quantity of resources used. None of these approaches enables the system to determine to which savepoint it should roll back.
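Savepoint behavior can be sketched as follows, assuming an in-memory session in which each named savepoint is a copy of the current state. The class SavepointSession is hypothetical; discarding later savepoints on rollback mirrors common database behavior but is an assumption of this sketch.

```python
class SavepointSession:
    """Named savepoints held in memory only, so they are not durable."""

    def __init__(self, state):
        self.state = dict(state)
        self.savepoints = {}             # name -> copy of state, in creation order

    def savepoint(self, name):
        self.savepoints.pop(name, None)  # re-declaring a name moves it to the end
        self.savepoints[name] = dict(self.state)

    def rollback_to(self, name):
        """Return to a savepoint only when the user explicitly names it."""
        names = list(self.savepoints)
        position = names.index(name)
        self.state = dict(self.savepoints[name])
        for later in names[position + 1:]:
            del self.savepoints[later]   # savepoints taken afterwards are discarded

session = SavepointSession({"qty": 10})
session.state["qty"] = 7
session.savepoint("after_pick")
session.state["qty"] = -1                # an inconsistent state can be saved too
session.savepoint("after_error")
session.rollback_to("after_pick")
```

Nothing in the session knows that "after_error" captured an inconsistent state or that "after_pick" is the right target; the choice rests entirely with the caller.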
When the elements of a transaction are executed (whether concurrent or sequential) under multiple, independent resource managers, the rollback and commit processes can be coordinated so that the collection behaves as though it were a single transaction. In essence, the elements are implemented as transactions in their own right, but are logically coupled to maintain ACID properties to the desired degree for the collection overall. Such transactions are called distributed transactions. The usual method for achieving this coordination is called two-phase commit. Unfortunately, this is an inefficient process which tends to reduce concurrency and performance, and cannot guarantee coordination under all failure conditions. Under certain circumstances, a system failure during two-phase commit can result in a state that is incorrect and that then requires difficult, costly, and time-consuming manual correction during which the system is likely to be unavailable. As with single transactions, compensating transactions can sometimes be used to restore the initial state of a collection of logically coupled transactions. In such cases, it may be necessary to run special compensating transactions that apply to the entire collection of transactions (known as a compensation sphere whether or not the collection is a distributed transaction).
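The coordination performed by two-phase commit can be sketched as a voting phase followed by a decision phase. The sketch deliberately omits timeouts, logging, and coordinator failure, which are the conditions under which the protocol cannot guarantee coordination; the Participant class and two_phase_commit function are hypothetical.

```python
class Participant:
    """Stands in for one independent resource manager."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        """Phase one: vote on whether this participant is able to commit."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # phase one: collect every vote
    if all(votes):
        for p in participants:                    # phase two: unanimous yes
            p.commit()
        return "committed"
    for p in participants:                        # phase two: any no aborts all
        p.rollback()
    return "aborted"

good = [Participant("orders"), Participant("billing")]
good_result = two_phase_commit(good)

mixed = [Participant("orders"), Participant("billing", can_commit=False)]
mixed_result = two_phase_commit(mixed)
```

Between the two phases every prepared participant must hold its resources until the decision arrives, which is one source of the concurrency penalty noted above.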
There are numerous optimizations and variations on these techniques, including split transactions, nested transactions, and the like. In practice, all these approaches have several disadvantages (and differ from the present invention):
poor concurrency due to locking is common;
the cost of rollback, followed by redoing the transaction, can be excessive;
the conditions of consistency, isolation, and durability are tightly bound together;
logically dependent transactions must either (a) be run sequentially with the possibility that an intervening transaction will alter the final state of the first transaction before the second transaction can take over, or (b) be run together as a distributed transaction, thereby locking resources for a much longer time and introducing two-phase commit performance and concurrency penalties;
there is significant overhead in memory and processing costs on already complex transactions;
the errors which are encountered and identified are not recorded (which can complicate systematic improvement of a system);
it is often undesirable in a business scenario to return a set of resources to some prior state, especially when a partially ordered set of interdependent transactions (i.e., a business process) has been run;
it is not always possible to define a compensating transaction for a given transaction, and the best compensating transaction often depends on context;
business transactions may result in very long times from start to completion, and may involve many logically coupled transactions, possibly each running under separate transaction or resource managers; and, finally,
the transaction manager will not be able to compensate for or recover from certain context-dependent, external actions that affect resources external to the resource manager.
Transactions can be classified broadly into three types, with corresponding qualifiers or adjectives: physical, logical, and business. A physical transaction is a unit of recovery; that is, a group of related operations on a set of resources that can be recovered to an initial state as a unit. The beginning (and end) of a physical transaction is thus a point of recovery. A physical transaction should have the atomicity and durability properties. A logical transaction is a unit of consistency; that is, a group of related operations on a set of resources that together meet a set of consistency conditions and consisting of one or more coordinated physical transactions. The beginning (and end) of a logical transaction is a point of consistency. In principle, logical transactions should have the ACID properties. A business transaction is a unit of audit; that is, a group of related operations on a set of resources that together result in an auditable change and consisting of one or more coordinated transactions. If, as is the ideal construction, each of these component transactions is a logical transaction, business transactions combine to form a predictable, well-behaved system. The beginning and end of a business transaction are thus audit points, by which we mean that an auditor can verify the transaction's identity and execution. Audit information obtained might include the operations performed, in what order (to the degree it matters), by whom, when, with what resources, precisely which possible decision alternatives were taken in compliance with which rules, and confirmation that the audit system was not circumvented. Business transactions can be composed of other business transactions. Time spans of a business transaction can be as short as microseconds or span decades (e.g., life insurance premium payments and eventual disbursement which must meet the consistency conditions imposed by law and policy).
The efficiency, correctness, and auditability of automated business transactions have a tremendous influence on a business's profitability. As transaction complexity increases, the impact of inefficiencies and errors increases combinatorially.
There are at least four general classes of ways that transactions can be complex. First, a transaction may involve a great deal of detail in its definition, each step of which may be either complex or simple, and may inherently require considerable time to process. Even if each individual step or operation is simple, the totality of the transaction may exceed the average human capacity to understand it in detail: for example, adding the total sum of money paid to a business on a given day, when the number of inputs is in the millions. This sort of complexity is inherently addressed (to the degree possible) by automation, and by following the well-known principles of good transaction design.
Second, a transaction may be distributed amongst multiple, separate environments, each such environment handling a sub-set of the total transaction. The set of resources may be divisible or necessarily shared, just as the processing may be either sequential or concurrent, and may be dependent or independent. Distributed transactions inherently impose complexity in maintaining the ACID properties and on error recovery.
Third, a transaction may be composed of multiple, linked transactions: for example, adding all of the monies paid in together, adding all of the monies paid out together, and summing the two, to establish a daily net cashflow balance for a company. Such joined transactions may include as a sub-transaction any of the three complex transactions (including other joined transactions, in recursive iteration). And, of course, linked transactions may then be further joined, theoretically ad infinitum. Each sub-transaction is addressed as its own transaction, and thus is handled using the same means and definitiveness. Linked transactions can become extremely complex due to the many ways they can be interdependent, thus making their design, maintenance, and error management costly and their use risky. Tremendous care must be taken to keep complexity under control.
Fourth, and last, a transaction may run concurrently in a mix of transactions (physical, logical, business, and pseudo). As the number of concurrent transactions, the number of inter-dependencies, or the speed of processing increases, or as the available resources decrease, the behavior of the transaction becomes more complex. Transaction managers, careful transaction design, and workload scheduling to avoid concurrency are among the methods that are used to manage this type of complexity, but they provide only limited relief. Part of the problem is that the group behavior of the mix becomes increasingly unpredictable, and therefore unmanageable, with increasing complexity.
A business process may be understood as consisting of a set of partially-ordered inter-dependent or linked transactions (physical, logical, business, and pseudo), sometimes relatively simple and sometimes enormously complex, itself implementing a business transaction. The flow of a business process may branch or merge, can involve concurrent activities or transactions, and can involve either synchronous or asynchronous flows. Automated business process management is rapidly becoming the principal means of enabling business integration and business-to-business exchanges (e.g., supply chains and trading hubs).
Knowledge of both the internal logical structure of transactions and the interrelationships among a group of transactions is often represented in terms of an interconnected set of dependencies. Two types of dependency are important here: semantic and resource. If completion of an operation (or transaction) A is a necessary condition for the correct completion of some operation (or transaction) B, B is said to have a semantic dependency on A. If completion of an operation (or transaction) T requires some resource R, transaction T is said to have a resource dependency on the resource R. Resource dependencies become extremely important to the efficiency of transaction processing, especially if the resource cannot be shared (that is, if a principle of mutual exclusion is either inherent or enforced). In such cases, transactions (or operations) that depend on the resource become serialized on that resource, and thus transactions that require the resource depend on (and wait for) the completion of the transaction that holds the resource.
Dependencies are generally depicted via a directed graph, in which the nodes represent either transactions or resources and arrows represent the dependency relationship. The graph that represents transactions that wait for some resource held by another transaction, for example, is called a “wait graph.” Dependency graphs may be as simple as a dependency chain or even a dependency tree, or may be a very complex, non-flat network.
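A wait graph lends itself to a simple illustration: a cycle in the graph corresponds to a deadlock among the transactions on that cycle. The sketch below assumes the graph is given as a dictionary mapping each transaction to the transactions it waits for; the function name find_deadlock is hypothetical.

```python
def find_deadlock(wait_graph):
    """Return the transactions on one cycle of the wait graph, or None if acyclic."""
    path = []                            # transactions on the current search path
    finished = set()                     # transactions already proven cycle-free

    def visit(node):
        if node in finished:
            return None
        if node in path:
            return path[path.index(node):]    # the cycle closes back on this node
        path.append(node)
        for waited_for in wait_graph.get(node, ()):
            cycle = visit(waited_for)
            if cycle:
                return cycle
        path.pop()
        finished.add(node)
        return None

    for start in wait_graph:
        cycle = visit(start)
        if cycle:
            return cycle
    return None

chain = {"T1": ["T2"], "T2": ["T3"], "T3": []}         # a simple dependency chain
embrace = {"T1": ["T2"], "T2": ["T1"]}                  # two-transaction deadlock
ring = {"T0": ["T1"], "T1": ["T2"], "T2": ["T3"], "T3": ["T1"]}
```

A dependency chain or tree yields no cycle, whereas the two-transaction deadly embrace and the longer ring each yield the list of transactions that wait on one another.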
The value of successfully managing complexity through automated means grows as the transactions being managed become more complex, as this uses computerization's principal strength: the capacity for managing tremendous amounts of detail, detail that would certainly overwhelm any single human worker, and threaten to overwhelm a human organization not equipped with computer tools.
Unfortunately, the cost of any error that may propagate, for example, down a dependency chain of simple transactions, or affect a net of distributed transactions, also increases. Moreover, the cost of identifying possible sources of error increases as the contextual background for a complex transaction broadens, as all elements, assumptions, and consequences of particular transition states that may be visited while the transaction is processing must be examined for error. One certainty is that the law of unintended consequences operates with harsh and potentially devastating impact on program designers and users who blithely assume that their processes will always operate exactly as they are intended, rather than exactly according to what they are told (and sometimes more telling, not told) to do.
Error-handling for complex transactions currently operates with a bias towards rescinding a flawed transaction and restoring the original starting state. Under this approach, only when a transaction has successfully and correctly completed is the computer program granted permission to commit itself to the results and permanently accept them. If an error occurs, then the transaction is rolled back to the starting point and the data and control restored. This “either commit or rollback” approach imposes a heavy overhead load on complex transaction processing. If the complex transaction is composed of a chain of single, simpler transactions, then the entire chain must be rolled back to the designated prior commit point. All of the work done between the prior commit point and the error is discarded, even though it may have been valid and correct. If the complex transaction is a distributed one, then all resources used or affected by the transaction must be tracked and blocked from other uses until a transaction has successfully attained the next commit point; and when a single part of the entire distributed transaction encounters an error, all parts (and the resources used) must be restored to the values established at the prior commit point. Again, the work that has been successfully performed, even that which is not affected by the error, must be discarded. With linked transactions or any mix involving possibly interdependent pseudo-transactions, no general solution to the problem of automated error recovery has heretofore been presented.
Furthermore, the standard approach treats all transactional operations as identical. Operations, however, differ as to their reversibility, particularly in computer operations. Addition of zero may be reversible by subtracting zero. But multiplication by zero, even though the result is boring, is not exactly reversible by division by zero. Non-commutable transactions are not differentiated from commutable ones, nor do they have more stringent controls placed around their inputs and operation.
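The difference in reversibility can be made concrete with a short sketch in which each operation is paired with its inverse when one exists. The helper names make_add, make_multiply, and undo are hypothetical.

```python
def make_add(k):
    """Return (operation, inverse) for adding k; addition is always reversible."""
    return (lambda x: x + k), (lambda y: y - k)

def make_multiply(k):
    """Return (operation, inverse) for multiplying by k; no inverse when k is zero."""
    forward = lambda x: x * k
    if k == 0:
        return forward, None             # the original value cannot be reconstructed
    return forward, (lambda y: y / k)

def undo(value, inverse, saved_before):
    """Use the inverse when one exists; otherwise fall back to a saved prior value."""
    return inverse(value) if inverse is not None else saved_before

add_zero, add_zero_inverse = make_add(0)
times_zero, times_zero_inverse = make_multiply(0)
times_four, times_four_inverse = make_multiply(4)
```

An irreversible operation such as multiplication by zero can be undone only if the prior value was saved before the operation ran, which suggests why such operations warrant more stringent controls than reversible ones.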
A second method currently used for error-handling in complex transactions is the application, after an error, of a pre-established compensatory mechanism, also called (collectively) compensating transactions as noted above. This presumes that all errors experienced can be predetermined, fit into particular categories, and a proper method of correction devised for each category. Using compensating transactions introduces an inherent risk of unrecoverable error: compensating transactions may themselves fail. Dependence entirely on compensating transactions risks the imposition of a Procrustean solution on a correct transaction that has been mistakenly identified as erroneous, or even on an erroneous transaction where the correction asserted becomes worse than the error.
Inherent in the use of compensating transactions is an assumption that each individually defined transaction has a matching transaction (the “compensating transaction”) that will “undo” any work that the original transaction did. When transactions are treated in isolation or are applied sequentially, it is relatively easy to come up with compensating transactions. All that is needed is the state of the system saved from the beginning of the transaction and a function to restore that state. (In essence, this is how one recovers a file using a backup copy. All that is lost is the intermediate correct stages between preparation of the backup and the occurrence of the error.) When transactions become interleaved, this simplistic notion of a compensating transaction no longer works and the implementation becomes a bit trickier. In fact, a compensating transaction may not even exist for certain transactions. The compensating transaction may be selected and applied automatically by the transaction manager. Still, the process is much the same: the system is ultimately returned to an earlier state or its equivalent.
Automated support for compensating transactions requires that, for each transaction, a corresponding compensating transaction be registered with an error management system so that recovery can take place automatically and consistently. The rules for using compensating transactions become more complex as the transaction model departs further from the familiar “flat” model. Formally, compensating transactions should always return a system to a prior state. If multiple systems are recovered, they are all recovered to prior states that share a common point in time. If the atomic actions that make up a transaction can be done in any order, and if each of these has an undo operation, then such a compensating transaction can always be defined. Three guidelines have been published (McGoveran, 2000): (1) Try to keep the overall transaction model as close as possible to the traditional “flat” model or else a simple hierarchy of strictly nested transactions. (2) Design the atomic actions so that order of application within a transaction does not matter. (3) Make certain that compensating transactions are applied in the right order.
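The registration of compensating transactions, and the importance of applying them in the right order, can be sketched as follows. This is an illustrative assumption rather than a published design: the class CompensationLog is hypothetical, and "the right order" is taken here to mean the reverse of the order in which the original steps completed.

```python
class CompensationLog:
    """Registers a compensating action for every step that completes."""

    def __init__(self):
        self._registered = []

    def run(self, action, compensation):
        action()
        self._registered.append(compensation)   # register only after success

    def compensate(self):
        """Undo completed steps, most recent first."""
        while self._registered:
            self._registered.pop()()

state = {"x": 5}
log = CompensationLog()
log.run(lambda: state.update(x=state["x"] + 3),
        lambda: state.update(x=state["x"] - 3))
log.run(lambda: state.update(x=state["x"] * 2),
        lambda: state.update(x=state["x"] // 2))
after_steps = state["x"]                 # (5 + 3) * 2

log.compensate()                         # halve first, then subtract 3
after_compensation = state["x"]

wrong_order = (after_steps - 3) // 2     # subtracting before halving
```

Because the two steps do not commute, applying the same compensations in the original order rather than the reverse order fails to restore the starting value, which is the situation guideline (2) tries to avoid and guideline (3) guards against.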
A transaction logically consists of a begin transaction request, a set of steps or operations, each typically (though not necessarily) processed in sequential order of request and performing some manipulation of identified resources, and a transaction end request (which may be, for example, a commit, an abort, a rollback to named savepoint, and the like). Because the state of the art typically processes each step in the order received, the management of affected resources is largely cumulative rather than either predetermined or predictive, even when the entire transaction is submitted at one time. Resource management, and in particular the scheduling of both concurrent transactions and the operations of which they are composed, may be either static or dynamic. Static scheduling uses various techniques such as conflict graphs to determine in advance of execution which transactions and operations may be interleaved or run concurrently. Dynamic scheduling uses various techniques such as locking protocols to determine at execution time which transactions and operations may be interleaved or run concurrently.
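Static scheduling with a conflict graph can be sketched under the assumption that each transaction's read set and write set are known before execution. Two transactions are treated as conflicting when one writes a resource that the other reads or writes; the function conflict_graph and the example transactions are hypothetical.

```python
from itertools import combinations

def conflict_graph(transactions):
    """Map of name -> (read_set, write_set) in; set of conflicting pairs out."""
    edges = set()
    for (a, (reads_a, writes_a)), (b, (reads_b, writes_b)) in combinations(
            transactions.items(), 2):
        # Conflict: a writes what b reads or writes, or b writes what a reads.
        if writes_a & (reads_b | writes_b) or writes_b & reads_a:
            edges.add((a, b))
    return edges

transactions = {
    "T1": ({"a"}, {"a"}),                # reads and writes resource a
    "T2": ({"b"}, {"b"}),                # reads and writes resource b
    "T3": ({"a", "b"}, {"c"}),           # reads a and b, writes c
}
conflicts = conflict_graph(transactions)
```

Transactions with no edge between them (here T1 and T2) may be interleaved or run concurrently, while T3 must be ordered with respect to both; this determination is made before execution, in contrast to the locking protocols used by dynamic scheduling.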