Transactions are familiar to us in everyday life when we exchange money for goods and services, such as buying a train ticket or paying for medicine. Such transactions involve a short conversation (for example, requesting availability and cost), and then making the payment. The processing of any one of these items is a business transaction of the kind that is handled by a transaction management system.
A transaction in the business sense can be viewed as an activity that must be completed in its entirety with a mutually agreed-upon outcome. It usually involves operations on some shared resources and results in overall change of state affecting some or all of those resources. When an activity or a transaction has been started and the mutually agreed outcome cannot be achieved, all parties involved in a transaction should revert to the state they were in before its initiation. In other words, all operations should be undone as if they had never taken place.
There are many examples of business transactions. A common one involves transfer of money between bank accounts. In this scenario, a business transaction would be a two-step process involving subtraction (debit) from one account and addition (credit) to another account. Both operations are part of the same transaction and both must succeed in order to complete the transaction. If one of these operations fails, the account balances must be restored to their original states.
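By way of illustration only, the all-or-nothing behaviour of the money transfer described above can be sketched as follows. The names `transfer`, `accounts` and `InsufficientFunds` are hypothetical and chosen for this sketch. Balances are held in an in-memory dictionary, an assumption made purely for brevity.

```python
class InsufficientFunds(Exception):
    """Raised when a debit would take an account below zero."""


def transfer(accounts, source, target, amount):
    """Move `amount` from `source` to `target`, or leave `accounts` unchanged."""
    snapshot = dict(accounts)  # balances before the transaction started
    try:
        if accounts[source] < amount:
            raise InsufficientFunds(source)
        accounts[source] -= amount  # step 1: debit
        accounts[target] += amount  # step 2: credit (KeyError if target unknown)
    except Exception:
        # One of the operations failed: restore the original balances.
        accounts.clear()
        accounts.update(snapshot)
        raise
```

If the credit fails after the debit has already been applied, the snapshot is restored, so no partial transfer is ever left visible.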
Typically, such transactions consist of many computing and data access tasks executed on one or more machines; the tasks may include handling the user interface, data retrieval and modification, and communications. In the example above, the money transfer operation is a transaction composed of debiting one account and crediting another.
In the context of business software, we can express the above more precisely. A transaction (sometimes also referred to as a ‘Unit-Of-Work’ or ‘UOW’) is a set of related operations that must be completed together. All their recoverable actions must either complete successfully or not at all. This property of a transaction is referred to as ‘atomicity’.
In the simplest case, a transaction will access resources connected to a single computer processor. Such a transaction is referred to as a ‘local transaction’. More often, however, a transaction will access resources which are located in several different computer processors, or logical partitions, each maintaining its own transaction log. Such a transaction is referred to as a ‘distributed transaction’.
When a distributed transaction ends, the atomicity property of transactions requires that either all of the processors involved commit the transaction or all of them abort it. To achieve this, one of the processors takes on the role of coordinator to ensure the same outcome at all of the parties to the transaction, using a ‘coordination protocol’ that is commonly understood and followed by all the parties involved. The two-phase commit protocol has been widely adopted as the protocol of choice in the distributed transaction management environment. This protocol guarantees that the work is either successfully completed by all its participants or not performed at all, with any data modifications being either committed together (completed) or rolled back (backed out) together on failure.
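The coordinator's role in the two-phase commit protocol may be sketched, in much simplified form, as follows. The class `Participant` and the function `two_phase_commit` are hypothetical names used only for this illustration. The sketch assumes in-process participants and omits logging, timeouts and failure of the coordinator itself, all of which a real implementation would have to handle.

```python
class Participant:
    """A party to the distributed transaction (simplified, in-process)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: vote on whether this participant is able to commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled back"


def two_phase_commit(participants):
    """Coordinator: drive every participant to the same outcome."""
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if every vote was "yes"; otherwise roll back all.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "rolled back"
```

A single "no" vote in the first phase is sufficient to cause every participant to roll back, which is how the same outcome is reached at all parties.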
Another property of a transaction is its durability. This means that once a user has been notified of success, a transaction's outcome must persist, and not be undone, even when there is a system failure. A recovery manager is used to ensure that a system's objects are durable and that the effects of transactions are atomic even when the system crashes. The recovery manager saves information required for recovery purposes. This recovery can be for the dynamic backout of a transaction, perhaps as a result of a failure after a task updated a recoverable temporary storage queue. Additionally the recovery data can be used for restoring a transaction processing system to a committed state, for example when the system is restarted after system failure. Typically, the recovery file comprises at least one log containing the history of the transactions performed by a transaction processing system. In the event of a system failure, the recovery file can be played back to return the system to its state right before the failure, and the transaction log(s) used to check for and undo transactions that were not properly completed before failure.
Also, in the event of a transaction failure, the transaction log can be used to reverse updates that have been carried out by that transaction, by working backwards from the last change before the failure, hence the name ‘dynamic transaction backout’. The backout occurs before any locks on any affected resources are released, which safeguards other tasks from the possibility of using corrupted data, because modified data is not released for use by them (“committed”) until the current task has finished with it. In case the log needs to be replayed later in a system restart, an entry is first made in the log to indicate that that transaction is being backed out.
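Dynamic transaction backout as described above can be illustrated with the following minimal sketch. It assumes that each log record carries a "before image" of the updated resource, and it omits locking entirely. The names `update`, `backout` and `_ABSENT`, and the record layout, are hypothetical.

```python
_ABSENT = object()  # marks a key that did not exist before the update


def update(store, log, txn_id, key, new_value):
    """Log the before image of `key`, then apply the change."""
    log.append(("update", txn_id, key, store.get(key, _ABSENT)))
    store[key] = new_value


def backout(store, log, txn_id):
    """Reverse the updates of `txn_id`, working backwards through the log."""
    changes = [rec for rec in log if rec[0] == "update" and rec[1] == txn_id]
    # Record first that the transaction is being backed out, in case the
    # log has to be replayed during a later restart.
    log.append(("backing out", txn_id))
    for _, _, key, before in reversed(changes):
        if before is _ABSENT:
            store.pop(key, None)
        else:
            store[key] = before
```

Because the changes are undone from the most recent to the oldest, a resource updated several times by the same transaction ends up with the value it held before the transaction first touched it.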
Examples of systems which carry out such transaction logging include transaction systems such as IBM® CICS® Transaction Server or IBM WebSphere® Application Server, as well as database systems such as IBM DB2® or IBM IMS™ (IBM, CICS, WebSphere, DB2 and IMS are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both).
The log typically records the information in the order in which the activity occurs. Without some management, the log would consume an ever-increasing amount of resource. The recovery manager must therefore reorganise the log on a regular basis to reduce its size, by carrying out a process called ‘keypointing’. Keypointing comprises writing current committed values of a system's objects to a new portion of the recovery file, together with transaction status entries and intentions lists for transactions that have not been fully resolved. An intentions list for a transaction contains a list of the references and the values of all the objects/resources that are altered by that transaction, as well as information related to the two-phase commit protocol. Once a keypoint has been taken, i.e. once information has been stored through a keypointing procedure, recovery information for irrevocably committed (or backed out) transactions can usually be discarded from the log, an operation sometimes called ‘trimming’ the log. This reduces the file size of the log as well as the number of transactions to be dealt with during recovery.
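The content of a keypoint may be sketched as follows. The function `take_keypoint`, the record layout and the status strings are assumptions made for this illustration, and the two-phase commit information that an intentions list would also carry is omitted.

```python
def take_keypoint(committed_values, log, status):
    """Build a keypoint record from the current state of the system.

    committed_values: dict mapping each object to its committed value
    log: list of (txn_id, obj, new_value) update records, oldest first
    status: dict mapping txn_id to a status string
    """
    resolved = {"committed", "backed out"}
    # Only transactions that are not fully resolved need status entries
    # and intentions lists in the keypoint.
    unresolved = {t: s for t, s in status.items() if s not in resolved}
    intentions = {t: [] for t in unresolved}
    for txn_id, obj, new_value in log:
        if txn_id in intentions:
            intentions[txn_id].append((obj, new_value))
    return {
        "committed_values": dict(committed_values),
        "status": unresolved,
        "intentions": intentions,
    }
```

Transactions absent from the keypoint's status entries are those whose recovery information is no longer needed and can therefore be trimmed from the log.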
The rate at which old log records are deleted by trimming should ideally match the rate at which new log data is generated as new work enters the transaction system and makes its own recoverable changes to resources. In a well-tuned on-line transaction processing (OLTP) environment, the volume of log data held on the log should remain reasonably constant over time, with an underlying periodic rise and fall in the number of log records held on the log as both the new work and the housekeeping work are performed.
This mechanism for log data deletion usually solves the problem of logs continually growing in size over time. However, a long-running UOW can prevent this mechanism from working. Until such a UOW has completed and its log data is no longer required, its data cannot be deleted from the log, and all data logged since the first log entry of that UOW must be maintained on the log. Hence, the system will not be able to trim the log after keypointing, and the log will continue to grow. Eventually, when a critical threshold of the logging subsystem, the operating system, or the available hardware is exceeded, a request to write to the log will fail with an “out of space” type of condition. When such an error is returned to the transaction system, it typically results in a serious failure. The system can no longer log any recoverable changes, and so can no longer protect them from failures that require them to be backed out. Recovery processing (and hence data integrity) can no longer be guaranteed.
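The effect of a long-running UOW on log trimming can be illustrated with the following sketch. The class `Log`, its fixed record capacity and the exception `LogFull` are hypothetical. The sketch models the log as a simple list that can only be trimmed from its oldest end.

```python
class LogFull(Exception):
    """Raised when a write is attempted on a log with no space left."""


class Log:
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []  # (txn_id, payload) pairs, oldest first

    def write(self, txn_id, payload):
        if len(self.records) >= self.capacity:
            raise LogFull("out of space")
        self.records.append((txn_id, payload))

    def trim(self, active):
        """Discard records older than the first record of any active UOW."""
        for i, (txn_id, _) in enumerate(self.records):
            if txn_id in active:
                del self.records[:i]
                return
        self.records.clear()  # no active UOW: nothing needs to be kept
```

When every UOW completes before the next trim, the log stays small. If a UOW whose first record sits at the oldest end of the log remains active, `trim` deletes nothing, and subsequent writes eventually raise `LogFull`.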
In many cases the transaction system will terminate, and an emergency restart of the system will be required in order to recover it to a consistent state once more. However, this process will very likely also fail since the log media is now full. The information for the uncommitted long-running UOW needs to be read back from the log in order to rebuild locks on the recoverable resources it was manipulating, and then drive backout processing to undo these changes. However, a transaction system cannot delete this data from the log until this backout has completed. Any new work that needs to log its recoverable changes will therefore fail with the same log-full condition as before, and the system will terminate once again. In such a situation, the only viable solution will probably be to scratch and redefine the log media, and restart the system ‘cold’. This avoids the need to refer to any old log data from the previous run of the system; the downside is that data integrity is now lost for the recoverable changes made by any active (i.e. uncommitted) work from the previous run.
Strategies aimed at avoiding the occurrence of a log-full condition, such as monitoring log usage and taking steps to reduce the log size, may be employed. However, even with such measures in place, there may still be occasions when the available log space becomes exhausted. The present invention aims to address these problems.