In the field of transaction processing, transactions are typically short lived computations that have a well defined beginning and end. Various protocols have been invented to ensure that all the participants in a transaction agree on how to terminate the transaction, most being based on the so-called two phase commit (2PC) protocol.
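As a minimal sketch of how a two phase commit lets all participants agree on how to terminate a transaction, the following may help. The `Participant` class, the `Vote` enum, and the `two_phase_commit` function are hypothetical names introduced here for illustration only. The sketch ignores timeouts, logging, and participant failures, which a real protocol must handle.

```python
from enum import Enum


class Vote(Enum):
    COMMIT = "commit"
    ABORT = "abort"


class Participant:
    """Hypothetical participant that votes, then applies the agreed outcome."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.outcome = None  # filled in during phase 2

    def prepare(self):
        # Phase 1: report whether this participant is able to commit.
        return Vote.COMMIT if self.can_commit else Vote.ABORT

    def finish(self, outcome):
        # Phase 2: record the coordinator's decision.
        self.outcome = outcome


def two_phase_commit(participants):
    """Commit only if every participant votes COMMIT; otherwise abort all."""
    votes = [p.prepare() for p in participants]
    outcome = Vote.COMMIT if all(v is Vote.COMMIT for v in votes) else Vote.ABORT
    for p in participants:
        p.finish(outcome)
    return outcome
```

A single ABORT vote in the first phase is enough to make every participant abort in the second phase, which is what makes the termination decision unanimous.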
Important features of transaction processing systems are reliability and atomicity. Reliability concerns things such as ensuring that the state of a system can be recovered after a failure, and that all interrupted transactions can be restarted or otherwise handled so that the system failure does not produce unpredictable results. Atomicity means that every transaction is treated as an indivisible unit that is either successful, in which case the results of the transaction are durably stored, or aborted, in which case all data affected by the transaction are returned to their state prior to initiation of the transaction.
Work flow management typically involves processes, such as business activities, that last minutes, hours, or even days, and that therefore run much longer than the discrete transactions handled by traditional transaction processing systems. Work flow management also differs from traditional transaction processing systems in that a typical work flow may involve not only multiple computers or other machines, but also the participation of multiple human principals.
This document is concerned with long-lived activities, such as multi-user computations and business processes. Such activities are sometimes known as work flows. An example of a work flow is one that collects data from a large number of sources, and then integrates that data in some way. The data collection process involves numerous interactions with various pieces of hardware and/or human principals, and the duration of the work flow may be extended, depending on the availability of all the required participating computers and other pieces of hardware. Another example of a long running work flow is the process of composing a newspaper edition, which requires cooperative efforts by many persons as well as by computers and other machinery.
It is a premise of the present invention that an important consideration in any activity management system is recovering from system failures. The activity management system must be able to automatically recover from virtually any system failure once the system is brought back on line. This means that the system must store sufficient data to determine what its state was just prior to the system failure, and to re-initiate processing of all interrupted units of work, herein called steps, with as little backtracking as possible.
In most transaction processing systems, system recovery is implemented by restarting all interrupted transactions at those transactions' beginning. Log records are stored at the beginning and end of each such transaction, enabling a system failure recovery routine to determine which transactions have been completed and which were in mid-process when a system failure occurred. This solution is not suitable for activity management systems handling long running work flows, since that recovery method would mean the redoing of much valuable work.
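The recovery scan described above can be sketched as follows, assuming a log that is simply a list of `(kind, transaction_id)` tuples. That record layout and the function name `interrupted_transactions` are illustrative assumptions, not a format taken from any particular system.

```python
def interrupted_transactions(log):
    """Return ids that have a BEGIN record but no END record, in log order."""
    open_ids = {}  # a dict keeps insertion order, so results follow the log
    for kind, txn_id in log:
        if kind == "BEGIN":
            open_ids[txn_id] = True
        elif kind == "END":
            open_ids.pop(txn_id, None)
    return list(open_ids)


# Transaction 1 completed; 2 and 3 were in mid-process at the failure.
log = [("BEGIN", 1), ("BEGIN", 2), ("END", 1), ("BEGIN", 3)]
to_restart = interrupted_transactions(log)
```

Under this scheme every transaction in `to_restart` is rerun from its beginning, which is acceptable for short transactions but wasteful for a work flow that has already run for hours or days.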
An additional problem that distinguishes long running work flows from short lived transactions is the problem of keeping sufficient records concerning the status of each transaction. For short lived transactions, it is generally sufficient to generate and store log records (A) marking the beginning of each transaction and recording sufficient data to restart that transaction, (B) recording changes made to various data structures so that those changes can be reversed if necessary, and (C) marking the conclusion of the transaction once the results of the transaction have been permanently stored. For long running work flows, backing up the system to undo all the work performed by the work flow up to the point of a system failure will typically be much more involved and in some cases may be virtually impossible.
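The three kinds of log record (A), (B), and (C) for a short lived transaction might look like the sketch below. It assumes an in-memory dictionary as the data store and a Python list as the log. The names `run_transaction`, `undo`, and the record tags are hypothetical and chosen only for this illustration.

```python
_MISSING = object()  # marks "key did not exist before the change"


def undo(store, log, txn_id):
    """Reverse a transaction's changes, newest first, using (B) records."""
    for record in reversed(log):
        if record[0] == "CHANGE" and record[1] == txn_id:
            _, _, key, old_value = record
            if old_value is _MISSING:
                store.pop(key, None)
            else:
                store[key] = old_value


def run_transaction(store, log, txn_id, updates, fail=False):
    """Apply updates to store, logging enough to restart or reverse them."""
    log.append(("BEGIN", txn_id, dict(updates)))  # (A) data needed to restart
    for key, new_value in updates.items():
        # (B) record the prior value so the change can be reversed
        log.append(("CHANGE", txn_id, key, store.get(key, _MISSING)))
        store[key] = new_value
    if fail:  # simulated abort partway through
        undo(store, log, txn_id)
        return False
    log.append(("END", txn_id))  # (C) results are now considered stored
    return True
```

Reversal works here because each change is a small, local overwrite. A long running work flow may have sent messages, moved physical goods, or involved human decisions, none of which can be reversed by replaying old values.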
Another problem associated with long lived activities or work flows concerns the use of data interlock mechanisms. In order to prevent two different computations from accessing and making inconsistent changes to a record in a database or to any other specified object, most multitasking computer systems provide interlock mechanisms that allow one process to have exclusive use of a specified object until the process either completes or explicitly releases its lock on the object. In most cases, a process maintains a lock on each object used by the process until either the process is completed and its results are permanently stored, or the process aborts and any interim changes are reversed. The problem associated with long lived activities is that locking the objects used by each work unit for a long period of time could result in system deadlock, where many work units are unable to proceed because other work units or work flows have locks on objects needed by the blocked work units. Clearly, the extent of the deadlock problem is related to the average number of objects used by each work flow and the average amount of overlap between work flows as to the objects used by those work flows. Nevertheless, the time duration of long lived work flows greatly increases the chances that work flows competing for resources will be delayed for significant periods of time.
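To make the deadlock condition concrete, the sketch below checks whether blocked work units form a cycle of mutual waiting. It assumes each object has at most one lock holder and each work unit waits on at most one object at a time. The function name `find_deadlock` and both dictionary arguments are hypothetical, introduced only for this example.

```python
def find_deadlock(holder_of, waiting_for):
    """Return a list of work units that wait on each other in a cycle, or None.

    holder_of:   object name -> work unit currently holding its lock
    waiting_for: work unit   -> object name it is blocked on
    """
    for start in waiting_for:
        seen = []
        unit = start
        # Follow the chain: unit -> object it needs -> unit holding that object.
        while unit in waiting_for:
            if unit in seen:
                return seen[seen.index(unit):]
            seen.append(unit)
            unit = holder_of.get(waiting_for[unit])
    return None


# u1 holds A and needs B; u2 holds B and needs A: neither can proceed.
holder_of = {"A": "u1", "B": "u2"}
waiting_for = {"u1": "B", "u2": "A"}
cycle = find_deadlock(holder_of, waiting_for)
```

The longer each work unit holds its locks, the more likely it is that such chains of waiting form, which is why lock duration matters so much for long lived work flows.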
One additional problem associated with long lived work flows that is not a problem with short lived transactions concerns tracking those work flows. For short lived transactions, it is generally sufficient to know that each transaction is either in process, in process but blocked from proceeding because a required resource is not available, aborted, or completed. However, for long lived work flows it is important to monitor the status of each work flow at a much greater level of detail.
In summary, problems that distinguish long lived work flows from short lived transactions are recovering interrupted work flows, deadlocks caused by data interlocks, and the need to be able to track or monitor the status of work flows that are in process.