1. Field of the Invention
The embodiments of the invention generally relate to computer systems, and, more particularly, to co-operative workflow environments running on those computer systems.
2. Description of the Related Art
Co-operative workflow environments consist of multiple workflow components deployed at different locations in a distributed infrastructure. Each workflow component works co-operatively with other components to complete an overall global task. The workflows communicate directly with each other (rather than through a centralized controller) in an asynchronous manner to transfer data and control when necessary. For example, cross-organizational business processes can be implemented using co-operating workflows. Another example is decentralized orchestration of composite web services, which is used either for performance benefits (in terms of throughput and response time) or for orchestrating composite web services in constrained data flow environments.
Fault handling (fault propagation and fault recovery) is essential in distributed systems in order to correct the effects of partial changes to the state of a system and restore the system to an error-free state. Where fault recovery is supported, fault propagation is necessary so that faults reach the correct set of fault handling functions. In co-operative workflows, the components typically execute independently. Thus, a fault occurring in one component will typically not be noticed by other components residing on the outgoing communication paths of that component or in paths parallel to that component. Furthermore, a client issuing a request is not notified about a fault occurring in a workflow component. This is rarely an issue in workflows with centralized control, because the faults are generated locally on the centralized controller, so the client can be notified easily and can reissue the request or take corrective steps. Hence, fault propagation (even in the absence of fault recovery) is even more essential for co-operating workflows.
Two types of approaches are generally used for fault recovery in distributed systems: backward error recovery and forward error recovery. Forward error recovery (exception handling schemes) is based on the use of redundant data that repairs the system by analyzing the detected fault and putting the system into a correct state. In contrast, backward error recovery returns the system to a previous (presumed to be) fault-free state without requiring detailed knowledge of the faults.
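The distinction between the two approaches can be illustrated with a minimal Python sketch. The `Account` class, the `withdraw` operation, and the `shortfall` field are hypothetical names introduced only for this illustration and do not come from any particular workflow system. The backward variant restores a checkpoint without inspecting the fault, while the forward variant examines the faulty state and moves to a new correct state.

```python
import copy


class Account:
    """Hypothetical stateful resource, used only for illustration."""

    def __init__(self, balance):
        self.balance = balance
        self.shortfall = 0


def withdraw(account, amount):
    # The state is changed first, so a fault leaves a partial change behind.
    account.balance -= amount
    if account.balance < 0:
        raise ValueError("overdrawn")


def backward_recovery(account, amount):
    # Save a checkpoint, then roll back to it on any fault.
    # No knowledge of the fault itself is needed.
    checkpoint = copy.deepcopy(vars(account))
    try:
        withdraw(account, amount)
    except ValueError:
        vars(account).update(checkpoint)
    return account


def forward_recovery(account, amount):
    # The handler analyzes the faulty state and repairs it by moving
    # to a new correct state instead of returning to an old one.
    try:
        withdraw(account, amount)
    except ValueError:
        account.shortfall = -account.balance
        account.balance = 0
    return account
```

In this sketch, withdrawing 80 from a balance of 50 leaves the balance at 50 under backward recovery, and at 0 with a recorded shortfall of 30 under forward recovery.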
Various workflow systems (including systems that employ workflow partitioning) have relied heavily on backward error recovery (although forward error recovery can also be used), because most of the underlying resources were usually under the control of a single domain. These systems are specified using proprietary languages and usually do not handle nested scopes.
Other conventional solutions have focused on forward error recovery schemes that enable coordinated handling of concurrent exceptions using the concept of the Coordinated Atomic Action (CA Action), which was later extended to the web services domain as the Web Service Composition Action (WSCA).
Fault recovery becomes more complex for cooperative workflows because the different workflow components may be distributed across different autonomous domains. Transactions (a backward error recovery mechanism), which have been used successfully to provide fault tolerance in distributed systems, are not well suited to such cases for several reasons. For example, management of transactions that are distributed over workflow components deployed on different domains typically requires cooperation among the transactional supports of the individual domains. These transactional systems may not be compatible with each other, and the domains may not be willing to cooperate, given their intrinsic autonomy and the fact that they span different administrative domains. In addition, locking resources until the termination of the embedding transaction is generally not appropriate for cooperative workflows, again because of their autonomy and because they potentially have a large number of concurrent clients that will not tolerate extensive delays.
In co-operating workflows, propagation of faults (either programmed exceptions or exceptions arising from the failure of underlying resources) and recovery from those faults become complicated due to the following challenges:
First, there is no centralized global state as different workflow components execute on different nodes and communicate asynchronously with each other. In contrast, in workflows with centralized control, the entire state and all the faults remain localized to that central workflow component.
Second, correct placement of transaction scopes, fault handlers and compensation handlers in co-operating workflows is essential in order to maintain correct semantics of the application. Furthermore, the workflow components generally need to be modified with additional code to correctly forward and handle faults.
Third, different workflow components may execute at different times and have either overlapping or different lifecycles, which means that there is generally no single context available where all faults can be handled. Furthermore, workflow specification languages like BPEL4WS provide “scope” activities to define transaction scopes and to associate fault handlers and compensation handlers with them, creating a fault handling and recovery context. They also ensure fault recovery semantics in which compensation handlers are invoked in the reverse order of completion of their respective scopes. This complicates fault handling and fault recovery when the different workflow components run at different or overlapping times. A single transaction scope might span various workflow components, and the partitioned scopes (which reside in different workflow components) might have different lifetimes; as a result, the data of a scope that has already completed is no longer available for compensating it. Thus, a mechanism is required to store the data of already completed “transaction scopes” so that it can be used for compensating them in case of a fault.
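The compensation semantics described above (handlers invoked in reverse order of scope completion, using data saved when each scope completed) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: `CompensationLog`, `record`, and `compensate_all` are hypothetical names, the log is held in memory in a single process, and nothing here represents the BPEL4WS runtime or any specific storage mechanism.

```python
class CompensationLog:
    """Hypothetical store for the data of already completed scopes."""

    def __init__(self):
        # Entries are appended in order of scope completion.
        self._completed = []

    def record(self, scope_name, saved_data, handler):
        # Called when a scope completes; keeps its data available even
        # after the component that ran the scope has finished.
        self._completed.append((scope_name, saved_data, handler))

    def compensate_all(self):
        # Invoke compensation handlers in reverse order of completion,
        # passing each handler the data saved for its scope.
        order = []
        while self._completed:
            scope_name, saved_data, handler = self._completed.pop()
            handler(saved_data)
            order.append(scope_name)
        return order


# Usage: two scopes complete, then a later fault triggers compensation.
undone = []
log = CompensationLog()
log.record("reserve_seat", {"seat": "12A"}, lambda d: undone.append(("release", d["seat"])))
log.record("charge_card", {"amount": 100}, lambda d: undone.append(("refund", d["amount"])))
compensated = log.compensate_all()
```

Here `charge_card` completed last, so it is compensated first, followed by `reserve_seat`. In a distributed setting the log would have to outlive the individual components, which is the difficulty the paragraph above points out.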
In addition, a fault occurring in one workflow component should not cause another workflow component (one that is expecting inputs from multiple workflow components) to wait indefinitely for an input from the faulty component. This is an issue in co-operating workflows with distributed control flow, because different workflow components can execute concurrently and forward their results to other workflow components for further processing. The waiting workflow component holds up system resources, and the performance of the system degrades over time as the number of faults becomes significant.
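The waiting problem can be made concrete with a short Python sketch of a component that joins inputs from several upstream components. The message format `(sender, kind, payload)`, the function `join_inputs`, and the timeout value are assumptions made for illustration only; they do not describe any particular fault propagation scheme. The sketch shows that the join is released promptly only if a fault message actually reaches it; otherwise it can do no better than give up after a timeout.

```python
import queue


def join_inputs(inbox, expected_senders, timeout_s=1.0):
    """Collect one input from each expected sender (hypothetical join)."""
    results = {}
    pending = set(expected_senders)
    while pending:
        try:
            sender, kind, payload = inbox.get(timeout=timeout_s)
        except queue.Empty:
            # No fault was propagated: the join only learns that
            # something is wrong after holding resources for timeout_s.
            return ("timeout", sorted(pending))
        if kind == "fault":
            # A propagated fault releases the join immediately.
            return ("fault", sender)
        results[sender] = payload
        pending.discard(sender)
    return ("ok", results)


# Usage: component B fails and its fault is forwarded to the join.
inbox = queue.Queue()
inbox.put(("A", "data", 1))
inbox.put(("B", "fault", "resource unavailable"))
outcome = join_inputs(inbox, ["A", "B"], timeout_s=0.05)
```

Without the fault message from B, the same call would block until the timeout expires, and with no timeout at all it would block indefinitely.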
In addition to augmenting the existing forward error recovery mechanisms, additional fault propagation schemes are needed for handling faults in cooperative workflows. Not much work has been done in this area. Therefore, there remains a need for a new technique capable of providing fault handling in a cooperative workflow environment.