The present invention is concerned with the management of complex IT infrastructures (1) consisting of distributed heterogeneous resources (see FIG. 1). The management of such IT infrastructures is, in most cases, done by performing a number of system management tasks (tasks) in a certain sequence in order to reach a certain management goal. Such a sequence of system management tasks is called a system management flow (2) within this invention.
Each task in a system management flow fulfills a certain sub-goal within the overall flow and thus contributes to the overall goal of the complete system management flow. System management tasks (10-12) are provided by system management applications (e.g. Tivoli Provisioning Manager, Tivoli System Automation etc.) and can be leveraged to perform certain actions on the managed IT infrastructure (1). For example, tasks (10-12) provided by Tivoli Provisioning Manager can be used to provision new resources to an IT infrastructure (1).
In order to allow integration into a management flow, said tasks (10-12) provide standards-based web services interfaces (13-15) via which tasks get invoked (20) during the execution of a system management flow.
From an architectural perspective, system management flow (2) in FIG. 1 is primarily a logical flow description that arranges the single tasks according to their dependencies among each other. That is, a task N might depend on the result of a task N−1, and yet another task N+1 can only be executed if task N has finished.
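The dependency-based arrangement of tasks described above can be illustrated with a minimal sketch. It assumes a purely linear chain of three tasks; the task names and the `execution_order` helper are hypothetical and are not part of the described system.

```python
from graphlib import TopologicalSorter

# Hypothetical logical flow: each task maps to the set of tasks it depends on.
dependencies = {
    "task_1": set(),
    "task_2": {"task_1"},   # task N depends on the result of task N-1
    "task_3": {"task_2"},   # task N+1 runs only after task N has finished
}

def execution_order(deps):
    """Return an order in which every task follows all of its prerequisites."""
    return list(TopologicalSorter(deps).static_order())

order = execution_order(dependencies)
print(order)
```

Because each task here depends on exactly one predecessor, only one valid order exists; a flow with independent branches would admit several.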
In order to get executed, such a logical system management flow has to be converted (encoded) into a detailed flow definition that can be executed by a Workflow Engine (19). Typically, such a detailed flow definition contains the following items for each task (e.g. task 3) defined in the logical system management flow: invoke the task via its web services interface; wait for the response; analyze and process the response. A commonly used standard for said detailed flow definitions is the Business Process Execution Language (BPEL).
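The three items per task (invoke, wait for the response, process the response) can be sketched as follows. This is an illustration only, not BPEL and not the interface of any particular workflow engine; the `Response` type, `run_flow`, and `fake_invoke` are assumptions introduced for the example, and the web services call is replaced by a local function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Response:
    ok: bool        # whether the task reported success
    payload: str    # result or error text returned by the task

def run_flow(tasks: List[str], invoke: Callable[[str], Response]) -> Dict[str, str]:
    """Run the tasks in sequence: invoke, wait for the response, process it."""
    results: Dict[str, str] = {}
    for name in tasks:
        response = invoke(name)            # invoke the task and wait (synchronous call)
        if not response.ok:                # analyze the response
            raise RuntimeError(f"{name} failed: {response.payload}")
        results[name] = response.payload   # process the response
    return results

def fake_invoke(name: str) -> Response:
    """Stand-in for a web services call; always succeeds in this sketch."""
    return Response(ok=True, payload=f"{name} done")

results = run_flow(["task_1", "task_2", "task_3"], fake_invoke)
print(results)
```

Note that this sketch simply stops at the first failure, which is exactly the limitation that the following paragraphs discuss.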
While the logical system management flow (2) is a mostly straightforward definition of a certain sequence of tasks, the detailed flow definition (e.g. written in BPEL) can become very complex once it has to handle errors that can occur in single system management tasks. Errors that occur during runtime have to be resolved before the next system management task can be executed. The way in which errors are handled has to be explicitly defined within the detailed flow definition.
Moreover, system management tasks invoked by a system management flow often contain a number of internal sub-steps (e.g. task 2; 16-18). For complete error handling, the system management flow definition must explicitly react to each potential error that can occur in those sub-steps, resulting in very complex constructs (4) for the invocation of one logical system management task (3). An error-aware definition for the invocation of a multi-step system management task would, for example, include the following items:
(5) invoke the system management task;
(6) check the result of the task; in case of an error, try to find out which sub-step failed;
(7-9) depending on which sub-step failed, perform a certain sequence of corrective actions and try to re-run the task in order to achieve the task's goal.
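The error-aware invocation outlined in items (5) to (9) can be sketched as follows. This is a minimal illustration under stated assumptions: the `SubStepError` exception, the `invoke_with_recovery` helper, the sub-step names, and the retry limit are all hypothetical and merely show why one logical task expands into a complex construct.

```python
from typing import Callable, Dict

class SubStepError(Exception):
    """Hypothetical error that names the sub-step in which a task failed."""
    def __init__(self, sub_step: str):
        super().__init__(f"sub-step {sub_step} failed")
        self.sub_step = sub_step

def invoke_with_recovery(task: Callable[[], str],
                         corrective_actions: Dict[str, Callable[[], None]],
                         max_attempts: int = 2) -> str:
    """Invoke a task; on failure, run the matching corrective action and re-run."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()                       # (5) invoke the task
        except SubStepError as err:             # (6) find out which sub-step failed
            action = corrective_actions.get(err.sub_step)
            if action is None or attempt == max_attempts:
                raise                           # no known fix, or retries exhausted
            action()                            # (7-9) corrective action, then re-run
    raise ValueError("max_attempts must be at least 1")

# Hypothetical task with three sub-steps; the second fails until it is corrected.
state = {"step_b_ready": False}
log = []

def task_2() -> str:
    if not state["step_b_ready"]:
        raise SubStepError("step_b")
    return "task 2 done"

def fix_step_b() -> None:
    log.append("corrected step_b")
    state["step_b_ready"] = True

result = invoke_with_recovery(
    task_2,
    {"step_a": lambda: None, "step_b": fix_step_b, "step_c": lambda: None},
)
print(result, log)
```

Even in this small sketch, every sub-step needs its own entry in the table of corrective actions, and the same pattern would have to be repeated for each task in the flow.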
The error handling concept just explained aims to resolve errors in single tasks, or even in sub-steps of tasks, in a system management flow so that the overall flow can continue. In other words, this concept makes it possible to process a system management flow from beginning to end. In some cases, however, it might not be possible to resolve an error in one task or sub-step of a task. With the above error handling scheme the overall flow could not continue in such a case, since errors must be resolved before proceeding to the next task. A flow that stops in this way often leaves the managed IT infrastructure in an inconsistent state.
Instead of getting stuck at one point within a flow, it is often desirable to either
(1) roll back all the work done so far in order to reach the consistent system state that existed before the flow, or
(2) go on processing the system management flow in a forced manner in order to get as many of the remaining tasks done as possible.
Option (1) gives the flow a kind of transactional semantics: “do all or nothing”. Option (2) allows for processing as much of the work as possible, leaving only a few open tasks that may have to be performed manually by an operator.
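The difference between the two options can be made concrete with a short sketch. It assumes that every task comes with an explicit undo action and that the remaining tasks can run independently of the failed one; the function names `run_all_or_nothing` and `run_forced`, the task tuples, and the simulated system state are hypothetical.

```python
from typing import Callable, List, Set, Tuple

# A task is modelled as (name, do, undo); all three parts are illustrative.
Task = Tuple[str, Callable[[], None], Callable[[], None]]

def run_all_or_nothing(tasks: List[Task]) -> bool:
    """Option (1): on the first failure, undo completed tasks in reverse order."""
    done: List[Task] = []
    for task in tasks:
        _, do, _ = task
        try:
            do()
        except Exception:
            for _, _, undo_prev in reversed(done):
                undo_prev()
            return False
        done.append(task)
    return True

def run_forced(tasks: List[Task]) -> List[str]:
    """Option (2): keep going after failures and report the tasks left open."""
    open_tasks: List[str] = []
    for name, do, _ in tasks:
        try:
            do()
        except Exception:
            open_tasks.append(name)
    return open_tasks

def make_tasks(system: Set[str]) -> List[Task]:
    """Build three tasks acting on a simulated system; task 'b' always fails."""
    def fail() -> None:
        raise RuntimeError("unresolvable error")
    return [
        ("a", lambda: system.add("a"), lambda: system.discard("a")),
        ("b", fail, lambda: None),
        ("c", lambda: system.add("c"), lambda: system.discard("c")),
    ]

rolled_back: Set[str] = set()
committed = run_all_or_nothing(make_tasks(rolled_back))

forced: Set[str] = set()
open_tasks = run_forced(make_tasks(forced))
print(committed, rolled_back, forced, open_tasks)
```

In the first run the system returns to its initial empty state; in the second run tasks "a" and "c" are completed and only "b" remains for an operator.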
With current workflow techniques such as BPEL, it is possible to implement either of the mentioned options (1) and (2). BPEL allows for starting alternative flows whenever something goes wrong in the original workflow. However, those alternative flows for compensation or forced processing have to be modeled explicitly. In particular, it might be necessary to provide definitions for compensation or forced flows (21) for each potential position in the original flow where an error can occur.