1. Field of the Invention
The present invention relates to executions of system management flows, and in particular to a method and system for automated handling of errors in execution of system management flows consisting of system management tasks.
2. Description of the Related Art
The present invention is concerned with the management of complex IT infrastructures (1) consisting of distributed heterogeneous resources (see FIG. 1). The management of such IT infrastructures is, in most cases, done by performing a number of system management tasks (tasks) in a certain sequence in order to reach a certain management goal. Such a sequence of system management tasks is called a system management flow (2) within this invention.
Each task in a system management flow fulfils a certain sub-goal within the overall flow and thus contributes to the overall goal of the complete system management flow. System management tasks (10-12) are provided by system management applications (e.g. Tivoli Provisioning Manager, Tivoli System Automation etc.) and can be leveraged to perform certain actions on the managed IT infrastructure (1). For example, tasks (10-12) provided by Tivoli Provisioning Manager can be used to provision new resources to an IT infrastructure (1).
In order to allow integration into a management flow, the tasks (10-12) provide standards-based Web services interfaces (13-15) via which tasks get invoked (20) during the execution of a system management flow.
From an architectural perspective, system management flow (2) in FIG. 1 is primarily a logical flow description that arranges the single tasks according to their dependencies among each other. That is, a task N might depend on the result of a task N−1, and yet another task N+1 can only be executed if task N has finished.
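The dependency-driven ordering described above can be illustrated with a minimal sketch. The task names and the dependency map are hypothetical and serve only to show how a logical flow fixes an execution order; they are not part of the described system.

```python
from graphlib import TopologicalSorter

# Hypothetical logical flow: each task maps to the set of tasks it depends on.
# task_2 needs the result of task_1, and task_3 can only run after task_2.
dependencies = {
    "task_2": {"task_1"},
    "task_3": {"task_2"},
}

# static_order() yields the tasks so that every task follows its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
```

For this linear chain, the only valid order is task_1, then task_2, then task_3.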
In order to get executed, such a logical system management flow has to be converted (encoded) into a detailed flow definition that can be executed by a workflow engine (19). Typically, such a detailed flow definition contains the following items for each task (e.g. task 3) defined in the logical system management flow: invoke the task via its Web services interface; wait for the response; analyze and process the response. A commonly used standard for the detailed flow definitions is the Business Process Execution Language (BPEL).
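The three per-task items (invoke, wait for the response, analyze and process the response) can be sketched as follows. This is a simplified illustration, not BPEL and not the workflow engine (19) itself: the names TaskResponse and run_flow are assumptions, and each task is modeled as a plain synchronous callable standing in for a Web services invocation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TaskResponse:
    """Hypothetical response returned by an invoked task."""
    ok: bool
    result: str


# A task is modeled as a callable that receives the results collected so far.
Task = Callable[[Dict[str, str]], TaskResponse]


def run_flow(tasks: List[Task]) -> Dict[str, str]:
    """Execute tasks in sequence, passing earlier results to later tasks."""
    context: Dict[str, str] = {}
    for index, task in enumerate(tasks, start=1):
        # Invoke the task and wait for its response (synchronous call).
        response = task(context)
        # Analyze the response; without explicit error handling the flow stops.
        if not response.ok:
            raise RuntimeError(f"task {index} failed: {response.result}")
        # Process the response so that the next task can depend on it.
        context[f"task{index}"] = response.result
    return context
```

Note that this sketch contains no error handling beyond aborting the flow, which corresponds to the purely logical sequence of task invocations.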
While the logical system management flow (2) is a mostly straightforward definition of a certain sequence of tasks, the detailed flow definition (e.g. written in BPEL) can become very complex as soon as it comes to the handling of errors that can occur in single system management tasks. Errors that occur during runtime have to be resolved before the next system management task can be executed. The way in which errors are handled has to be explicitly defined within the detailed flow definition.
Moreover, system management tasks invoked by a system management flow often contain a number of internal sub-steps (e.g. task 2; 16-18). For complete error handling it is necessary to explicitly react, within the system management flow definition, to each potential error that can occur in the sub-steps, resulting in very complex constructs (4) for the invocation of one logical system management task (3). An error-aware definition for the invocation of a multi-step system management task would, for example, include the following items:
(5) invoke the system management task;
(6) check the result of the task; in case of an error, try to find out which sub-step failed;
(7) depending on which sub-step failed, perform a certain sequence of corrective actions and try to re-run the task in order to achieve the task's goal.
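Items (5) to (7) above can be sketched as a single error-aware invocation. This is an illustrative sketch under stated assumptions: TaskResult, invoke_with_error_handling, the sub-step identifiers, and the retry limit are all hypothetical, and a real flow definition would have to spell out this logic for every invoked task.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class TaskResult:
    """Hypothetical result of a multi-step task invocation."""
    ok: bool
    failed_substep: Optional[str] = None  # set when a sub-step failed


def invoke_with_error_handling(
    task: Callable[[], TaskResult],
    corrective_actions: Dict[str, Callable[[], None]],
    max_attempts: int = 3,  # assumed retry limit, not taken from the source
) -> TaskResult:
    """Invoke a task, correct known sub-step failures, and re-run it."""
    result = TaskResult(ok=False)
    for _ in range(max_attempts):
        result = task()                      # (5) invoke the task
        if result.ok:                        # (6) check the result
            return result
        # (6) find out which sub-step failed and look up a correction for it
        action = corrective_actions.get(result.failed_substep)
        if action is None:
            return result                    # no known correction: give up
        action()                             # (7) corrective action, then re-run
    return result
```

Even in this reduced form, the flow designer has to know the internal sub-steps of the task and a suitable corrective action for each of them, which is the source of the complexity described above.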
As just explained, for performing error handling in system management flows it is necessary to explicitly include error handling instructions into flow definitions. That is, it is not sufficient to just define the logical sequence of system management task invocations, but instructions have to be included for handling each error that can potentially occur during the execution of tasks in the system management flow.
In addition to defining complex error-aware flow definitions (4) to correct errors in single tasks, there is a necessity to provide complete alternate flow definitions for non-recoverable errors.
Consequently, system management flow definitions can become very complex and the designer of the flow cannot just define the simple logical structure of the flow.
Detailed knowledge about the internal structure of invoked system management tasks and about possible corrective actions is necessary to define correct error handling instructions within system management flow definitions.
The reasons for the mentioned deficiencies are twofold. On the one hand, workflow engines executing detailed flow definitions are primarily just interpreting and executing flows defined in a flow definition language (e.g. BPEL) and do not include any automatic mechanisms for handling errors. Every step to be done has to be explicitly defined within the flow definitions. On the other hand, there is no sophisticated communication between the workflow engine and invoked tasks other than the invoke call and the response call returned by the invoked task. That is, there is a lack of communication (the lack of a certain protocol) concerning the handling of errors.