This application claims priority from German patent application number 19959181.4, filed Dec. 8, 1999, which is hereby incorporated herein by reference in its entirety.
The invention describes a system and method for automation of a computer network, in particular for automatic starting and stopping of resources in the computer network taking account of their start dependencies.
The interaction between various IT resources in connection with non-determinable requests and events forces the operators of computer centers to monitor, and where necessary correct, the behavior of the said systems. This activity is usually termed Operations Management. Automation as a subsidiary discipline concerns itself with the task of simplifying Operations Management.
In this, an attempt is made to analyze autonomously occurring events and to respond in accordance with an instruction. This instruction is customer-specific, since automation is not able to make the necessary decisions a priori.
A key aspect of automation is that the availability and delivery of IT resources depend on certain states of other IT resources. A specific computer center service (that is, an application) cannot be made available before all the IT resources it needs have been started. An application of this kind typically comprises several programs on different computers. These programs require data which are stored on various storage media. Very often it is also necessary to provide a comprehensive network topology which permits the individual components to communicate both internally and externally.
A further key aspect of automation relates to the fact that IT resources are required by several different applications simultaneously (shared resources). For example, a disk will typically store data for more than one application. Database software can be used for payroll accounting as well as for the product planning system (PPS), and networks are not limited to single applications.
The management of these structures with the goal of delivering and maintaining applications for a specific period of time (service window) is fundamentally supported by automation functions.
The fact that applications are nowadays distributed across several different systems, and that computers have been interconnected to form complex structures, has made automation substantially more difficult.
In a large S/390 Sysplex network, for example, 5,000 programs can be launched in parallel. Application programs are distributed in several instances across computers in the Sysplex network. The workload is distributed across several computers, resulting in a higher throughput, and in the event of a total failure of a computer, applications can continue running at reduced capacity on the remaining computers.
The dependencies between the individual IT resources are not dealt with adequately at present. For example, resources can only be started if other resources are already active, or should only be started when another resource has been stopped. The dependencies also apply to the stopping of IT resources.
Likewise, a resource could be started but could not fully perform its service without a specific resource being active/inactive.
The basic object of automation is to deliver or to terminate an application. When this object is fulfilled, automation attempts to maintain the attained state until a new goal is transmitted.
A change in goal may have many causes:
The operator requests a change.
A service window has expired or begun (driven by a calendar).
The system wants applications to be relocated or additional instances to be created as a result of capacity bottlenecks.
Such goals may be mutually contradictory and have differing levels of importance. At present, automation functions barely meet these demands. Each automation order is executed unconditionally, with no account taken of the preceding activities and without truly understanding why a resource is in its current state. As a result, it may occur that the operator stops an IT resource without knowing that the resource is still needed by another application.
For example, maintenance work is being carried out on a resource, and the allotted time for the work is not sufficient. In this case especially, the beginning of a service window must not result in the resource starting.
The automation described above is at present restricted to a single computer. Consequently, only dependencies between IT resources belonging to that computer can be defined. Typically they are programs which can be run on the computer on which the automation software is also active. Although today a large number of resources of a company are accessible over networks on different computers, a local automation software per se cannot centrally automate all remote IT resources because
the data exchange to monitor an automation process could overload the network (message traffic);
failure of the centralized automation software would result in the total failure of an entire computer network;
the number of resources needing to be automated simultaneously could result in bottlenecks in the automation software itself.
The limitation in force to date has meant that distributed applications could in no way be automated in their entirety. Rather, the automation software itself must attempt, as a client/server topology, to run processes directly on each computer locally and only escalate to a remote next highest instance when required.
This is illustrated by the following example (see FIG. 1), wherein a Web Server application requires the following three start-dependent programs:
1. Network (for example TCP/IP)
2. Database software (for example IBM DB2)
3. Web Server software
For capacity reasons, each program is started on a different computer. However, the Web Server cannot be started until TCP/IP and DB2 are both active. This behavior is dependent on the implementation of the Web Server software. Ideally, this software would simply wait until its two partners are active. But this is not always the case: it can occur that a Web Server which has started too early simply terminates again.
Starting of this application is not fully automatable at present. The process must run as follows:
1. Start TCP/IP on system 1 and DB2 on system 3. This can be done by the automation functions of the respective systems.
2. Monitor both processes until both are active. Automatic launch of the Web Server cannot be handled because of the limitation described here.
3. Start Web Server.
In order to launch the overall application, the operator actually has to enter three orders. Experienced operators launch the Web Server first. Assuming the automation is now capable of launching the start-dependent resources (TCP/IP and DB2) first, the overall application could be activated in this way. That works well until there are applications that contain resources with no start relationships. Of course such a component, too, can be entered somewhere in the start dependencies, but with the result that start processes are serialized.
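The start sequence of this example can be sketched as follows. This is only an illustration under stated assumptions, not the method of the invention: the table `START_AFTER`, the function `start_waves` and the extra resource `Batch` (a resource with no start relationship) are hypothetical names. The sketch groups resources into waves so that everything in one wave could be started in parallel, which avoids the unnecessary serialization mentioned above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical table: resource -> resources that must be active before it starts.
START_AFTER = {
    "TCP/IP": set(),
    "DB2": set(),
    "WebServer": {"TCP/IP", "DB2"},
    "Batch": set(),  # hypothetical resource with no start relationship
}


def start_waves(start_after):
    """Group resources into waves; all members of a wave may start in parallel."""
    sorter = TopologicalSorter(start_after)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose prerequisites are done
        waves.append(ready)
        sorter.done(*ready)  # in a real system: only after monitoring reports "active"
    return waves
```

In this sketch the resource without a start relationship lands in the first wave alongside TCP/IP and DB2, instead of being forced into an artificial position in a serial chain.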
Resources serve several applications (shared resources) and comprise several components (resource components) which can run on different computers. In the OS/390 Sysplex these components can run multiply on different computers (resource instances). This provides a better throughput, and a greater availability in the event of complete computer failures.
The automation can meet the preconditions before the desired resource is handled.
This solution does not support the following valid situations:
Resources in a start relationship do not necessarily have to be in a stop relationship in the reverse sequence. Typically, resources are more often in a start relationship than a stop relationship. With the present-day concept, when an application is terminated components are unnecessarily serialized, even though the computer capacities would permit more parallelism. The overall process takes longer.
Starting a resource should also be able to trigger termination of another resource as an action, or conversely stopping a resource may require activation of a second resource. In computer systems in which two applications may only be active on a mutually exclusive basis, this concept can be utilized to automate the transition from one application to the other (configuration switch).
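The two situations above can be illustrated with a short sketch in which start relationships, stop relationships and exclusivity are held in separate tables. All names here (`START_AFTER`, `STOP_AFTER`, `EXCLUSIVE`, `stop_waves`, `plan_start`, and the applications `AppA` and `AppB`) are assumptions made for illustration and do not come from the description.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical, independent relationship tables.
START_AFTER = {"WebServer": {"TCP/IP", "DB2"}, "AppA": {"DB2"}, "AppB": {"DB2"}}
STOP_AFTER = {}  # no stop relationships defined: stopping needs no ordering
EXCLUSIVE = {"AppA": {"AppB"}, "AppB": {"AppA"}}  # must be inactive before start


def stop_waves(resources):
    """Order a stop using only the stop relationships, not the reversed start order."""
    wanted = set(resources)
    graph = {r: STOP_AFTER.get(r, set()) & wanted for r in wanted}
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


def plan_start(resource, active):
    """Return the ordered actions for starting `resource`, given the active set."""
    actions = []
    for other in sorted(EXCLUSIVE.get(resource, ())):
        if other in active:
            actions.append(("stop", other))  # configuration switch
    for dependency in sorted(START_AFTER.get(resource, ())):
        if dependency not in active:
            actions.append(("start", dependency))
    actions.append(("start", resource))
    return actions
```

With these tables, stopping the three programs of the web example takes a single parallel wave, whereas reversing the start order would need two. Starting `AppB` while `AppA` is active first stops `AppA`.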
Automation delivers value by taking over operations tasks and decisions, preventing the operator from causing damage as a result of incorrect decision-making. This is of course limited to deterministic events. More complicated matters remain at the discretion of a human operator. Thus two decision-making instances arise which can easily come to different results. Present-day automation gives the operator every freedom: it is order-driven and attempts to implement each new order more or less unconditionally. It is difficult to protect the operator against incorrect decision-making.
When new orders are given, no analysis is made of why the current state was attained, or of whether the reason for that state (the goal) was more important than the change now required.
If the operator is right, however, it is not possible to automatically restore the overwritten original state (with the associated reasoning). The operator himself must know what the old state was, and why it was set.
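What is missing here can be made concrete with a small sketch in which every order is kept together with its source, its reason and a priority, so that the most important goal wins and an earlier goal reappears once a later one is withdrawn. The classes `Request` and `ResourceGoals`, the numeric priority scale and the example values are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    source: str    # who asked, e.g. "operator" or "calendar"
    goal: str      # desired state: "active" or "inactive"
    priority: int  # higher value wins; the scale is an assumption
    reason: str    # why the goal was set


@dataclass
class ResourceGoals:
    requests: list = field(default_factory=list)

    def submit(self, request):
        """Record a request; a newer request from the same source replaces the older."""
        self.withdraw(request.source)
        self.requests.append(request)

    def withdraw(self, source):
        """Remove a source's request; the remaining goals come back into effect."""
        self.requests = [r for r in self.requests if r.source != source]

    def winner(self):
        """Return the request that currently determines the goal, or None."""
        if not self.requests:
            return None
        return max(self.requests, key=lambda r: r.priority)
```

For example, a maintenance request with goal "inactive" and a high priority would keep a resource stopped even after a calendar request with goal "active" arrives. Withdrawing the maintenance request restores the calendar goal together with its recorded reason.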
It is therefore the object of the present invention to deliver an improved system and method for automation of programs on distributed computers which avoids the aforementioned disadvantages.
This object is fulfilled by the characteristics in the independent claims. Further advantageous embodiments of the invention are recorded in the subclaims.
A major advantage of the present invention is based on the introduction of an Abstract Resource Model. This model results in the automation being divided into two: an automation execution component (Automation Agent or Resource Agent); and an automation decision-making component (Automation Manager or Resource Manager).
The Resource Agent controls how programs or resources are run within a specific environment. For this, the Resource Agent has at its disposal predefined routines to start, stop or monitor a program. The Resource Agent is preferentially installed on the computer on which the program or resource is installed.
The Resource Manager controls, on an abstract decision-making level, when a program or resource is run. The Resource Manager stores the dependencies of the programs or resources for starting or stopping. The programs are represented by a name. The automation decision-making component is system-independent, and so can be installed on any computer in the automation system. An additional Resource Manager is preferentially installed on a second computer in case the first Resource Manager fails. The advantage of the Abstract Resource Model lies in the fact that the operator needs no specific knowledge as to which programs or program components belong to an overall application, or which programs or program components need to be started in which order.
Programs or program components that depend on one another in order to run can preferentially be assembled in a base group and are presented to the operator via the user interface under a base group name.
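The division of labor between the two components can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the class and method names, the shared `journal` list and the placement of the Web Server on system 2 are assumptions. The agent only knows how to run a program on its computer; the manager only knows resource names, their start dependencies and which agent is responsible.

```python
class ResourceAgent:
    """Execution side: knows HOW to start, stop and monitor programs on one computer."""

    def __init__(self, computer, journal):
        self.computer = computer
        self.journal = journal  # shared list recording actions (illustration only)
        self.active = set()

    def start(self, name):
        self.journal.append(("start", name, self.computer))
        self.active.add(name)

    def stop(self, name):
        self.journal.append(("stop", name, self.computer))
        self.active.discard(name)

    def is_active(self, name):
        return name in self.active


class ResourceManager:
    """Decision side: knows WHEN to run a resource; resources are only names here."""

    def __init__(self, start_after, agents):
        self.start_after = start_after  # name -> names that must be active first
        self.agents = agents            # name -> responsible ResourceAgent

    def request_start(self, name):
        for dependency in sorted(self.start_after.get(name, ())):
            self.request_start(dependency)
        agent = self.agents[name]
        if not agent.is_active(name):
            agent.start(name)


journal = []
system1, system2, system3 = (ResourceAgent(f"system {n}", journal) for n in (1, 2, 3))
manager = ResourceManager(
    start_after={"WebServer": {"TCP/IP", "DB2"}},
    agents={"TCP/IP": system1, "WebServer": system2, "DB2": system3},
)
manager.request_start("WebServer")  # a single order instead of three
```

The operator issues one order by name; the manager resolves the dependencies and delegates each start to the agent on the appropriate computer.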
A further advantage of this invention lies in the fact that new programs only need to be incorporated as abstract resources in the Resource Manager. Only the specific run routines for the new program need to be implemented additionally in the Resource Agent.
A further advantage of the present invention is that the Resource Manager is based on a multi-relationship graph in which all the resources and their dependencies are mapped in a graph structure. Lastly, a further advantage of the present invention lies in the fact that a priority control is introduced for different requests.
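One possible way to picture such a multi-relationship graph is a graph whose edges carry a relationship kind, so that start and stop dependencies between the same resources can be stored and queried independently. The class `RelationshipGraph`, the kind labels `"start_after"` and `"stop_after"`, and the edge from DB2 to TCP/IP are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


class RelationshipGraph:
    """Directed graph whose edges are labelled with a relationship kind."""

    def __init__(self):
        self._edges = defaultdict(set)  # (source, kind) -> set of targets

    def relate(self, source, kind, target):
        self._edges[(source, kind)].add(target)

    def targets(self, source, kind):
        """Direct targets of `source` for one relationship kind."""
        return set(self._edges.get((source, kind), ()))

    def closure(self, source, kind):
        """All resources reachable from `source` via edges of one kind."""
        seen, stack = set(), [source]
        while stack:
            node = stack.pop()
            for target in self._edges.get((node, kind), ()):
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen


graph = RelationshipGraph()
graph.relate("WebServer", "start_after", "DB2")
graph.relate("DB2", "start_after", "TCP/IP")  # hypothetical edge for illustration
```

Querying the same resource under a different kind, for example `"stop_after"`, returns an empty set here, because no stop relationship was entered.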