The present invention relates to an information processing system, and in particular to an information processing system including a plurality of computers using a network. More specifically, the invention relates to a technique for verifying whether a policy rule operates correctly, where the policy rule is used to automatically execute operation management processing (such as re-execution of the job being executed) in response to an event, such as a failure, that has occurred in the information processing system executing a job.
A method using a policy rule is known for automating the operation management of an information processing system. For example, there is a known method in which a policy rule is applied to each job in a policy manager contained in a job manager that manages job execution, and operation management is executed automatically when an event such as a failure occurs during operation of the information processing system (for example, see U.S. Pat. No. 6,504,621).
According to U.S. Pat. No. 6,504,621, a job manager for managing jobs is arranged in the information processing system, and this job manager includes a policy manager. When a user of the information processing system submits a job to the job manager, the user specifies “an action to be performed when an event has occurred during execution of the job” as “a policy rule”. The policy rule is thereby applied to the policy manager. The event may be, for example, “abnormal job termination”, “abnormal stop of device executing the job”, and the like. The action that can be specified may be, for example, “re-execute the same job”, “notify the user”, and the like. When the policy manager detects an event of hardware failure or software failure in the information processing system, for example an abnormal job termination, it references the policy rule and automatically performs the action described in the policy rule. Thus, when an event such as a failure occurs during job execution in the information processing system, the operation management work to cope with it is executed automatically.
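The event-to-action lookup described above can be illustrated with a minimal sketch. It is not the implementation of U.S. Pat. No. 6,504,621; the class names `PolicyRule` and `PolicyManager`, the methods `apply` and `on_event`, and the job name "job A" are hypothetical and are introduced only for illustration. The event and action strings are taken from the examples in the text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Event names taken from the examples in the text.
ABNORMAL_JOB_TERMINATION = "abnormal job termination"
ABNORMAL_DEVICE_STOP = "abnormal stop of device executing the job"


@dataclass
class PolicyRule:
    """Hypothetical rule: an event and the action to perform for it."""
    event: str
    action: Callable[[str], str]  # receives the job name, returns a log entry


@dataclass
class PolicyManager:
    """Hypothetical policy manager holding the rules applied to each job."""
    rules: Dict[str, List[PolicyRule]] = field(default_factory=dict)

    def apply(self, job: str, rule: PolicyRule) -> None:
        # Called when the user submits a job together with a policy rule.
        self.rules.setdefault(job, []).append(rule)

    def on_event(self, job: str, event: str) -> List[str]:
        # Reference the rules of the job and perform every matching action.
        return [r.action(job) for r in self.rules.get(job, []) if r.event == event]


def re_execute(job: str) -> str:
    return f"re-execute {job}"


def notify_user(job: str) -> str:
    return f"notify the user about {job}"


manager = PolicyManager()
manager.apply("job A", PolicyRule(ABNORMAL_JOB_TERMINATION, re_execute))
manager.apply("job A", PolicyRule(ABNORMAL_DEVICE_STOP, notify_user))

# Simulate the detection of an abnormal job termination for "job A".
performed = manager.on_event("job A", ABNORMAL_JOB_TERMINATION)
print(performed)
```

In this sketch the actions only return a description of what would be done; in a real system they would trigger the corresponding operation management work.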
On the other hand, there is also known a method associated with a test of an information processing system. For example, there is known a distributed application test/operation management system as follows. A quality measurement section for measuring the performance data in a component is embedded in the source code file groups of the distributed application, after which the source code file is introduced to a compiler. A server execution file required for starting/operating the server process is created and operated, and a quality data collection/analysis section collects performance data on the respective components from the quality measurement section. Moreover, normal operation data is collected from an application life cycle management section (for example, see JP-A-2002-082926).
The aforementioned background art has problems as follows.
Firstly, in the system disclosed in U.S. Pat. No. 6,504,621, if the content of an applied policy rule is incorrect, execution of the processing described in the policy rule may itself generate a new problem. For example, when the policy rule “notify the user if a failure has occurred in the job” is applied, the user contact address may be incorrect. In this case, it is found that the policy rule does not operate as expected by the user only after the failure has occurred in the information processing system in operation. Moreover, when the failure occurs, appropriate operation management work cannot be performed. For the user, this causes a greater loss than the case in which no operation management work is executed automatically.
Secondly, the contents of the policy rules applied to the information processing system may contradict each other, so that another failure may occur when a particular event occurs. For example, suppose that a policy rule “when a computer has abnormally terminated, all the jobs being executed on that computer are re-executed by an alternative computer” and a policy rule “when job X terminates abnormally, give up execution of job X and notify the user” are both applied. In this case, if the computer executing job X terminates abnormally during execution of job X in the information processing system in operation, job X also terminates abnormally, and both policy rules are executed. As a result, in spite of the latter policy rule, job X is re-executed by the former policy rule, and unintended processing, such as data being rewritten, may be performed. Such a problem easily arises when another policy rule is added to an information processing system in operation to which a policy rule has already been applied, or when an applied policy rule is modified.
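The contradiction in this example can be reproduced with a small sketch. It is only an illustration under stated assumptions: the mapping `jobs_on_computer`, the set `CONFLICTING`, the second job "job Y", and both function names are hypothetical and do not come from the cited documents.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical placement of jobs on computers.
jobs_on_computer: Dict[str, List[str]] = {"computer 1": ["job X", "job Y"]}

# Pairs of actions assumed to be mutually exclusive for a single job.
CONFLICTING: Set[FrozenSet[str]] = {frozenset({"re-execute", "give up"})}


def actions_for_computer_failure(computer: str) -> List[Tuple[str, str]]:
    """Return (job, action) pairs triggered when the computer terminates abnormally."""
    triggered: List[Tuple[str, str]] = []
    for job in jobs_on_computer.get(computer, []):
        # Former rule: re-execute every job of the failed computer elsewhere.
        triggered.append((job, "re-execute"))
        # The computer failure also terminates each job abnormally,
        # so the latter rule fires for job X.
        if job == "job X":
            triggered.append((job, "give up"))
            triggered.append((job, "notify the user"))
    return triggered


def find_conflicts(triggered: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """List (job, first action, second action) for conflicting actions on one job."""
    by_job: Dict[str, List[str]] = {}
    for job, action in triggered:
        by_job.setdefault(job, []).append(action)
    conflicts: List[Tuple[str, str, str]] = []
    for job, actions in by_job.items():
        for i, first in enumerate(actions):
            for second in actions[i + 1:]:
                if frozenset({first, second}) in CONFLICTING:
                    conflicts.append((job, first, second))
    return conflicts


conflicts = find_conflicts(actions_for_computer_failure("computer 1"))
print(conflicts)
```

The sketch reports a conflict only for job X, because that is the only job for which both rules fire on the same computer failure.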
Thirdly, in general, not all operation management work is automated as policy rules; the information processing system is set up so that in some cases a user (such as a system administrator) of the information processing system executes the operation management work manually. When the information processing system is set up in this way, the user must clearly grasp which events cause automatic execution of operation management work and which events require the system administrator to perform the operation management work manually. There may be a case in which the operation management work for an event is not automated by a policy rule and the user is also not prepared to execute it manually. In this case, the operation management work for the event may be delayed or may be incorrect, and as a result a great loss is caused for the user of the information processing system.
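One way to picture this coverage problem is to classify each anticipated event as automated, manual, or not covered at all. The following sketch assumes that such lists are available; the variable names, the function `classify`, and the event "disk full" are hypothetical and are used only for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical list of events the administrator wants to account for.
known_events: List[str] = [
    "abnormal job termination",
    "abnormal stop of device executing the job",
    "disk full",  # illustrative event, not taken from the cited documents
]

# Hypothetical policy rules: event -> automatically executed action.
policy_rules: Dict[str, str] = {
    "abnormal job termination": "re-execute the same job",
}

# Events for which a manual procedure is assumed to have been prepared.
manual_procedures: Set[str] = {"abnormal stop of device executing the job"}


def classify(
    events: List[str], rules: Dict[str, str], manual: Set[str]
) -> Tuple[List[str], List[str], List[str]]:
    """Split events into automated, manual-only, and uncovered events."""
    automated = [e for e in events if e in rules]
    manual_only = [e for e in events if e not in rules and e in manual]
    uncovered = [e for e in events if e not in rules and e not in manual]
    return automated, manual_only, uncovered


automated, manual_only, uncovered = classify(
    known_events, policy_rules, manual_procedures
)
print("automated:", automated)
print("manual:", manual_only)
print("uncovered:", uncovered)
```

Events in the last group correspond to the case described above, in which neither a policy rule nor a prepared manual procedure exists.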
Fourthly, there are limits on testing whether a policy rule operates as expected by the user of the information processing system. For example, as disclosed in JP-A-2002-082926, the test must be performed by using the very information processing system that actually performs jobs. Unlike the performance measurement disclosed in JP-A-2002-082926, however, executing a test that involves generating a failure generally requires stopping the information processing system in operation. Stopping the system means temporarily stopping the jobs being executed on the information processing system, which is often not allowed.