The present invention relates to a system management method and apparatus for a distributed computing system, and more particularly to a system management technique that allows the level of importance of each business operation run on the computing system (hereinafter referred to simply as a job) to be used in system management.
As Internet use has spread rapidly and computer performance has improved significantly in recent years, computers and their peripheral devices have come into widespread use in corporations, and a growing number of business operations are transacted on computing systems. Under these circumstances, failures or troubles that occur on a computing system have a significant effect on a corporation's business, and operating and managing the distributed computing system (hereinafter referred to simply as a distributed system) deployed across the entire organization efficiently and securely has become a major concern for every corporation.
A distributed system such as that described above has generally been managed using an integrated systems management (hereinafter referred to simply as system management) product. The conventional technique involves installing monitoring software, called an agent, on the distributed system to keep track of monitored objects such as business servers, and displaying information on the occurrence of failures and abnormal conditions (hereinafter referred to simply as events) on an event console in a supervision center.
A conventional technique for determining the level of importance of a job as seen from a user of the distributed system is known, such as the one disclosed in JP-A-10-83382. This conventional technique is designed to predict the future trend of the constituent elements of a job from the standpoint of system maintenance, so that necessary steps, such as adding memory or disk capacity, can be taken before a failure occurs. This technique, however, does not consider how to deal with failures that currently exist in the distributed system.
Another prior art technique is disclosed, for instance, in JP-A-10-63539. This technique attempts to reduce the time it takes to deal with the large number of events occurring every minute by automatically classifying or ranking them according to their importance and content. The level of importance or urgency considered in this prior art, however, represents only the severity of a trouble with system resources; it does not take into account which job will be affected by that trouble, the significance of the affected job, or the effect that halting the job will have on the corporation's business.