1. Field of the Invention
The present invention relates to an operation management method, operation management server, and operation management program for managing the operation of target devices, as well as to a computer-readable storage medium storing that operation management program. More particularly, the present invention relates to an operation management method, operation management server, and operation management program which effectively work in troubleshooting a problem with target devices being managed, as well as to a computer-readable storage medium storing that operation management program.
2. Description of the Related Art
The prevalence of Internet access environments in recent years has led to efforts to enhance the reliability of systems. One method is to introduce functional redundancy into the system. With a redundant design, a failure in one part of a system does not disrupt the entire operation; the system can continue to operate using the functions that remain alive.
Generally speaking, a failure or other problem event that occurs in a server is reported to some other device (e.g., an operation management server) in the form of messages. In a redundant system, a problem in one function propagates to other related functions, thus causing more error messages to be transmitted. That is, when a server encounters a problem, that server is not necessarily the only server that produces an error message; other related servers send error messages as well.
The presence of multiple senders of error messages makes it difficult to locate the real problem source. Conventionally, this task is delegated to network-savvy engineers who can locate a problem based on their experience. Less-skilled engineers, however, take a long time to restore the system. In the case of an enterprise network, a delay in recovery would have a significant effect on the business activities of the company. There has therefore been a demand for a network system that can recover from failure without depending on the skill of individual service engineers.
One proposed solution is to have a database that stores records of each network failure, together with the time series of failure notification messages produced as a consequence of that failure. A failure can be located by comparing the messages actually sent from the network with the database records. The proposed device automatically finds the location of a failure and thus enables quick recovery of a network system. See, for example, Japanese Patent Application Publication No. 2001-257677 (FIG. 1).
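For illustration only, the record-comparison idea described above may be sketched as follows. The cited publication's actual matching rule is not described here, so the sketch assumes a simple rule: a stored record matches when its messages appear, in order, within the observed messages. The names FailureRecord and locate_failure, and the sample message strings, are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass(frozen=True)
class FailureRecord:
    """Hypothetical database record: a known failure location and the
    time-ordered notification messages that failure produced."""
    location: str
    messages: tuple


def is_subsequence(pattern: Iterable[str], observed: Iterable[str]) -> bool:
    """True if every message in pattern appears in observed, in order.
    Unrelated messages may be interleaved in observed."""
    it = iter(observed)
    # `msg in it` advances the iterator, which enforces the ordering.
    return all(msg in it for msg in pattern)


def locate_failure(observed: Sequence[str],
                   records: Sequence[FailureRecord]) -> Optional[str]:
    """Return the location of the matching record with the longest
    message series (assumed tie-break rule), or None if nothing matches."""
    matches = [r for r in records if is_subsequence(r.messages, observed)]
    if not matches:
        return None
    return max(matches, key=lambda r: len(r.messages)).location


records = [
    FailureRecord("router-A", ("LINK_DOWN", "ROUTE_LOST")),
    FailureRecord("switch-B", ("PORT_ERR",)),
]
observed = ["LINK_DOWN", "KEEPALIVE_TIMEOUT", "ROUTE_LOST"]
print(locate_failure(observed, records))
```

Preferring the longest matching series is one possible design choice; it favors the most specific record when several stored series fit the observed messages.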
The device disclosed in Japanese Patent Application Publication No. 2001-257677 (FIG. 1) is, however, directed to communication failures on a network. It does not discuss how to deal with problems that an application or other program running on a server may encounter. The proposed device provides no function for investigating a problem from error messages produced by applications, middleware, or the operating system (OS) on a server.
The existing techniques do not make it possible to identify the real location of a server problem when that problem causes a plurality of messages to be generated. Suppose, for example, that an application on a server has stopped for some reason. In addition to the originating application itself, other programs, including middleware and OS modules, may also issue error messages. Particularly in an environment where a plurality of servers operate cooperatively, an application on another server may produce an error message as well.
As seen from the above discussion, one problem on a multifunction computer system could affect various applications running on different servers, resulting in multiple error messages. While the original problem derives from a particular software program on a particular server, it is not easy to reach the real cause and location of an error by merely investigating the received messages individually.
Things are more complicated in multitask and/or multithread system environments. In those systems, a problem with memory management could degrade the performance of an application, or could disrupt a middleware module used by an application, without any apparent reason and even though nothing is wrong with the application itself. It is hard to find the cause of such a problem, since the real culprit resides not in the software program that is performing poorly, but somewhere else.