The present invention relates to a technique for accurately detecting a system failure, and particularly to a technique for accurately detecting a failure in a system where a plurality of servers communicate with each other.
In recent years, large-scale websites have been provided not by a single server but by a system including a plurality of servers. This type of system is called a multi-tier server system, and includes a servlet server for controlling communication over the HTTP protocol, an application server for running a called application, a database server for processing database transactions, and the like. In order to detect a failure occurring in this type of multi-tier server system, a monitoring server provided separately from this server group is conventionally used.
The monitoring server regularly collects status information from each server in the system. For example, hardware statuses such as the supply voltage, the CPU temperature, and the CPU busy rate are collected. When the collected statuses differ from normal ones, it is judged that an anomaly is occurring in the system. However, this type of monitoring server alone may fail to detect an anomaly occurring in software. For this reason, each server is made capable of detecting a software-based failure by measuring the time required for a transaction that the server has requested from another server, and by judging whether or not the required time is within a predetermined range.
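By way of illustration only, the judgment on the required time described above may be sketched as follows in Python. The function name measure_transaction, the injectable clock parameter, and the lower and upper limits are assumptions introduced for this sketch; they are not taken from the referenced techniques.

```python
import time
from typing import Any, Callable, Tuple


def measure_transaction(
    send_request: Callable[[], Any],
    lower_seconds: float,
    upper_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Any, float, bool]:
    """Time one transaction request and judge whether the required time
    falls within the predetermined range [lower_seconds, upper_seconds].

    send_request is a hypothetical callable that issues the request to
    another server and blocks until the response is returned.
    """
    start = clock()
    response = send_request()
    elapsed = clock() - start
    within_range = lower_seconds <= elapsed <= upper_seconds
    return response, elapsed, within_range
```

In this sketch a requesting server would treat a result whose third element is False as a possible software-based failure in the server that handled the request. A monotonic clock is used so that adjustments to the system time do not distort the measured duration.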
For reference techniques related to failure detection, refer to Japanese Patent Application Laid-open Publication No. 2001-282759 and Japanese Patent Application Laid-open Publication No. 2003-196178.
In the above-mentioned multi-tier server system, there is a case where a first server requests a transaction from a second server, and where the second server further requests the transaction from a third server. In this case, even if the transaction response returned to the first server is delayed, the first server cannot determine which of the second and third servers has the failure. If the first server nevertheless determines that the failure has occurred in the second server and, for example, changes the transmission path for transaction requests, the processing efficiency is likely to decrease unnecessarily.
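The ambiguity described above may be illustrated by the following simplified Python sketch. The function name and the numeric values are hypothetical, and network transfer time is ignored for simplicity.

```python
def time_observed_by_first(second_own_seconds: float, third_own_seconds: float) -> float:
    """Required time seen by the first server for one chained transaction.

    The response from the second server already includes the time the
    second server spent waiting for the third server, so the first server
    observes only the sum of the two.
    """
    return second_own_seconds + third_own_seconds


# Case A: the second server itself is slow (hypothetical values, in seconds).
case_a = time_observed_by_first(second_own_seconds=2.0, third_own_seconds=0.1)

# Case B: the third server is slow and the second server is healthy.
case_b = time_observed_by_first(second_own_seconds=0.1, third_own_seconds=2.0)

# The first server measures the same delay in both cases.
indistinguishable = case_a == case_b
```

Because the first server measures only the total, the two cases yield the same required time, and a judgment made solely at the first server cannot attribute the delay to the second or the third server.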
Furthermore, when a program operating on a server is written in the Java (registered trademark) language, Java middleware may regularly perform garbage collection (GC). GC is processing for releasing a memory region which has been reserved by a program but is no longer used, and is carried out independently of the operations of the program, for example at regular intervals. In this case, although the processing in the server is temporarily delayed, the server returns to its original state immediately after GC is completed. From the viewpoint of efficient use of the system, it is undesirable to judge such a temporary state as a failure occurring in the server.
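As a simplified illustration of this problem, the following Python sketch applies a fixed upper limit to each measured required time individually. The function name, the sample values, and the limit are hypothetical; the delayed sample is assumed to coincide with a GC pause.

```python
from typing import List


def flag_out_of_range(required_times: List[float], upper_seconds: float) -> List[int]:
    """Return the indices of the samples whose required time exceeds the limit."""
    return [i for i, t in enumerate(required_times) if t > upper_seconds]


# Hypothetical required times in seconds; the third sample is assumed to
# have been delayed by a GC pause, after which the server recovers.
samples = [0.2, 0.2, 1.5, 0.2, 0.2]
flagged = flag_out_of_range(samples, upper_seconds=1.0)
```

In this sketch the single delayed sample is flagged even though the subsequent samples show that the server returned to its normal state, which corresponds to the undesirable judgment described above.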