In the technical field of information systems composed of a large number of servers and network devices installed in a data center or the like, the importance of the services such systems provide, such as Web services and business services, as social infrastructure is increasing. For this reason, it is indispensable that each server supporting these services operates stably. An integrated management system that centrally monitors the operating status of a plurality of servers and detects the occurrence of a failure is known as a technology for managing such a system.
For example, a widely known integrated management system of this kind obtains measured data on the operating status online from the plurality of monitored servers and detects an abnormality when the measured data exceeds a threshold value. However, in this system, when an abnormality is detected, its cause, for example, a lack of memory capacity, CPU load, or network load, must be narrowed down in order to restore the system.
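The threshold-based detection described above can be illustrated with a minimal sketch. The metric names, threshold values, and the function name detect_threshold_violations are hypothetical and chosen only for illustration; they do not come from any particular management product.

```python
from typing import Dict, List, Tuple

# Hypothetical per-metric thresholds (illustrative values only).
THRESHOLDS: Dict[str, float] = {"cpu_load": 0.90, "memory_used": 0.85}


def detect_threshold_violations(
    samples: Dict[str, Dict[str, float]],
    thresholds: Dict[str, float],
) -> List[Tuple[str, str, float]]:
    """Return (server, metric, value) for every measurement above its threshold.

    `samples` maps a server name to its latest measured values.
    Metrics without a configured threshold are ignored.
    """
    violations: List[Tuple[str, str, float]] = []
    for server, metrics in sorted(samples.items()):
        for metric, value in sorted(metrics.items()):
            limit = thresholds.get(metric)
            if limit is not None and value > limit:
                violations.append((server, metric, value))
    return violations


if __name__ == "__main__":
    latest = {
        "web1": {"cpu_load": 0.95, "memory_used": 0.50},
        "db1": {"cpu_load": 0.20, "memory_used": 0.90},
    }
    for server, metric, value in detect_threshold_violations(latest, THRESHOLDS):
        print(f"abnormality: {server} {metric}={value}")
```

Such a check reports only that a value is out of range. It says nothing about why, which is the limitation the following paragraphs address.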
Usually, in order to determine the cause of an abnormality, the system logs and parameters of the computers that appear to be relevant to the abnormality have to be checked. This check relies on the experience and intuition of a system engineer, so determining the cause takes time and effort. For this reason, it is important for a typical integrated management system to support the administrator by automatically analyzing, for example, combinations of abnormal states based on event data (state notifications) collected from a plurality of devices, estimating the overall problem point and its cause, and notifying the administrator of them. In particular, in order to ensure the reliability of a service in long-term continuous operation, it is necessary to detect not only abnormalities that have already occurred but also performance degradation in which no abnormality clearly appears and signs of failures predicted to occur in the future, so that equipment can be enhanced in a planned manner.
The following technologies related to such an integrated management system have been disclosed. A technology disclosed in Japanese Patent Application Laid-Open No. 2002-342107 reduces service interruption time by limiting the restart range for a process in which a software failure has occurred to a domain unit when the detected system failure is identified as a software failure.
A technology disclosed in Japanese Patent Application Laid-Open No. 2005-285040 collects continuous quantity information as initial monitoring information from a plurality of network apparatuses and monitors the statistical behavior of this continuous quantity information. When behavior different from the usual behavior is detected, it first collects a plurality of pieces of related monitoring information, evaluates each value, and thereby identifies the cause of a failure.
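As a rough illustration of this two-stage approach, the sketch below flags a new measurement whose deviation from the recent history is statistically unusual and, only then, returns the related items to collect. The z-score test, the limit of 3.0, and the names is_unusual, second_stage_targets, and RELATED are assumptions made for this example; the cited publication is not stated here to use this particular statistic.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence

# Hypothetical mapping from a first-stage metric to related monitoring items.
RELATED: Dict[str, List[str]] = {
    "link_utilization": ["packet_loss", "interface_errors"],
}


def is_unusual(history: Sequence[float], new_value: float, z_limit: float = 3.0) -> bool:
    """Return True if new_value deviates from history by more than z_limit sigmas."""
    if len(history) < 2:
        return False  # not enough data to describe "usual" behavior
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return new_value != mu
    return abs(new_value - mu) / sigma > z_limit


def second_stage_targets(
    metric: str,
    history: Sequence[float],
    new_value: float,
    related: Dict[str, List[str]] = RELATED,
) -> List[str]:
    """Return the related items to collect, or an empty list if behavior is usual."""
    if is_unusual(history, new_value):
        return list(related.get(metric, []))
    return []


if __name__ == "__main__":
    past = [10, 11, 9, 10, 10, 11, 9, 10]
    print(second_stage_targets("link_utilization", past, 20))
```

Collecting the detailed items only after the first-stage check fires keeps the routine monitoring load low, at the cost of having no detailed data from before the anomaly.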
A technology disclosed in Japanese Patent Application Laid-Open No. 2006-244447 detects a failure tendency in various parameters of a data storage array and avoids a failure of the system. This technology controls access to a memory array space composed of a plurality of data storage apparatuses and accumulates operation performance data from each data storage apparatus in a history log. It analyzes the operation performance data in order to detect abnormal operation of a data storage apparatus and starts a correction process for that data storage apparatus in response to the analysis.
A technology disclosed in Japanese Patent Application Laid-Open No. 2008-9842 collects information about an operating state of a computer system, records correlation information showing a correlative relationship between the collected information, detects a failure which has occurred in a service carried out by a computer system from the correlation information and the collected information, and generates a process for recovering this failure. This technology determines an effect and an influence on the computer system by execution of this process by referring to the correlation information and decides at least one of whether or not to execute the process for which the effect and the influence are determined, an execution order thereof, and an execution time thereof.
A technology disclosed in Japanese Patent Application Laid-Open No. 2009-199533 obtains performance information for each of a plurality of kinds of performance items from a plurality of apparatuses to be managed. Treating each performance item or apparatus to be managed as an element, it generates a correlation model for each combination of elements based on a correlation function between first performance series information, which shows the time-series variation of performance information for a first element, and second performance series information, which shows the time-series variation of performance information for a second element. This technology analyzes whether performance information newly detected from the apparatus to be managed maintains the correlation model and, if the result of the analysis is negative, determines that the element is abnormal.
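The idea of a correlation model between two performance series can be sketched as follows. This example assumes, purely for illustration, that the correlation function is a linear relation y = a*x + b fitted by least squares and that a new pair of values "keeps" the model when its prediction error stays within the largest error seen during fitting plus a margin. The names fit_linear_model and keeps_model and the margin parameter are hypothetical and are not taken from the cited publication.

```python
from typing import Sequence, Tuple

Model = Tuple[float, float, float]  # (slope a, intercept b, max fitting error)


def fit_linear_model(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Fit y = a*x + b by ordinary least squares over two time-aligned series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("first series is constant; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    max_err = max(abs(y - (a * x + b)) for x, y in zip(xs, ys))
    return a, b, max_err


def keeps_model(model: Model, x: float, y: float, margin: float = 1.0) -> bool:
    """Return True if the new pair (x, y) is consistent with the fitted relation."""
    a, b, max_err = model
    return abs(y - (a * x + b)) <= max_err + margin


if __name__ == "__main__":
    # Hypothetical series: requests per second on one element, CPU percent on another.
    requests = [10, 20, 30, 40]
    cpu = [25, 45, 65, 85]
    model = fit_linear_model(requests, cpu)
    print(keeps_model(model, 50, 105))  # consistent with the learned relation
    print(keeps_model(model, 50, 140))  # relation broken: candidate abnormality
```

A check of this kind can flag an element whose individual values all remain below their thresholds, because it tests the relationship between elements rather than each value in isolation.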