In large-scale information systems such as business information systems and IDC (Internet Data Center) systems, information and communication services such as web services and business services have grown in importance as social infrastructure, and the computer systems providing these services are therefore required to keep operating stably. Operations management of such computer systems is usually performed manually by administrators. As the systems have become larger and more complicated, the load on administrators has increased tremendously, and service suspensions due to misjudgment or erroneous operation have become more likely.
For this reason, integrated fault cause extraction systems have been provided, which monitor and control, in a unified manner, the operating states of the hardware and software included in the above-mentioned systems. Such an integrated fault cause extraction system acquires information on the operating states of hardware and software in a plurality of computer systems managed by the integrated system, and outputs the information to a fault cause extraction apparatus connected to the integrated system. Methods for detecting a fault in a managed system include setting a threshold value for the operating information in advance, and evaluating the deviation of the operating information from its average value.
For example, in the fault cause extraction apparatus of the fault cause extraction system, threshold values are set for individual pieces of performance information, and a fault is detected by finding the pieces of performance information that exceed their respective threshold values. The fault cause extraction apparatus sets a value indicating abnormality in advance as a threshold value, detects an abnormality in an individual element, and reports it to an administrator.
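The two detection approaches mentioned above, a preset threshold and a deviation from the average value, can be illustrated with a minimal Python sketch. The function names, the metric names, and the use of a multiple `k` of the standard deviation as the deviation criterion are assumptions made for illustration only; the cited systems do not specify them.

```python
from statistics import mean, stdev


def exceeds_threshold(values, thresholds):
    """Return the names of metrics whose current value exceeds its preset threshold.

    `values` and `thresholds` are hypothetical dicts keyed by metric name.
    """
    return [
        name
        for name, value in values.items()
        if name in thresholds and value > thresholds[name]
    ]


def deviates_from_average(history, current, k=3.0):
    """Return True if `current` lies more than k standard deviations from the
    mean of `history` (an assumed criterion; k = 3.0 is an arbitrary default).
    """
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs >= 2 points
    if sigma == 0:
        return current != mu
    return abs(current - mu) > k * sigma


# Threshold-based detection: only "cpu" is above its preset limit.
abnormal = exceeds_threshold(
    {"cpu": 95.0, "mem": 40.0},
    {"cpu": 90.0, "mem": 80.0},
)

# Deviation-based detection: 30 is far from the historical average of 10.
spike = deviates_from_average([10, 11, 9, 10, 10], 30)
```

Both checks examine each metric in isolation, which is why a reported abnormality still leaves the administrator to work out which element actually caused it.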
When the detection of an abnormality is reported, the administrator needs to identify its cause in order to resolve it. Typical causes are, for example, CPU overload, insufficient memory capacity, or network overload. However, identifying the cause requires identifying a computer that is likely to be related to the abnormality and then investigating its system logs and parameters. This work requires each administrator to have a high degree of knowledge or know-how, and to spend much time and effort.
For this reason, integrated fault cause extraction systems support the administrator in taking countermeasures: they automatically perform correlation analysis on combinations of operating states and the like, based on event data (state notifications) acquired from a plurality of pieces of equipment, estimate problems or causes from a broader perspective, and notify the administrator of them. In particular, to ensure reliability over long-term continuous operation of the services, it is necessary not only to take measures against abnormalities that have already occurred, but also to extract elements that may cause future abnormalities, even if no abnormality is clearly apparent at present, and then to take measures such as equipment reinforcement in a planned way.
Such fault cause extraction systems, and technologies related to correlation analysis that are applicable to them, are described, for example, in the patent documents listed below.
Japanese Patent Application Laid-Open No. 2009-199533 discloses a technology which generates a correlation model by deriving a transform function for the time series of the values of two arbitrary pieces of performance information (performance values) in a normal state, regarding one series as an input and the other as an output; compares the performance values predicted by the transform function of the correlation model with the performance information acquired at another time; and detects a fault based on the degree of destruction of the correlation.
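The general idea of such a correlation model can be sketched in Python as follows. This is only an illustration under stated assumptions, not the method of the cited publication: the transform function is assumed to be a simple least-squares line y = a*x + b, the tolerance is assumed to be the largest residual seen in the normal-state data multiplied by a hypothetical `margin`, and the degree of destruction is assumed to be the fraction of modeled pairs whose prediction error exceeds that tolerance. All function and metric names are hypothetical.

```python
from itertools import permutations


def fit_linear(xs, ys):
    """Least-squares fit of y = a*x + b (assumed form of the transform function)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("input series is constant")
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def build_model(normal):
    """Fit a transform function for every ordered pair of metrics.

    `normal` maps a metric name to its normal-state time series.
    Returns {(input, output): (a, b, max_residual)}.
    """
    model = {}
    for src, dst in permutations(normal, 2):
        xs, ys = normal[src], normal[dst]
        try:
            a, b = fit_linear(xs, ys)
        except ValueError:
            continue  # no usable transform for a constant input series
        tol = max(abs(y - (a * x + b)) for x, y in zip(xs, ys))
        model[(src, dst)] = (a, b, tol)
    return model


def destruction_degree(model, sample, margin=1.5, floor=1e-6):
    """Return (fraction of broken correlations, list of broken pairs).

    A pair counts as broken when the prediction error on `sample` exceeds
    margin * max_residual (with a small floor for perfect fits).
    """
    broken = [
        (src, dst)
        for (src, dst), (a, b, tol) in model.items()
        if abs(sample[dst] - (a * sample[src] + b)) > max(margin * tol, floor)
    ]
    return len(broken) / len(model), broken


# Normal-state data in which cpu is roughly 0.5 * requests.
normal = {"requests": [10, 20, 30, 40], "cpu": [5.0, 10.5, 14.5, 20.0]}
model = build_model(normal)

healthy_degree, _ = destruction_degree(model, {"requests": 50, "cpu": 25.0})
faulty_degree, broken_pairs = destruction_degree(model, {"requests": 50, "cpu": 60.0})
```

Unlike a per-metric threshold, this check can flag a sample in which each value looks plausible on its own but the relationship between the values no longer holds.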
Japanese Patent Application Laid-Open No. 2009-199534 discloses a fault cause extraction apparatus which predicts a bottleneck that may occur in actual operation by utilizing a correlation model similar to that of Japanese Patent Application Laid-Open No. 2009-199533. Japanese Patent Application Laid-Open No. 2007-227481 discloses a technology which, in identification of production failures of semiconductor wafers, utilizes correlations derived from two-dimensional images, via resistances and so on, which are obtained by applying an electric current to the test patterns on a wafer. Japanese Patent Application Laid-Open No. H05-035769 discloses a correlation analysis apparatus which analyzes the presence or absence of “point of correlation abnormality” and, if the “point of correlation abnormality” exists, excludes the point from the analysis.
Furthermore, Japanese Patent Application Laid-Open No. H09-307550 discloses a network monitoring apparatus which, in the analysis of a network system, extracts a “representative alarm” from a large number of “alarms” that have occurred, by focusing on regularity. Japanese Patent Application Laid-Open No. H10-257054 discloses a network management apparatus which acquires the correspondence relation between a first node group and a second node group, based on a correlation value between the fault events that occurred at the two node groups.