In an information processing system, each constituent device creates an operation log to record its activities. Such logs are collected and analyzed to automatically detect failures in the system. Anomaly analysis is one method of discovering a failure from collected logs. Rather than relying on manually defined log conditions that indicate failure, this method analyzes logs collected from a properly working system to learn their normal characteristics, and detects a failure when a newly collected log deviates from that normal tendency. In one anomaly analysis technique, correlations between different data items contained in the logs are calculated under normal conditions. If such a normal correlation fails to hold in a newly collected log, the discrepancy is interpreted as a sign of failure.
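As a minimal sketch of this idea (the data item names, values, and threshold below are illustrative assumptions, not taken from any disclosed implementation), the correlation between two log data items can be learned from normal logs and then compared against a newly collected log window:

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Values of two data items extracted from logs of a properly working system.
normal_requests = [10, 20, 30, 40, 50]
normal_cpu      = [12, 22, 33, 41, 52]
baseline = pearson(normal_requests, normal_cpu)  # strong positive correlation

# Values of the same data items in a newly collected log window.
new_requests = [10, 20, 30, 40, 50]
new_cpu      = [50, 12, 48, 9, 30]               # no longer tracks requests
observed = pearson(new_requests, new_cpu)

# Interpret a possible failure when the normal correlation no longer holds.
anomaly = abs(observed - baseline) > 0.5         # threshold is illustrative
```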
For example, one proposed operation management device detects failures on the basis of a correlation model that represents correlations of performance values under normal conditions. More specifically, the proposed operation management device measures the values of several performance metrics (e.g., processor usage rate, memory consumption, and disk usage) under normal conditions and formulates therefrom a correlation model representing the correlation between each pair of different performance metrics. The operation management device then monitors the latest values of those performance metrics against the correlations indicated by the correlation model, thereby watching for performance values that violate their normal correlation. The proposed operation management device locates the cause of a failure on the basis of which pair of performance metrics is violating its normal correlation.
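The exact formulation used by the proposed device is not reproduced here; the sketch below assumes one common realization of such a correlation model, namely a linear relation y ≈ a·x + b fitted from normal-condition measurements of two metrics, with a violation declared when the latest pair of values departs from the model by more than a tolerance (the metric names, values, and tolerance are assumptions for illustration):

```python
def fit_linear(xs, ys):
    """Least-squares fit ys ≈ a * xs + b (one correlation in the model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Normal-condition measurements of two performance metrics.
cpu_usage  = [10.0, 20.0, 30.0, 40.0]    # e.g., processor usage rate (%)
disk_usage = [21.0, 41.0, 61.0, 81.0]    # e.g., disk usage
a, b = fit_linear(cpu_usage, disk_usage)  # model: disk ≈ a * cpu + b

def violates(x, y, a, b, tol=5.0):
    """True when the latest value pair (x, y) breaks the modeled correlation."""
    return abs(y - (a * x + b)) > tol

ok_pair  = violates(25.0, 51.0, a, b)    # close to the model -> no violation
bad_pair = violates(25.0, 90.0, a, b)    # far from the model -> violation
```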
In the case where two or more correlations are violated at the same time, the above operation management device counts, for each performance metric, the number of violated correlations that involve that metric. Then, based on these counts, the operation management device identifies the performance metric that resides at the center of the distribution of violated correlations and uses this central metric in the subsequent troubleshooting process.
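The counting step described above can be sketched as follows (the metric names and the set of simultaneously violated correlations are hypothetical examples):

```python
from collections import Counter

# Correlations found to be violated at the same time, as pairs of metrics.
violated = [("serverA.cpu", "serverA.mem"),
            ("serverA.cpu", "serverB.cpu"),
            ("serverA.cpu", "serverA.disk"),
            ("serverB.cpu", "serverB.mem")]

# Count how many violated correlations involve each performance metric.
counts = Counter()
for m1, m2 in violated:
    counts[m1] += 1
    counts[m2] += 1

# The metric involved in the most violations sits at the center of the
# distribution of violated correlations and is used for troubleshooting.
central_metric, _ = counts.most_common(1)[0]
```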
International Publication Pamphlet No. WO2012/086824
International Publication Pamphlet No. WO2013/111560
A failure at a single place may cause the resulting logs to exhibit a chain of abnormal variations in the values of data items. For example, a failure in a server may change the operating status of that server itself or of some information handling processes running on it, and this change may in turn alter another server's operating status or information handling processes. Violations may therefore be observed in multiple correlations at the same time. The conventional anomaly analysis methods, however, can only detect the set of data items whose correlations are violated; they are unable to determine which of those data items is closest to the cause of the failure. In other words, the conventional anomaly analysis methods have difficulty in properly locating the real cause of a failure and thus fail to reduce the load of troubleshooting.
The techniques disclosed in International Publication Pamphlet No. WO2013/111560 estimate the cause of a fault by determining the data item located at the central position of the violated correlations. Violations of correlations may, however, occur in a chained fashion, which means that the data item at the central position is not always the one closest to the cause of the failure.