PTL1 describes one example of an operation management system that models a system from time-series information about system performance and uses the generated model to detect a failure in the system.
The operation management system described in PTL1 determines a correlation function for each pair of a plurality of metrics on the basis of measured values of the plurality of metrics of a system, thereby generating a correlation model of the system. The operation management system then uses the generated correlation model to detect destruction of a correlation (correlation destruction) and determines the cause of a failure on the basis of the correlation destruction. The technique of analyzing the cause of a failure on the basis of correlation destruction in this manner is called invariant relation analysis.
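The idea above can be sketched in a few lines. PTL1 does not specify the exact form of the correlation function, so the linear form y = a*x + b, the least-squares fit, and the threshold-based destruction check below are assumptions for illustration only:

```python
# Minimal sketch of invariant relation analysis for one metric pair.
# The linear correlation function y = a*x + b and the fixed error
# threshold are illustrative assumptions, not PTL1's exact method.

def fit_correlation(xs, ys):
    """Least-squares fit of y = a*x + b from training measurements."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

def is_destroyed(a, b, x, y, threshold):
    """The correlation is 'destroyed' when the prediction error for a
    new measurement exceeds the threshold."""
    return abs(y - (a * x + b)) > threshold

# Training phase: metric y closely tracks metric x (roughly y = 2x).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.1, 8.0, 9.9]
a, b = fit_correlation(xs, ys)

# Monitoring phase: a sample that fits the model vs. one that does not.
print(is_destroyed(a, b, 6.0, 12.1, threshold=0.5))  # False (normal)
print(is_destroyed(a, b, 6.0, 20.0, threshold=0.5))  # True (destruction)
```

A full system would run this check for every modeled pair and localize the failure from which correlations break at the same time.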
In the invariant relation analysis, correlation functions are calculated for all pairs of a plurality of metrics. The number of pairs for which correlation functions are calculated is proportional to the square of the number of metrics. Accordingly, if the scale (the number of metrics) of a system is large, the number of pairs for which correlation functions are calculated becomes huge, which makes it difficult to generate a correlation model within a predetermined period of time.
One way to perform calculation on such a huge amount of data is distributed processing. A typical known distributed processing technique is Hadoop, disclosed in NPL1. In Hadoop, an HDFS (Hadoop Distributed File System), which is a distributed file system, distributes the data to be processed across a plurality of nodes. Processing is then executed in parallel on the plurality of nodes by MapReduce.
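The MapReduce model itself can be illustrated in a toy, single-process form. Real Hadoop jobs are written against its Java API, read input splits from HDFS, and run the phases on separate nodes; the sketch below only shows the map, shuffle, and reduce steps, using a made-up per-metric averaging task:

```python
from collections import defaultdict

# Toy single-process illustration of the MapReduce model.
# In Hadoop, the map and reduce phases run in parallel on many nodes
# and the shuffle moves data between them over the network.

def map_phase(records):
    """Map: turn each input record into a (key, value) pair."""
    for metric_name, sample in records:
        yield metric_name, sample

def shuffle(pairs):
    """Shuffle: group all values emitted for the same key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's group; here, average the samples."""
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

records = [("cpu", 20), ("mem", 50), ("cpu", 40), ("mem", 70)]
print(reduce_phase(shuffle(map_phase(records))))
# {'cpu': 30.0, 'mem': 60.0}
```

For the correlation-model problem, one could analogously key the map output by metric pair so that each reducer fits one pair's correlation function, spreading the quadratic workload over the cluster.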
Note that PTL2 discloses a related technique: a method for determining the nodes on which processing is executed in a distributed processing system, such as Hadoop, on the basis of communication delay between nodes.