Information processing systems in current use include a variety of electronic devices, such as server apparatuses, storage apparatuses, and communication apparatuses. Such an information processing system may experience faults, such as hard disk drive (HDD) failures and communication interface failures. Therefore, a monitoring apparatus may be provided to monitor the operating state of the information processing system by collecting various types of messages from the electronic devices. For example, when the monitoring apparatus detects a fault from the collected messages, the monitoring apparatus may prompt an administrator to replace a server apparatus in use or change a communication route.
Some such monitoring apparatuses may detect a fault symptom on the basis of collected messages before a fault occurs. For example, when detecting an increase in the number of failures in writing to an HDD or a rapid increase in communication delay, a monitoring apparatus may notify an administrator of this as a fault symptom. This allows the administrator to take countermeasures, such as replacing a server apparatus or changing a communication path, before a fault actually occurs, which reduces the halt time of information processing and minimizes the impact of the fault.
As one example, there is provided a facility state monitoring method for detecting a fault symptom on the basis of data collected from a facility, such as a plant. This facility state monitoring method includes a learning phase for creating a normal model that represents the normal state of the facility, and an evaluation phase for detecting a fault symptom on the basis of the normal model and data collected from the facility. In the learning phase, a feature vector is created as the normal model from data obtained during a normal time. In the evaluation phase, a feature vector is created from currently collected data, and is then compared with the normal model. In the case where an “anomaly measurement” according to the distance of the feature vector is greater than or equal to a threshold, it is determined that there is a fault symptom in the facility.
Please see, for example, Japanese Laid-open Patent Publication No. 2011-70635.
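The learning and evaluation phases described above may be illustrated by the following minimal sketch. It is not the method of the cited publication: the use of a mean feature vector as the normal model, the Euclidean distance as the anomaly measurement, and all function names are assumptions made only for illustration.

```python
import math

def learn_normal_model(normal_samples):
    """Learning phase (assumed form): average the feature vectors
    obtained during a normal time to form the normal model."""
    count = len(normal_samples)
    dim = len(normal_samples[0])
    return [sum(sample[i] for sample in normal_samples) / count
            for i in range(dim)]

def anomaly_measurement(normal_model, feature_vector):
    """Assumed anomaly measurement: Euclidean distance between the
    current feature vector and the normal model."""
    return math.dist(normal_model, feature_vector)

def has_fault_symptom(normal_model, feature_vector, threshold):
    """Evaluation phase: a fault symptom is reported when the anomaly
    measurement is greater than or equal to the threshold."""
    return anomaly_measurement(normal_model, feature_vector) >= threshold

# Hypothetical data: two feature vectors collected during a normal time.
model = learn_normal_model([[1.0, 2.0], [3.0, 4.0]])
symptom = has_fault_symptom(model, [5.0, 7.0], threshold=5.0)
```

In this sketch the threshold is supplied by the caller; how the threshold is chosen is outside the scope of the description above.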
One method considered for detecting fault symptoms is to learn message patterns that appeared when faults occurred in the past, and to determine that there is a fault symptom if any of the learned message patterns appears in a set of collected messages. Each message pattern to be learned is a combination of message types that appear with high probability within a predetermined time period before the occurrence of a fault. This detection method, however, has the following problem.
Messages collected from a monitored information processing system may include messages that have a low relevance to faults and that are successively generated as noise. For example, messages including mild warning information, which may be ignored by an administrator, may be generated periodically, such as messages that are generated because a function of monitoring an unused communication interface is active. The types of messages that are collected as noise may change according to changes in the operating state of the information processing system, such as changes in the configuration of the information processing system or changes in business processes using the information processing system. For example, such noise is reduced when the function of monitoring an unused communication interface is deactivated.
If the collected messages include a large amount of noise, the noise may be mixed into the result of learning message patterns indicating fault symptoms. If the types of messages that are successively generated as noise then change from those that appeared at the time of learning, the message patterns obtained as the results of learning may no longer appear in collected messages. This causes a problem of failing to detect fault symptoms on the basis of the existing learning results. One conceivable way to deal with this is to delete the existing learning results and newly learn message patterns. However, re-learning message patterns each time the operating state of the information processing system is changed causes problems of increasing the load due to the re-learning and degrading the accuracy of fault symptom detection.
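The problem described above may be illustrated by the following simplified sketch. It assumes that each message is a (timestamp, message type) pair, that a learned pattern is simply the full set of message types seen in the window before a past fault (rather than a combination selected by co-occurrence probability), and that a pattern "appears" when all of its message types are present in the current set. The message type names are hypothetical.

```python
def types_in_window(messages, start, end):
    """Return the set of message types with start <= timestamp < end."""
    return frozenset(mtype for ts, mtype in messages if start <= ts < end)

def learn_patterns(messages, fault_times, window):
    """Learn one pattern per past fault: the message types observed
    within `window` time units before the fault (simplifying assumption)."""
    return {types_in_window(messages, t - window, t) for t in fault_times}

def detect_symptom(patterns, current_types):
    """A fault symptom is detected if every message type of any learned
    pattern is present in the currently collected message types."""
    return any(pattern <= current_types for pattern in patterns)

# Past messages; "nic_unused_warn" is periodic noise unrelated to the fault.
history = [
    (1, "nic_unused_warn"),
    (2, "disk_write_error"),
    (3, "nic_unused_warn"),
    (4, "io_timeout"),
]
patterns = learn_patterns(history, fault_times=[5], window=5)

# Same noise as at learning time: the learned pattern still appears.
before_change = detect_symptom(
    patterns, {"nic_unused_warn", "disk_write_error", "io_timeout"})

# Noise type has changed: the same symptom is no longer detected.
after_change = detect_symptom(
    patterns, {"fan_speed_notice", "disk_write_error", "io_timeout"})
```

Because the noise message type was learned as part of the pattern, the pattern stops matching once that noise disappears, even though the fault-related message types are still present.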