Network systems consisting of a large number of intricately interconnected components have attracted attention in various fields. For instance, studies on descriptions of many-body systems of electrons or atoms are dominant in the field of solid-state physics. In the field of economics, relationships between nations or categories of industries have long been discussed. Network systems to which a large number of computers are connected also fall into this category.
A difficulty in computer network systems is that, because different observables for monitoring those systems exist at different layers of the OSI hierarchical model, one must determine which layer to focus on and how to treat the relationships between the layers. Furthermore, because interactions between network nodes at each layer are not explicitly defined, there is the fundamental problem of which observational data should be used to describe interactions between bodies in a many-body system.
The more the importance of information technology in society increases, the more serious the impact of major faults of computer systems becomes. In recent distributed complex systems, it is desirable to provide some autonomic functions in the computer system itself to eliminate the need for a human administrator to constantly monitor the system.
The following documents are considered:
[Patent Document 1] Published Unexamined Patent Application No. 2003-60704
[Non-Patent Document 1] H. Hajji, “Baselining Network Traffic and Online Faults Detection”, IEEE International Conference on Communications 2003, Volume 1, 2003, 301-308
One way to give a computer system the autonomy to detect its own anomalies may be to sense the whole system comprehensively and detect any sign of faults automatically. However, there is still the problem of how to describe the state of the whole computer system and detect faults in it, because of the complexity of observational data that is inherent in the multilayer structure of a computer network and in correlated systems.
For instance, products called Network Node Management Systems (hereinafter abbreviated to “NNMSs”) for computer system management are commercially available and widely used. However, these systems have poor automatic fault detection capabilities although they feature information gathering and visualizing capabilities. In fact, the NNMSs typically have SNMP (Simple Network Management Protocol) management capabilities.
However, because SNMP trap events occur too frequently in a default configuration and the individual trap events are not necessarily related to actual faults, some administrators keep the trap event transmission option turned off. Consequently, even if an NNMS is used, constant monitoring of visualized observables by a human administrator is practically the only way to detect signs of faults.
Considering Web-based systems for example, which are becoming increasingly important in business today, issues of system monitoring technology can be summarized as follows.
First, either observed values themselves or random variables representing the observed values may be used for constantly monitoring systems. In Web-based systems, in which observable quantities typically vary strongly over time, techniques that use observed values directly are difficult to apply because threshold values for detecting faults cannot easily be determined. Therefore, treating observed values as random variables may be a realistic method that allows anomaly detection in Web-based systems.
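The difference between the two approaches can be illustrated with a minimal sketch. The function name, the window length, the limit of three standard deviations, and the request counts below are all hypothetical and are not taken from the cited documents. The sketch treats each observed value as a sample of a random variable and judges it against the mean and standard deviation of the preceding samples rather than against a fixed threshold.

```python
from statistics import mean, stdev

def zscore_anomalies(values, window=5, z_limit=3.0):
    """Return the indices of values that deviate strongly from the preceding window.

    Each value is compared with the mean and standard deviation of the
    `window` samples before it (an assumed, simple statistical model).
    """
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            continue  # no variation in the window, so no z-score can be formed
        if abs(values[i] - mu) / sigma > z_limit:
            flagged.append(i)
    return flagged

# Hypothetical request counts per interval at two very different load levels.
low_load = [10, 11, 9, 10, 12, 10, 30]
high_load = [1000, 1010, 990, 1005, 995, 1000, 1020]

print(zscore_anomalies(low_load))
print(zscore_anomalies(high_load))
```

With these assumed data, a fixed threshold of, say, 100 would flag every sample of the high-load series and none of the low-load series, whereas the statistical test reacts only to the jump from about 10 to 30.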
A second issue is whether behavior at layers below the TCP layer or at the application layer of the OSI hierarchical model should be monitored. For instance, in a large-scale three-tier Web-based system including an HTTP (Hypertext Transfer Protocol) server, a Web application server, and a database server, these servers cooperate with one another.
Operations in such a system are performed between servers through programs. For instance, a program of an HTTP server may call a program of a Web application server. Accordingly, an appropriate description of the interaction between servers at the application layer is essential to any description of the system state. The interaction between servers in a Web-based system can therefore be monitored by observing the application layer.
A third issue is whether observed values in multiple dimensions should be treated by considering correlations between observation points or whether each piece of data should be treated independently, without considering such correlations. If distributed processing is implemented by a three-tier Web-based system as described above, monitoring information about a single server independently is not effective. The SNMP-based approach described above is one example of monitoring information about servers independently.
Based on these issues, some prior-art technologies relating to system monitoring will be described below.
An article entitled “Baselining Network Traffic and Online Faults Detection” (H. Hajji, IEEE International Conference on Communications 2003, Volume 1, 2003, 301-308) describes an anomaly detection technique in which information gathered by kernel monitoring of an OS on a server is modeled by using a mixture of normal distributions to detect anomalies. The technique uses only low-layer observed quantities such as arpStatsPkts specified in the MIB (Management Information Base) of an SNMP agent. The article reports only that change points in individual quantities, such as the number of ARP packets, can be automatically detected. The article therefore does not address the second issue: it does not disclose a technique for detecting faults, including those at the application layer, in systems such as Web-based systems.
Published Unexamined Patent Application No. 2003-60704 discloses an approach to monitoring a system by predicting a threshold value for determining that there is an anomaly in a computer network system and updating the threshold value dynamically. However, this approach falls far short of practical applicability to fault detection in real computer systems such as Web-based systems because the observation unit time span for predicting the threshold value is several hours or days.
Another problem is that no adequate answer has been provided as to what should be observed as metrics for monitoring systems in order to implement these system-monitoring techniques in actual systems. As described above, appropriate descriptions of interactions between servers at the application layer are essential in fault detection in Web-based systems.
Japanese Patent Application No. 2003-432337 proposes a metric to be monitored. It describes techniques that calculate the dependencies between services from the number of packets transmitted between servers, so that the dependencies between applications in a server system can be obtained at runtime. Using these techniques, the dependencies between applications (for example, applications running on servers such as an HTTP server and a database server) can be obtained as a weighted directed graph. This means that a matrix representing the dependencies between systems can be obtained.
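As an illustration only, a weighted directed graph of this kind can be stored as a square matrix whose element in row i and column j is the weight of the edge from node i to node j. The following sketch is not the method of the cited application: the function name, the node names, the packet counts, and the normalization by the total packet count are all assumptions made for the example.

```python
def dependency_matrix(packet_counts, nodes):
    """Build a dependency matrix from packet counts between pairs of nodes.

    packet_counts maps (source, destination) to a number of packets.
    Element [i][j] of the result is the share of all observed packets sent
    from nodes[i] to nodes[j] (the normalization is an assumption).
    """
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    matrix = [[0.0] * n for _ in range(n)]
    for (src, dst), count in packet_counts.items():
        matrix[index[src]][index[dst]] += count
    total = sum(sum(row) for row in matrix)
    if total > 0:
        matrix = [[value / total for value in row] for row in matrix]
    return matrix

# Hypothetical counts for a three-tier system during one observation interval.
nodes = ["http", "app", "db"]
counts = {("http", "app"): 60, ("app", "db"): 30, ("app", "http"): 10}

for row in dependency_matrix(counts, nodes):
    print(row)
```

Because the matrix is not symmetric, the direction of each dependency (which service calls which) is preserved.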
There is no known effective technique in which a matrix representing the dependencies between systems is generated at predetermined time intervals and anomalies are detected from changes in the system over time. The present invention focuses attention on information about the dependencies between nodes in a system and provides a method for abstracting unnecessary degrees of freedom and automatically extracting those nodes with high “activity” that frequently interact with each other. Another object of the present invention is to build an automatic fault detection system using the approach.
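The general idea of detecting anomalies from changes in a dependency matrix over time can be sketched as follows. This passage does not state how “activity” is computed, so the sketch makes explicit assumptions: the activity of the nodes is taken to be the principal eigenvector of the symmetrized dependency matrix, obtained by shifted power iteration, and the change between two time intervals is measured as one minus the inner product of the two activity vectors. All function names and matrix values are hypothetical.

```python
import math

def activity_vector(matrix, iterations=100):
    """Approximate the principal eigenvector of the symmetrized matrix.

    Assumption for illustration: a node's activity is its component in this
    unit-length vector. The added v[i] term shifts the spectrum by one, which
    prevents the iteration from oscillating on bipartite graphs.
    """
    n = len(matrix)
    sym = [[matrix[i][j] + matrix[j][i] for j in range(n)] for i in range(n)]
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [v[i] + sum(sym[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def dissimilarity(u, v):
    """One minus the inner product of two unit vectors; 0 means no change."""
    return 1.0 - sum(a * b for a, b in zip(u, v))

# Hypothetical dependency matrices (rows and columns: http, app, db).
normal = [[0.0, 0.6, 0.0], [0.1, 0.0, 0.3], [0.0, 0.0, 0.0]]
faulty = [[0.0, 0.9, 0.0], [0.1, 0.0, 0.0], [0.0, 0.0, 0.0]]  # db unreachable

a_normal = activity_vector(normal)
a_faulty = activity_vector(faulty)
print(dissimilarity(a_normal, a_normal))
print(dissimilarity(a_normal, a_faulty))
```

In this sketch an n-by-n matrix is reduced to a vector of n activities, so only one scalar per time interval needs to be watched. How large a dissimilarity should count as an anomaly is left open here.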