Networks, of which the Internet is the most representative example, are widely used for providing various types of services. Among such services are on-line games and the delivery of video images, in which service traffic is transmitted continuously for a relatively long time.
When a failure occurs on a transmission path of the service traffic of such a service, for example due to a malfunction in a switch or congestion in a router, that failure greatly degrades the reception quality of the service traffic. When, for example, the packet loss ratio increases, a longer time is required for transmitting service traffic. Thus, it is important to respond rapidly to failures occurring on paths in the network.
Conventionally, the location at which a failure has occurred on a network has been determined on the basis of the personal experience or knowledge of the network manager. However, determination of the location of a failure by a network manager usually takes a long time, and accordingly technologies for automatically determining the location of a failure occurring on a network have been discussed.
FIG. 1 illustrates a method of estimating the location of a failure in a conventional network failure detection system. This system is for automatically detecting a failure occurring on a path of service traffic transmitted from an application server.
The network failure detection system illustrated in FIG. 1 includes a large number of nodes that each execute a measuring agent as software, and a monitoring server 280. A node that executes a measuring agent is referred to as a “measuring node” hereinafter in order to distinguish it from a node that does not execute a measuring agent.
In FIG. 1, two measuring nodes denoted by 271 and 272 are illustrated. When a measuring node referred to can be either of the measuring nodes 271 and 272, such a node is referred to as a measuring node 270. The same convention applies to the two application servers, which are denoted by 261 and 262, and to the two nodes (denoted by 291 and 292) other than the measuring nodes 270. When, for example, an application server referred to can be either of the application servers 261 and 262, such an application server is denoted by 260.
The measuring agent has a function of measuring the transmission path and the reception quality of service traffic transmitted from the application server 260, and of transmitting to the monitoring server 280 a measurement result including the measured reception quality and transmission path. The monitoring server 280 analyzes the measurement results obtained from the measuring nodes 270 in order to determine whether or not a failure has occurred. When it is determined that a failure has occurred, the location of the failure is estimated. Thus, the estimation of the location of a failure is executed by a sequence containing a measurement by the measuring node 270 (ST1), a report of the measurement result transmitted to the monitoring server 280 (ST2), and an analysis by the monitoring server 280 (ST3).
An estimation of the location of a failure is performed as follows. Herein, it is assumed that service traffic transmitted from the application server 261 is measured by the measuring node 272, and that service traffic transmitted from the application server 262 is measured by the measuring node 271. Specifically, service traffic transmitted from the application server 261 is assumed to be transferred to the measuring node 272 through a path 1, which includes a link L4 between the application server 261 and the node 291, a link L2 between the nodes 291 and 292, and a link L5 between the node 292 and the measuring node 272. Likewise, service traffic transmitted from the application server 262 is assumed to be transferred to the measuring node 271 through a path 2, which includes a link L3 between the application server 262 and the node 292, the link L2, and a link L1 between the node 291 and the measuring node 271.
A measurement result transmitted from the measuring node 272 to the monitoring server 280 includes a transmission path of service traffic in addition to the reception quality. In the table in FIG. 1, links included in a transmission path are indicated by “1”. The expression “deteriorated” means that the measured reception quality is low, i.e., that a failure has occurred in one of the links on the transmission path.
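As an illustration only, the table described above can be sketched in Python as follows. The link sets for paths 1 and 2 are taken from the description; the use of 0 for links not on a path, the dictionary layout, and the names `LINKS`, `PATHS`, and `to_row` are assumptions made for this sketch and do not appear in FIG. 1.

```python
# Column order for the table; link labels follow the description of FIG. 1.
LINKS = ["L1", "L2", "L3", "L4", "L5"]

# Transmission paths as described: path 1 uses L4, L2, L5; path 2 uses L3, L2, L1.
PATHS = {
    "path 1": {"L4", "L2", "L5"},
    "path 2": {"L3", "L2", "L1"},
}


def to_row(path_links, quality):
    """Encode one measurement result as a table row.

    A link on the transmission path is marked 1; marking the other links 0
    is an assumption of this sketch.
    """
    row = {link: (1 if link in path_links else 0) for link in LINKS}
    row["quality"] = quality
    return row


# Both paths are assumed here to have been measured as "deteriorated".
table = {name: to_row(links, "deteriorated") for name, links in PATHS.items()}

for name, row in table.items():
    print(name, row)
```

In this encoding, the column for link L2 is the only one marked 1 in both rows.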
Reception quality is degraded by the congestion of service traffic, so a failure on a link degrades the reception quality of every transmission path that uses that link. Using this relationship, the monitoring server 280 extracts a link that is used by all of the transmission paths that are “deteriorated”, and the extracted link is estimated to be the location of a failure. Accordingly, in the example illustrated in FIG. 1, link L2, which is used by both of the paths 1 and 2, is estimated to be the location of a failure.
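The extraction of a link common to all “deteriorated” paths amounts to a set intersection. The following is a minimal sketch under that reading, not the implementation of the conventional system; the function name `estimate_failure_links` and the tuple format of a measurement result are hypothetical.

```python
def estimate_failure_links(results):
    """Return the links used by every path reported as "deteriorated".

    `results` is a list of (links_on_path, quality) tuples; this format is
    an assumption of the sketch.
    """
    deteriorated = [set(links) for links, quality in results
                    if quality == "deteriorated"]
    if not deteriorated:
        # No deteriorated path was reported, so no failure location is estimated.
        return set()
    return set.intersection(*deteriorated)


# Measurement results for the example of FIG. 1 (both paths deteriorated).
results = [
    ({"L4", "L2", "L5"}, "deteriorated"),  # path 1, measured by node 272
    ({"L3", "L2", "L1"}, "deteriorated"),  # path 2, measured by node 271
]

print(estimate_failure_links(results))  # {'L2'}
```

With the two paths of the example, the only link shared by both is L2, which matches the estimate given in the text.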
In the conventional network failure detection system as described above, the monitoring server 280 analyzes the measurement results transmitted from each measuring node 270, and estimates the location of a failure. Due to this configuration, the amount of information to be analyzed by the monitoring server 280 is often enormous, which is problematic. This problem manifests itself as higher hardware resource requirements and, because of the greater load, a longer time taken for determining the location of a failure.
A method in which a plurality of monitoring servers are provided and information is distributed among them for processing has been proposed in order to eliminate the necessity of managing an enormous amount of information and to allow a rapid determination of the location of a failure. However, when such distributed processing is employed, each monitoring server analyzes only part of the information. Accordingly, an analysis covering the entire network is impossible, resulting in a lower accuracy of determining the location of a failure. In addition, monitoring servers have to be provided in a number corresponding to the scale of the network. This means that many monitoring servers have to be provided for a large-scale network, making the cost of equipment enormous. Thus, distributed processing using a plurality of monitoring servers is not desirable in view of practical utility.
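The loss of accuracy under distributed processing can be illustrated with the two paths of the FIG. 1 example. The sketch below assumes, purely for illustration, that one monitoring server receives only the result for path 1 and another receives only the result for path 2; the names `common_links`, `server_a`, and `server_b` are hypothetical.

```python
def common_links(deteriorated_paths):
    """Links shared by all deteriorated paths known to one monitoring server."""
    if not deteriorated_paths:
        return set()
    return set.intersection(*deteriorated_paths)


path_1 = {"L4", "L2", "L5"}  # deteriorated path measured by node 272
path_2 = {"L3", "L2", "L1"}  # deteriorated path measured by node 271

# A single server that sees both results narrows the estimate to one link.
whole = common_links([path_1, path_2])

# Hypothetical split: each server sees only one of the two results.
server_a = common_links([path_1])
server_b = common_links([path_2])

print("whole network:", sorted(whole))
print("server A only:", sorted(server_a))
print("server B only:", sorted(server_b))
```

Under this assumed split, each server is left with three candidate links, whereas the analysis over all measurement results isolates L2.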
Reference documents include Japanese Laid-open Patent Publication No. 2006-238052 (Patent Document 1); N. G. Duffield et al., “Simple Network Performance Tomography,” In Proc. of ACM SIGCOMM Internet Measurement Conference 2003 (non-Patent Document 1); Q. Lv et al., “Search and Replication in Unstructured Peer-to-Peer Networks,” Proc. of ICS'02, pp. 84-95, 2002 (non-Patent Document 2); E. Keong et al., “A Survey and Comparison of Peer-to-Peer Overlay Network Schemes,” Journal of IEEE Communication Surveys, Vol. 7, No. 2, pp. 72-93, 2005 (non-Patent Document 3); and I. Stoica et al., “Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications,” Journal of IEEE/ACM Transactions on Networking, Vol. 11, No. 1, 2003 (non-Patent Document 4).