The demand for sophisticated tools for monitoring network utilization and performance has been growing rapidly as Internet Service Providers (ISPs) offer their customers more services that require quality of service (QoS) guarantees and as ISP networks become increasingly complex. Tools for monitoring link delays and faults in an IP network are critical for numerous important network management tasks, including providing QoS guarantees to end applications (e.g., voice over IP), traffic engineering, ensuring service level agreement (SLA) compliance, fault and congestion detection, performance debugging, network operations and dynamic replica selection on the Web. Consequently, a recent flurry of both research and industrial activity has been focused on developing novel tools and infrastructures for measuring network parameters.
Existing network monitoring tools can be divided into two categories. The first category contains node-oriented tools for collecting monitoring information from network devices (routers, switches and hosts) using Simple Network Management Protocol/Remote MONitoring (“SNMP/RMON”) probe messages (see Stallings, “SNMP, SNMPv2, SNMPv3, and RMON 1 and 2,” Third Edition, Addison-Wesley Longman Inc., 1999) or the Cisco NetFlow tool (see “NetFlow Services and Applications,” Cisco Systems, 1999). These tools are useful for collecting statistical and billing information and for measuring the performance of individual network devices (e.g., link bandwidth usage). However, in addition to requiring monitoring agents to be installed at every device, these tools cannot monitor network parameters that involve several components, such as link or end-to-end path latency.
The second category contains path-oriented tools for connectivity and latency measurement, such as “ping,” “traceroute” (see, e.g., Stevens, “TCP/IP Illustrated,” Addison-Wesley Publishing Company, 1994) and “skitter” (see, e.g., Cooperative Association for Internet Data Analysis (CAIDA), http://www.caida.org/), and tools for bandwidth measurement, such as “pathchar” (see, e.g., Jacobson, “Pathchar - A Tool to Infer Characteristics of Internet Paths,” April 1997, ftp://ftp.ee.lbl.gov/pathchar), “Cprobe” (see, e.g., Carter, et al., “Server Selection Using Dynamic Path Characterization in Wide-Area Networks,” in Proceedings of IEEE INFOCOM '97, Kobe, Japan, April 1997), “Nettimer” (see, e.g., Lai, et al., “Measuring Bandwidth,” in Proceedings of IEEE INFOCOM '99, New York City, N.Y., March 1999) and “pathrate” (see, e.g., Dovrolis, et al., “What Do Packet Dispersion Techniques Measure?,” in Proceedings of IEEE INFOCOM '2001, Alaska, April 2001). As an example, skitter sends a sequence of probe messages to a set of destinations and measures the latency of a link as the difference between the round-trip times of the two probe messages sent to the endpoints of the link. A benefit of path-oriented tools is that they do not require special monitoring agents to be run at each node. However, a node running such a path-oriented monitoring tool, termed a monitoring station, is able to measure latencies and monitor faults for only a limited set of links, namely those in the node's routing tree (e.g., shortest path tree). Thus, monitoring stations need to be deployed at a few strategic points in the ISP or enterprise IP network so as to maximize network coverage while minimizing hardware and software infrastructure costs, as well as maintenance costs for the stations.
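The round-trip-time difference described for skitter can be illustrated with a minimal sketch. The function name `link_latencies`, the node labels and the RTT values are hypothetical and chosen only for illustration; the sketch assumes the probes to both endpoints of a link follow the same path from the monitoring station, and it is not skitter's actual implementation.

```python
def link_latencies(path, rtt_ms):
    """Estimate per-link latency along one probe path.

    path   -- nodes in order, starting at the monitoring station
    rtt_ms -- measured round-trip time (ms) from the station to each node;
              the station itself is treated as having an RTT of 0
    Returns a dict mapping each link (near, far) to RTT(far) - RTT(near).
    """
    estimates = {}
    for near, far in zip(path, path[1:]):
        near_rtt = rtt_ms.get(near, 0.0) if near != path[0] else 0.0
        estimates[(near, far)] = rtt_ms[far] - near_rtt
    return estimates


# Hypothetical measurements from station "s" along the path s -> a -> b.
example = link_latencies(["s", "a", "b"], {"a": 2.0, "b": 5.0})
```

Only links on the station's own routing tree appear in `path`, which is why a single station can estimate latencies for only a limited set of links.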
The need for low-overhead network monitoring has prompted development of new monitoring platforms. The IDMaps project (see, Francis, et al., “An Architecture for a Global Internet Host Distance Estimation Service,” in Proceedings of IEEE INFOCOM '99, New York City, N.Y., March 1999, incorporated herein by reference in its entirety) produces “latency maps” of the Internet using special measurement servers, called “tracers,” that continually probe each other to determine the distances between them. The measured round-trip times are subsequently used to approximate the latency of arbitrary network paths. Different methods for distributing tracers in the Internet are described in Jamin, et al., “On the Placement of Internet Instrumentation,” in Proceedings of IEEE INFOCOM '2000, Tel Aviv, Israel, March 2000 (incorporated herein by reference in its entirety), one of which is to place them such that the distance of each network node to the closest tracer is minimized.
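As a rough illustration of placing tracers so that every node is close to some tracer, the following sketch uses the generic greedy farthest-point heuristic for the k-center problem. It is not claimed to be the procedure of Jamin, et al.; the function name `greedy_tracer_placement`, the distance table and the choice of starting node are assumptions made for this example.

```python
def greedy_tracer_placement(dist, k):
    """Pick up to k tracer nodes with the farthest-point heuristic.

    dist -- dict of dicts with dist[u][v] = distance between nodes u and v
            (assumed symmetric, complete, and dist[u][u] == 0)
    k    -- number of tracers to place
    Starts from the first node in sorted order (an arbitrary choice), then
    repeatedly adds the node whose closest tracer is currently farthest away.
    """
    nodes = sorted(dist)
    if not nodes or k <= 0:
        return []
    tracers = [nodes[0]]
    while len(tracers) < min(k, len(nodes)):
        farthest = max(nodes, key=lambda n: min(dist[n][t] for t in tracers))
        tracers.append(farthest)
    return tracers


# Hypothetical topology: four nodes on a line at positions 0, 1, 5 and 6.
positions = {"a": 0, "b": 1, "c": 5, "d": 6}
dist = {u: {v: abs(pu - pv) for v, pv in positions.items()}
        for u, pu in positions.items()}
placement = greedy_tracer_placement(dist, 2)
```

With two tracers at the ends of the line, every node in this toy topology is within distance 1 of its closest tracer.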
A drawback of the IDMaps approach is that latency measurements may not be sufficiently accurate. Due to the small number of paths actually monitored, it is possible for errors to be introduced when round-trip times between tracers are used to approximate arbitrary path latencies.
Recently, Breitbart, et al., “Efficiently Monitoring Bandwidth and Latency in IP Networks,” in Proceedings of the IEEE INFOCOM '2000, Tel-Aviv, Israel, March 2000 (incorporated herein by reference in its entirety), proposed a monitoring scheme where a single network operations center (NOC) performs all the required measurements. To monitor links not in its routing tree, the NOC uses the IP source routing option to explicitly route probe packets along the link. Unfortunately, because of security concerns, the IP source routing option is frequently disabled on routers. Consequently, approaches that rely on explicitly routed probe messages for delay and fault monitoring are not feasible in many of today's ISP and enterprise environments.
In other recent work on monitoring, Shavitt, et al., “Computing the Unmeasured: An Algebraic Approach to Internet Mapping,” in Proceedings of IEEE INFOCOM 2001, Alaska, April 2001 (incorporated herein by reference in its entirety), proposes to solve a linear system of equations to compute delays for smaller path segments from a given set of end-to-end delay measurements for paths in the network. Similarly, Bu, et al., “Network Tomography on General Topologies,” in Proceedings of the ACM SIGMETRICS, June 2002 (incorporated herein by reference in its entirety), considers the problem of inferring link-level loss rates and delays from end-to-end multicast measurements for a given collection of trees. Finally, Dilman, et al., “Efficient Reactive Monitoring,” in Proceedings of the IEEE INFOCOM '2001, Alaska, April 2001 (incorporated herein by reference in its entirety), studies ways to minimize the monitoring communication overhead for detecting alarm conditions due to threshold violations.
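The idea of recovering segment delays from end-to-end measurements can be sketched as solving A x = b, where each row of A marks the segments a measured path traverses and b holds the measured path delays. The sketch below is a generic Gauss-Jordan solver over exact fractions, not the algorithm of Shavitt, et al.; the function name `solve_segment_delays`, the restriction to a square system and the sample numbers are assumptions for illustration.

```python
from fractions import Fraction


def solve_segment_delays(routing_rows, path_delays):
    """Solve A x = b for per-segment delays with exact arithmetic.

    routing_rows[i][j] -- 1 if measured path i traverses segment j, else 0
    path_delays[i]     -- measured end-to-end delay of path i
    Raises ValueError if the system is not square or if the measurements
    do not determine every segment delay uniquely.
    """
    n = len(routing_rows)
    if any(len(row) != n for row in routing_rows) or len(path_delays) != n:
        raise ValueError("expected a square system")
    # Augmented matrix [A | b].
    aug = [[Fraction(v) for v in row] + [Fraction(b)]
           for row, b in zip(routing_rows, path_delays)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("segment delays are not uniquely determined")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


# Hypothetical example: three segments and three measured paths, where
# path 1 = seg 1 + seg 2, path 2 = seg 2 + seg 3, path 3 = seg 1 + seg 3.
delays = solve_segment_delays([[1, 1, 0], [0, 1, 1], [1, 0, 1]], [7, 9, 8])
```

When the measured paths do not separate two segments (for instance, every path that crosses one also crosses the other), the matrix is singular and the sketch raises an error rather than guessing individual delays.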
Reddy, et al., “Fault Isolation in Multicast Trees,” in Proceedings of the ACM SIGCOMM, 2000 and Adler, et al., “Tree Layout for Internal Network Characterizations in Multicast Networks,” in Proceedings of NGC '01, London, UK, November 2001 (both incorporated herein by reference in their entirety), consider the problem of fault isolation in the context of large multicast distribution trees. The schemes in Reddy, et al., supra, achieve some efficiency by having each receiver monitor only a portion of the path (in the tree) between it and the source, but require receivers to have some monitoring capability (e.g., the ability to do multicast traceroute).
Adler, et al., supra, focuses on the problem of determining the minimum cost set of multicast trees that cover links of interest in a network. Unfortunately, Adler, et al., supra, does not consider network failures or issues such as minimizing the monitoring overhead due to probe messages. Also, Adler, et al., supra, covers only links and not the problem of selecting the minimum number of monitoring stations whose routing trees cover links of interest; routing trees usually are more constrained (e.g., shortest path trees) than multicast trees.
Most of the systems for monitoring IP networks described above suffer from three major drawbacks. First, the systems do not guarantee that all links of interest in the network are monitored, especially in the presence of network failures. Second, the systems have limited support for accurately pinpointing the location of a fault when a network link fails. Finally, the systems pay little or no attention to minimizing the overhead (due to additional probe messages) imposed by monitoring on the underlying production network. Accordingly, what is needed in the art is a system that fully and efficiently monitors link latencies and faults in an IP network using path-oriented tools.