The present invention relates generally to networking, and more particularly, to a real-time tool for detecting and diagnosing routing problems by passive and active measurements.
As the Internet starts to carry more and more mission-critical services such as Voice-over-IP (VoIP), it is imperative that network performance be maintained, and that network operators have the requisite tools to enable the quick detection and correction of failures in the control plane. Studies have demonstrated that there are many things that can negatively impact Internet routing, including misconfiguration, physical equipment failures, and cyber attacks. From a network operator's perspective, early detection of network routing problems is crucial, to enable mitigation of the same either directly or by the appropriate entity. For example, today's land-line telephone customers are accustomed to a 99.999% reliability rate. This translates into less than 6 minutes of downtime per year, a level of reliability far greater than that of current public Internet service. As greater numbers of customers seeking to lower their telephone costs transition to VoIP, they will be faced with the reality of service interruptions, and network providers will be pressured to improve their response time in order to remain competitive. Currently, network operators primarily rely on three sources of information to detect Internet routing problems. They monitor routing protocol announcements, perform some limited active probing (mainly within their own network), and investigate customer complaints. For a variety of reasons, however, none of these approaches is sufficient to provide reliability comparable to that of current land-line services.
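The downtime figure associated with a 99.999% reliability rate can be checked with a short calculation. The following is a minimal sketch, assuming a 365-day year; the function name `annual_downtime_minutes` is illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, permitted by an availability percentage."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


five_nines = annual_downtime_minutes(99.999)
print(f"{five_nines:.2f} minutes per year")  # roughly 5.26, i.e. less than 6 minutes
```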
Using routing announcements, it is difficult to determine the existence, location, and severity of a network outage in a timely fashion, because such announcements are issued only after an outage has occurred. Furthermore, since routing announcements are aggregated, even after they are collected it is difficult to determine the existence and location of a network outage. See Feldmann, A., Maennel, O., Mao, Z. M., Berger, A., and Maggs, B., “Locating Internet Routing Instabilities,” in Proceedings of ACM SIGCOMM (2004).
Active probing consumes network resources, so most network operators only perform a limited amount of active probing within their own network, and to a small number of important sites outside their network at a modest frequency. Active probing may be warranted in certain situations, such as, for example, to determine whether customers can reach an important Web site (e.g., Google). The costs associated with active probing can be justified in cases where a site is contacted by many customers. However, in the case of calls that are made between a pair of VoIP endpoints, or of typical peer-to-peer (P2P) communications, the limited paths traversed over the Internet do not warrant the cost of frequent active probing. On the other hand, if active probing is not performed frequently, it is impossible to react quickly enough to improve network uptime.
Waiting to receive customer complaints in order to detect network outages is the least preferred method from a network operator's perspective. Not only does this approach hamper customer satisfaction, but the necessity for human intervention renders it slow and expensive, and can make diagnosis difficult. Descriptions of network problems that are typically provided by customers are often incomplete and misleading. Moreover, in the case of VoIP services, a customer may not even be able to reach the network provider if the network is down.
The Transmission Control Protocol (TCP) is used as a reliable transport protocol for many Internet applications. TCP recovers from data that is lost, duplicated, or delivered out of order by assigning a sequence number to each byte transmitted and requiring an acknowledgment (ACK) from the target receiver. When using TCP, sequence numbers are employed by the receiver to correctly re-order segments and eliminate duplicates. TCP uses slow-start and congestion avoidance algorithms to control data transmission. When congestion occurs, TCP slows down the packet transmission rate, and then invokes the slow-start algorithm to initiate the recovery.
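By way of illustration, the receiver-side use of sequence numbers to re-order segments and eliminate duplicates may be sketched as follows. This is a simplified model rather than an actual TCP implementation: the class name `Reassembler` and its methods are hypothetical, sequence-number wraparound is ignored, and buffered out-of-order segments are assumed not to partially overlap one another.

```python
class Reassembler:
    """Toy receiver that re-orders segments by byte sequence number (hypothetical)."""

    def __init__(self, initial_seq: int = 0):
        self.next_seq = initial_seq   # next in-order byte the receiver expects
        self.pending = {}             # out-of-order segments, keyed by starting sequence number
        self.delivered = bytearray()  # bytes handed to the application, in order

    def receive(self, seq: int, payload: bytes) -> int:
        """Accept one segment and return the cumulative ACK number."""
        if seq + len(payload) <= self.next_seq:
            return self.next_seq      # entirely duplicate data: discard and re-ACK
        if seq < self.next_seq:
            # Trim the portion that was already delivered.
            payload = payload[self.next_seq - seq:]
            seq = self.next_seq
        self.pending.setdefault(seq, payload)  # a repeated buffered segment is ignored
        # Deliver every segment that is now contiguous with the in-order data.
        while self.next_seq in self.pending:
            chunk = self.pending.pop(self.next_seq)
            self.delivered += chunk
            self.next_seq += len(chunk)
        return self.next_seq


receiver = Reassembler()
print(receiver.receive(5, b"world"))  # out of order: ACK still asks for byte 0
print(receiver.receive(0, b"hello"))  # fills the gap: ACK advances past both segments
print(bytes(receiver.delivered))
```

In this sketch the value returned by `receive` plays the role of the cumulative ACK: it does not advance while a gap remains, so an out-of-order arrival yields a repeated ACK number.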
TCP detects packet loss in two ways: Retransmission Time Out (RTO) and duplicate acknowledgments (ACKs). If an ACK is not received within the RTO, the TCP sender assumes that the packet is lost and retransmits the data. Alternatively, upon receiving an out-of-order segment, the TCP receiver sends an immediate duplicate ACK. This informs the sender that a segment was received out of order, and of the expected sequence number. In addition, the TCP receiver sends an immediate ACK when the incoming segment fills in all or part of a gap in the sequence. This generates more timely information for the sender's recovery. The TCP sender uses a fast-retransmit algorithm to detect and repair packet loss based on incoming duplicate ACKs. After the arrival of three duplicate ACKs (four identical ACKs without the arrival of any other intervening packet), TCP performs a retransmission of what appears to be the missing segment, without waiting for the retransmission timer to expire.
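The duplicate-ACK rule described above may be sketched as a small counter on the sender side. This is an illustrative simplification under stated assumptions: the class name `FastRetransmitDetector` is hypothetical, an ACK is treated as a duplicate solely because it repeats the previous ACK number, and the additional conditions that a real TCP stack checks (for example, that data is outstanding and that the ACK carries no payload) are omitted.

```python
DUP_ACK_THRESHOLD = 3  # three duplicate ACKs, i.e. four identical ACKs in a row


class FastRetransmitDetector:
    """Counts consecutive duplicate ACKs seen by a sender (hypothetical sketch)."""

    def __init__(self):
        self.last_ack = None  # most recent ACK number observed
        self.dup_count = 0    # duplicates of last_ack seen so far

    def on_ack(self, ack_no: int) -> bool:
        """Return True when this ACK should trigger a fast retransmit."""
        if ack_no == self.last_ack:
            self.dup_count += 1
            # Fire exactly once, on the third duplicate.
            return self.dup_count == DUP_ACK_THRESHOLD
        # A new ACK number resets the duplicate count.
        self.last_ack = ack_no
        self.dup_count = 0
        return False


detector = FastRetransmitDetector()
for ack in [1000, 1000, 1000, 1000]:
    if detector.on_ack(ack):
        print(f"fast retransmit of segment starting at byte {ack}")
```

The retransmission is triggered on the fourth identical ACK, before any retransmission timer would need to expire.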
In view of the foregoing, there exists a need for a methodology for diagnosing routing problems that utilizes both passive and active measurements, while limiting the amount of active probing to conserve network resources.