1. Field
The present invention relates to computer networks. More particularly, the present invention relates to the Internet and to a method for real-time visualization of BGP (Border Gateway Protocol) analysis and troubleshooting. In the context of this document the term “BGP” shall mean what is understood by a person skilled in the art of computer networks.
2. Description of Related Art
The Internet consists of many separate Administrative Systems (AS) that manage their networks independently of each other. In the context of this document the term “Administrative System” shall mean what is understood by a person skilled in the art. Internet functionality depends heavily on the operation of routers, which forward each packet towards its destination based on local information. Any performance degradation of a routing protocol leads to a decrease in the overall performance of the Internet. Therefore, a primary concern for network operators, those who manage and maintain Internet domains, is the smooth operation of the series of network domains that comprise the Internet.
The Border Gateway Protocol (BGP) (Y. Rekhter, T. Li, S. Hares, “A Border Gateway Protocol 4 (BGP-4)”, Internet Draft draft-ietf-idr-bgp4-26.txt, Work In Progress, October 2004) is the routing protocol that ASes use to exchange information about how to reach destination address blocks. These destination address blocks are termed “prefixes” by persons skilled in the art. Three aspects of BGP are important:
1) Path-vector protocol: Each BGP advertisement includes a list of ASes along the path, along with other attributes such as next-hop IP address. By representing the path at the AS level, BGP hides the details of the topology and routing inside each network.
2) Incremental protocol: A router sends an advertisement of a new route for a prefix or a withdrawal when the route is no longer available. Every BGP update message is indicative of a routing change, such as an old route disappearing or a new route becoming available.
3) Policy-oriented protocol: Routers can apply complex policies to influence the selection of the best route for each prefix and to decide whether to propagate this route to neighbors. Knowing why a routing change occurs requires understanding how policy affects the decisions.
To select a single best route for each prefix, a router applies the decision process illustrated in FIG. 1 to compare the routes learned from BGP neighbors. In backbone networks, the selection of BGP routes depends on the interaction between three routing protocols: External BGP (eBGP), Internal BGP (iBGP) and Interior Gateway Protocol (IGP). In the context of this document the terms “eBGP, iBGP, IGP” shall mean what is understood by a person skilled in the art. The border routers at the periphery of the network learn how to reach external destinations through eBGP sessions with routers in other ASes. A large network often has multiple eBGP sessions with another AS at different routers. This is a common requirement for two ASes to have a peering relationship. After applying local policies to the eBGP-learned routes, a border router selects a single best route and uses iBGP to advertise the route to the rest of the AS. In the simplest case, each router has an iBGP session with every other router inside the AS (i.e. a full-mesh topology). The routers inside the AS run IGP to learn how to reach each other.
The two most common IGPs are OSPF and IS-IS, which compute shortest paths based on configurable link weights. The routers use the IGP path costs in the seventh step in FIG. 1 to select the closest egress point. The decision process in FIG. 1 is what a router uses to compare two routes based on their attributes.
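By way of illustration only, an attribute-by-attribute comparison of this kind can be sketched as follows. FIG. 1 is not reproduced in this text, so the step ordering below is an assumption based on the commonly described BGP decision process, arranged so that the IGP path cost is the seventh step as stated above. The `Route` type, its field names and the `select_best` function are hypothetical, and the MED comparison is simplified (a real router typically compares MED only among routes from the same neighboring AS).

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Route:
    """Hypothetical container for the attributes compared by the decision process."""
    prefix: str
    as_path: Tuple[int, ...]       # ASes along the path (path-vector attribute)
    egress_reachable: bool = True  # step 1: ignore routes whose egress is unreachable
    local_pref: int = 100          # step 2: higher is better
    origin: int = 0                # step 4: 0 = IGP, 1 = EGP, 2 = incomplete
    med: int = 0                   # step 5: lower is better (simplified)
    learned_via_ebgp: bool = True  # step 6: eBGP-learned preferred over iBGP-learned
    igp_cost: int = 0              # step 7: IGP path cost to the egress point
    router_id: str = "0.0.0.0"     # step 8: final tie-breaker


def preference_key(route: Route) -> tuple:
    """Sort key: a smaller key means a more preferred route (assumed ordering)."""
    return (
        -route.local_pref,                         # step 2
        len(route.as_path),                        # step 3: shorter AS path
        route.origin,                              # step 4
        route.med,                                 # step 5
        0 if route.learned_via_ebgp else 1,        # step 6
        route.igp_cost,                            # step 7: closest egress point
        tuple(int(x) for x in route.router_id.split(".")),  # step 8
    )


def select_best(routes: Iterable[Route]) -> Optional[Route]:
    """Return the single best route for a prefix, or None if none is usable."""
    candidates = [r for r in routes if r.egress_reachable]  # step 1
    return min(candidates, key=preference_key, default=None)


# Two otherwise identical routes: the one with the lower IGP cost wins.
near = Route("192.0.2.0/24", (65001, 65002), igp_cost=10, router_id="10.0.0.1")
far = Route("192.0.2.0/24", (65001, 65002), igp_cost=50, router_id="10.0.0.2")
best = select_best([far, near])
```

Expressing the steps as a lexicographic key keeps the sketch short: a later attribute is consulted only when all earlier attributes tie, which mirrors the step-by-step elimination described above.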
As one of the universally deployed major Internet protocols, BGP has a large impact on the overall performance of the Internet through its convergence, survivability and stability. It is widely held by persons skilled in the art that high volumes of BGP routing updates, known as BGP churn, coupled with the slow convergence behavior of BGP can cause severe rippling instability across large portions of the Internet. For instance, BGP instability may lead to transient routing loops or loss of reachability for traffic destined for a certain network prefix. A link failure in a remote AS could trigger a shift in how traffic travels through a network, perhaps causing congestion on one or more links. High volumes of updates may overload router CPUs and cause router “melt-downs”, thereby leading to disruptions in traffic forwarding, etc. Ensuring good performance in an IP backbone network requires continuous monitoring to detect and diagnose problems, as well as quick responses from management systems and human operators to limit the effects on end users of the Internet. In the context of this document, the term “operator” shall mean a human operator of a network management system in general unless specified otherwise.
Several approaches have been proposed for root-cause analysis of BGP routing changes: M. Caesar, L. Subramanian and R. Katz “Towards localizing root causes of BGP dynamics” Tech. Rep. CSD-03-1292, UC Berkeley, November 2003, D. Chang, R. Govindan and J. Heidemann “The temporal and topological characteristics of BGP path changes” Proceedings of IEEE ICNP, November 2003, A. Feldmann, O. Maennel, Z. Mao, A. Berger and B. Maggs, “Locating Internet routing instabilities”, Proceedings of ACM Sigcomm, August 2004, M. Lad, A. Nanavati, D. Massey and L. Zhang, “An algorithmic approach to identifying link failures”, Proceedings of Pacific Rim Dependable Computing, 2004, T. Wong, V. Jacobson and C. Alaettinoglu, “Making sense of BGP”, Nanog presentation, February 2004, K. Xu, J. Chandrashekar, Z. L. Zhang, “A First Step Toward Understanding Inter-Domain Routing Dynamics”, ACM Sigcomm 2005 Workshop on Mining Network Data, August 2005. These studies analyze streams of BGP update messages from several vantage points throughout the Internet, i.e., views from several routers in different ASes, with the goal of inferring the cause and location of routing changes. Although these approaches are interesting because they can address important questions such as “identification of ASes that are involved in the same problem and a very detailed analysis of the problem”, they do not consider what really matters to an operator, i.e. the view of their specific AS. On the other hand, in J. Wu, Z. M. Mao, J. Rexford, J. Wang, “Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network”, NSDI 2005, the authors proposed a new approach for root-cause analysis that analyzes BGP routing changes seen inside a single AS in order to identify and quantify the effects of these changes on that specific network. Combining these two methodologies in a real system would be of considerable help to operators.
All methods belonging to the two families of approaches described above suffer from two serious limitations. First, they produce “cold” textual reports that try to explain and justify events that happen behind the scenes. Operators currently have to sift through these “large” reports to gain insight into their networks. This procedure is time-consuming and inefficient.
Second, most of the approaches offer only an off-line capability for data analysis. Although this is still of great help to operators, detecting and reporting problems in real time is more productive because operators can react earlier to a problem in progress and so avoid exposure to major problems later on. The only work that claims to be able to process a large number of BGP messages in real time is presented in J. Wu, Z. M. Mao, J. Rexford, J. Wang, “Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network”, NSDI 2005. We believe that some steps of their methodology, like grouping BGP events into BGP clusters, should be moved into a different module that has enough time to run more complex aggregation and analysis on the data to better qualify the anomaly and avoid introducing delay into the system.
Meanwhile, several visualization tools were proposed as an alternative to the approaches mentioned above. These visualization tools gained a lot of attention from the community because they let users develop an accurate picture of what is normal on their own network so that they can diagnose problems better. Although these tools represent a good start for detecting a problem, they do not classify the anomalous events, nor do they point out directions or give suggestions to the users as to the most likely causes. In the context of this document, the term “user” means a user of a network management tool unless specified otherwise. In this definition a user may or may not be an operator, that is, a human operator of a network management system in general.
Several visualization tools have been proposed to monitor BGP dynamics in real time. One existing system, BGPlay (L. Colitti, G. Di Battista, I. De Marinis, F. Mariani, M. Pizzonia, and M. Patrignani, Bgplay), maps BGP path attributes to an AS graph. When the user starts BGPlay, a query window appears, where the user enters the prefix and time interval. The BGPlay server then queries the database for all updates to the specified prefix during the specified time interval. An animation window displays routing activity of the specified prefix. The left part of the animation window shows a histogram which plots the number of events over time on a logarithmic scale. The main part of the window contains the AS graph. Each number represents an AS, and the originating AS is colored red. If the user clicks on an AS, its name and description are shown. Each line represents an AS path. Each path starts from the originating AS and stops at the AS of a RIS peer. The dashed lines represent paths that did not change during the query interval, while the solid lines represent paths that did change. Each solid path is drawn with a unique color for identification purposes.
Another system that also maps BGP attributes to an AS graph is LinkRank: M. Lad, D. Massey, and L. Zhang, “Linkrank: A graphical tool for capturing bgp routing dynamics”, Proceedings of the IEEE/IFIP Network Operations and Management Symposium NOMS, April 2004. In a LinkRank graph, the weight of an inter-AS link is determined by the number of prefixes having an AS path that includes that link. In a Rank-Change graph, the weight on each link is the difference in the LinkRank of that link between two points in time. A negative weight indicates routes lost on a link, while a positive weight indicates routes gained in that time period.
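The two weights just described can be illustrated with a minimal sketch. This is not the LinkRank implementation; the function names `link_rank` and `rank_change`, the representation of a routing table as a mapping from prefix to AS path, and the treatment of links as directed pairs are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Mapping, Sequence, Tuple

Link = Tuple[int, int]


def link_rank(routes: Mapping[str, Sequence[int]]) -> Counter:
    """Weight of each inter-AS link = number of prefixes whose AS path uses it.

    `routes` maps a prefix to its AS path (hypothetical input format).
    """
    weights: Counter = Counter()
    for as_path in routes.values():
        # Collect each link once per prefix; skip repeated ASes (path prepending).
        links = {(a, b) for a, b in zip(as_path, as_path[1:]) if a != b}
        weights.update(links)
    return weights


def rank_change(before: Mapping[Link, int], after: Mapping[Link, int]) -> Dict[Link, int]:
    """Per-link difference in weight; negative = routes lost, positive = routes gained."""
    changes = {}
    for link in set(before) | set(after):
        delta = after.get(link, 0) - before.get(link, 0)
        if delta != 0:  # links whose weight did not change are omitted
            changes[link] = delta
    return changes


# Two snapshots: the path for 10.0.0.0/8 moves from AS 2 to AS 5.
snapshot_t0 = link_rank({"10.0.0.0/8": (1, 2, 3), "192.0.2.0/24": (1, 2, 4)})
snapshot_t1 = link_rank({"10.0.0.0/8": (1, 5, 3), "192.0.2.0/24": (1, 2, 4)})
delta = rank_change(snapshot_t0, snapshot_t1)
```

In this example the links (1, 2) and (2, 3) each lose one route and the links (1, 5) and (5, 3) each gain one, while (2, 4) is unchanged and therefore absent from `delta`.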
Another similar system is the TAMP graph (T. Wong, V. Jacobson, and C. Alaettinoglu, “Internet routing anomaly detection and visualization”, Proceedings of the 2005 International Conference on Dependable Systems and Networks DSN'05, pages 172-181, 2005) which shows how many prefixes are carried over an AS-AS link. At the bottom is an animation clock, displaying the time into the incident currently being shown. The plot to the right of the controls shows how the number of prefixes varied with time on whichever edge is selected in the TAMP graph. The edge colors indicate how the routes are changing: black means not changing; blue means the edge is losing prefixes; green means the edge is gaining prefixes; yellow means the prefix count is flapping too fast to animate; and an edge that has lost prefixes also has a gray shadow that indicates the largest number of prefixes it ever carried. The thickness of the non-gray part of an edge is proportional to the number of prefixes it is currently carrying. The Elisha system (S. T. Teoh, K.-L. Ma, S. F. Wu, and X. Zhao, “Case study: Interactive visualization for internet security”, Proceedings of the IEEE Visualization Conference 2002, pages 505-508, 2002) also contains network visualization of BGP updates. In this system, all the paths from the AS at the observation point to the originating AS of the IP prefix are plotted. This system also allows animation over time, so that at each frame, all the AS paths used in the time interval are displayed. The color represents how recently the path was used within the currently displayed time window. These existing systems focus only on basic information, i.e. BGP updates, and do not give any deeper insight into the problem.
Therefore what is needed is a system and method that establish a real-time interaction between end-users and network traffic such that users can gain insight into both network dynamics and hidden traffic patterns.