The invention relates generally to monitoring an internetwork consisting of multiple networks joined by routers. More specifically, the invention relates to methods and systems that monitor route flapping.
Border Gateway Protocol (BGP) is a gateway protocol that can be used on the Internet to provide loop-free routing within a single Autonomous System (AS), as internal BGP (iBGP), or between different ASs, as external BGP (eBGP). The Internet consists of independently administered networks connected by routers to form a single internetwork. ASs are smaller internetworks whose routers exchange routing information with each other using Interior Gateway Protocols (IGPs) such as Routing Information Protocol (RIP) and Interior Gateway Routing Protocol (IGRP). These IGPs do not scale well enough to handle the exchange of routing information between the border routers that join ASs together.
To exchange routing information between border routers, Exterior Gateway Protocols (EGPs) such as BGP are used. RIP and IGRP are distance-vector protocols; BGP is based on the closely related path-vector approach, which enables groups of routers to share their routing information in an efficient and scalable manner.
The routing information that BGP exchanges between border routers is called Network Layer Reachability Information (NLRI). It specifies the other ASs to which data can be forwarded from the local AS, and the most efficient routes (best paths) for doing so.
One problem condition that may occur with dynamic routers on large internetworks is route flapping. When a router is flapping, it broadcasts routing table updates that alternate between two different routes to a host. For example, the flapping router may indicate during the first broadcast that route A is the best route to a given host, indicate during the second broadcast that route B is the best route, indicate during the following broadcast that route A is best, and so on. Flapping routers generate unnecessary routing traffic over the network. Flapping of this kind generally occurs when a router is unnecessarily configured to load balance between paths with equal hop counts. Routes that flap frequently are usually not reliable for forwarding traffic, and frequent flapping increases the load on routers across the Internet. To determine whether a router is flapping, a network fault management system analyzes the alerts received from all routers in the network.
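The alternating pattern described above can be illustrated with a minimal Python sketch. The function names `count_flaps` and `is_alternating`, and the representation of each broadcast as a route label, are assumptions made for illustration only; they are not taken from any router software or fault management system.

```python
from typing import Sequence


def count_flaps(best_routes: Sequence[str]) -> int:
    """Count how often the advertised best route changes between consecutive updates."""
    return sum(1 for prev, cur in zip(best_routes, best_routes[1:]) if prev != cur)


def is_alternating(best_routes: Sequence[str]) -> bool:
    """Return True if the updates strictly alternate between exactly two routes."""
    if len(best_routes) < 3 or len(set(best_routes)) != 2:
        return False
    # With only two distinct routes, "every update differs from the previous one"
    # is the same as strict A, B, A, B alternation.
    return all(prev != cur for prev, cur in zip(best_routes, best_routes[1:]))


# Hypothetical sequence of best-route announcements for one host.
updates = ["A", "B", "A", "B", "A"]
flaps = count_flaps(updates)
alternating = is_alternating(updates)
```

For the sequence shown, every update reverses the previous one, so each of the four consecutive pairs counts as a change.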
Route flapping is often associated with interface flapping, where an interface on a router has a hardware failure that causes the router to announce a route alternately as “up” and “down.” Route flapping may also be caused by hardware errors, software errors, configuration errors, intermittent errors in communications links, unreliable connections, and other conditions within the network that cause certain reachability information to be repeatedly advertised and withdrawn.
In networks where a link-state routing protocol is run, route flapping will force frequent recalculation of the topology by all participating routers. In networks where a distance-vector routing protocol is run, route flapping can trigger routing updates with every state change. In both cases, route flapping prevents a network from converging.
A state of convergence is achieved once all routing protocol-specific information has been distributed to all routers participating in the routing protocol process. Any change in the network that affects routing tables will break the convergence temporarily until the change has been successfully communicated to all other routers. Convergence time is a measure of how fast a group of routers reaches the state of convergence. A main design goal of routing protocols, and an important indicator of their performance, is a mechanism that allows all routers running the protocol to converge quickly and reliably.
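As an illustration of convergence, the following Python sketch simulates a simplified, synchronous distance-vector exchange and counts the rounds needed until no routing table changes. The function `converge`, the router names, and the link costs are hypothetical, and the model ignores timers, split horizon, and link failures; it is a teaching sketch, not a description of any particular protocol.

```python
def converge(links, max_rounds=50):
    """Exchange distance tables between neighbors until no table changes.

    links maps an undirected link (a, b) to a positive cost.
    Returns (rounds_with_changes, tables).
    """
    nodes = {n for edge in links for n in edge}
    neighbors = {n: {} for n in nodes}
    for (a, b), cost in links.items():
        neighbors[a][b] = cost
        neighbors[b][a] = cost

    # Initially each router knows only the route to itself.
    tables = {n: {n: 0} for n in nodes}

    for round_number in range(1, max_rounds + 1):
        changed = False
        # Every router advertises the table it held at the start of the round.
        snapshot = {n: dict(t) for n, t in tables.items()}
        for n in nodes:
            for nbr, cost in neighbors[n].items():
                for dest, dist in snapshot[nbr].items():
                    if dest not in tables[n] or cost + dist < tables[n][dest]:
                        tables[n][dest] = cost + dist
                        changed = True
        if not changed:
            # This round changed nothing, so the network has converged.
            return round_number - 1, tables
    raise RuntimeError("network did not converge within max_rounds")


# Hypothetical four-router chain: R1 - R2 - R3 - R4, each link with cost 1.
rounds, tables = converge({("R1", "R2"): 1, ("R2", "R3"): 1, ("R3", "R4"): 1})
```

In this chain, information about R4 needs three rounds to reach R1, so the round count grows with the diameter of the network. A link that repeatedly changes state would restart this process before it could finish.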
Certain configuration and hardware conditions will prevent a network from ever converging. For instance, a flapping interface may cause conflicting information to propagate through the network so that routers never agree on its current state. Under certain circumstances it might even be desirable to withhold routing information from parts of the network, thereby deliberately keeping the network unconverged.
Route flapping alarms used today may cause many network outages to go undetected. If a route flapping alarm threshold is set too low, false positives may result. If a route flapping alarm threshold is set too high, intermittent conditions requiring action may be missed or responsiveness may be reduced.
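The threshold trade-off can be made concrete with a minimal Python sketch of a single-session alarm that counts state transitions in a sliding time window. The class `FlapAlarm`, its parameters `threshold` and `window_seconds`, and the timestamps used are all assumptions for illustration; actual alarm logic and default values vary by system.

```python
from collections import deque


class FlapAlarm:
    """Raise an alarm when enough transitions fall inside a sliding time window."""

    def __init__(self, threshold: int, window_seconds: float):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = deque()  # timestamps of recent transitions

    def record_transition(self, timestamp: float) -> bool:
        """Record one up/down transition and return True if the alarm fires."""
        self._events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Discard transitions that have aged out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) >= self.threshold


# Hypothetical transition times, in seconds, for one session.
transitions = (0, 10, 20, 200)

low = FlapAlarm(threshold=2, window_seconds=60)
low_results = [low.record_transition(t) for t in transitions]

high = FlapAlarm(threshold=5, window_seconds=60)
high_results = [high.record_transition(t) for t in transitions]
```

With the same input, the low threshold fires on the second transition, which may be a false positive if the session then stabilizes, while the high threshold never fires and would miss the short burst entirely.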
The logic currently used to determine a route flapping condition between a local router and a peer router is based on a single session. What is needed is a method and system that improves route flapping alarm logic by combining information across multiple links (multiple pairs of local and peer routers) to eliminate false positives.