The invention relates generally to network communications. More specifically, the invention relates to a real-time troubleshooting framework for VPN backbones that troubleshoots network events exhibiting significant disruptions, and provides root cause analysis and mitigation suggestions.
Layer-3 Virtual Private Networks (VPNs) have had significant and growing commercial deployments. Effectively troubleshooting network events in VPN backbones for provider-based VPN networks is critical since these networks often carry traffic of important applications. Compared to traditional IPv4 (Internet Protocol version 4) networks, there is an even bigger scalability issue for managing a VPN backbone due to each VPN customer's ability to use the entire IPv4 address space.
A VPN is a communication network tunneled through another network and dedicated to a specific network. One common application is secure communication through the public Internet, but a VPN need not have explicit security features, such as authentication or content encryption. VPNs can be used to separate the traffic of different user communities over an underlying network with strong security features. A VPN may have best-effort performance, or may have a defined Service Level Agreement (SLA) between the VPN customer and the VPN service provider. Generally, a VPN has a topology more complex than Point-to-Point (P-to-P). The distinguishing characteristic of VPNs is not security or performance, but that they overlay other network(s) to provide a certain functionality that is meaningful to a user community.
A layer-3 VPN is a set of sites that communicate over a network infrastructure called a VPN backbone, with restricted communication from external sites. The VPN backbone is typically shared by multiple VPNs, which are referred to as VPN customers.
VPNs were previously provisioned using layer-2 technologies, such as Frame Relay (FR) and Asynchronous Transfer Mode (ATM). However, layer-2 VPNs do not scale well because the number of virtual circuits required to achieve optimal routing grows non-linearly as the network grows. More recently, layer-3 VPNs have had significant and growing commercial deployments. Unlike layer-2 VPNs, layer-3 VPNs use Border Gateway Protocol (BGP), with a set of extensions known as BGP-VPN, to exchange the routes for VPN prefixes of a VPN customer among all the Provider Edge (PE) routers that are attached to the same VPN customer.
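The non-linear growth in virtual circuits can be illustrated with a minimal sketch. It assumes that optimal routing requires a full mesh, that is, one virtual circuit between every pair of sites; the text above does not state this explicitly, and the function name full_mesh_circuits is hypothetical.

```python
def full_mesh_circuits(sites: int) -> int:
    """Virtual circuits needed to connect every pair of sites directly.

    Assumption: optimal routing means a full mesh, so the count is the
    number of unordered site pairs, sites * (sites - 1) / 2.
    """
    if sites < 0:
        raise ValueError("sites must be non-negative")
    return sites * (sites - 1) // 2


# Growing the network tenfold grows the circuit count roughly a hundredfold.
for n in (10, 100, 1000):
    print(n, full_mesh_circuits(n))
```

Under this assumption, 10 sites need 45 circuits, 100 sites need 4,950, and 1,000 sites need 499,500.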
Similar to IP networks, VPNs are vulnerable to unexpected network events such as hardware failures, misconfigurations, and routing disruptions. Because VPNs usually carry mission-critical applications, such as Voice over Internet Protocol (VoIP) and financial transactions, that do not react well to network disruptions, it is highly desirable for network operators to react quickly to failures to ensure reliable services.
There are a number of challenges in achieving real-time troubleshooting of VPNs. First are the common problems associated with managing large IP networks. For example, the data volume is large, and processing it consumes significant resources. Network measurement data can be imperfect or missing due to measurement errors such as noise and transmission errors such as data loss. Additionally, the troubleshooting tool needs to satisfy real-time constraints so that operators are able to react quickly to network events.
Second, compared to a traditional IP network, operators face an even bigger scalability issue with VPNs due to the freedom of each individual VPN to use the entire IPv4 address space. A Route Distinguisher (RD) is assigned to each VPN. The tuple (RD, IP prefix) is used to uniquely identify a VPN prefix in a VPN backbone. As a result, the total number of routes observed in a VPN backbone is significantly larger than that observed in an IP backbone.
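The role of the (RD, IP prefix) tuple can be sketched as follows. This is an illustration only: the RD strings, the PE names, and the helper vpn_prefix are hypothetical and are not taken from any particular router implementation.

```python
import ipaddress
from typing import Dict, Tuple

# A VPN prefix is identified by the tuple (RD, IP prefix).
VpnPrefix = Tuple[str, ipaddress.IPv4Network]


def vpn_prefix(rd: str, prefix: str) -> VpnPrefix:
    """Build the (RD, IP prefix) key for a route (hypothetical helper)."""
    return (rd, ipaddress.ip_network(prefix))


# Hypothetical routes: two VPNs reuse the same private address block.
routes: Dict[VpnPrefix, str] = {}
routes[vpn_prefix("65000:1", "10.0.0.0/8")] = "PE1"
routes[vpn_prefix("65000:2", "10.0.0.0/8")] = "PE2"
routes[vpn_prefix("65000:2", "192.168.1.0/24")] = "PE3"

# Keyed by IP prefix alone, the two 10.0.0.0/8 routes would collide.
distinct_ip_prefixes = {prefix for _, prefix in routes}

print(len(routes), len(distinct_ip_prefixes))
```

Because the RD is part of the key, overlapping customer address space yields distinct routes, which is why the route count in a VPN backbone can exceed the number of distinct IP prefixes.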
Third, unlike IP backbones, where each edge router maintains routes for every prefix, each PE router in a VPN backbone only keeps routes for the VPNs attached to that PE. Therefore, a troubleshooting tool has limited visibility unless a large number of PE routers are monitored, which further increases the scalability challenge.
In view of the above challenges, what is desired is a system and method for a scalable and robust network troubleshooting framework for VPN backbones that addresses scalability in addition to the common problems associated with managing large IP networks, including dealing with imperfect data, handling significant data volume, and satisfying real-time constraints.