Networks that rely on the Internet Protocol (IP) are required to be more reliable than ever before. IP backbones are increasingly used for applications that are more sensitive to packet drops than classical IP applications have been. High-quality voice and video applications, in particular, require greater network resiliency.
When a service outage occurs, measurements of the outage, its duration and cause(s) provide documentation of events for post-mortem analysis, base-lining of expected outages during convergence events, and predictive analysis for early recognition of future failures. Typical network diagnostic tools use probe packets to confirm network connectivity between endpoints. A failure of a probe packet to arrive at an endpoint signals a failure in a network path. Knowledge of which nodes in a network path received the probe packet may be used to estimate where in the network architecture the failure occurred.
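The localization idea described above can be illustrated with a minimal sketch. It assumes that the ordered list of nodes on the path is known and that each node can report whether it saw the probe. The function name `locate_failure` and the node labels are hypothetical and are not taken from any particular diagnostic tool.

```python
from typing import Optional, Sequence, Set, Tuple


def locate_failure(path: Sequence[str], received: Set[str]) -> Optional[Tuple[str, str]]:
    """Estimate the failed segment of a path from probe sightings.

    path     -- nodes in the order the probe traverses them (assumed known)
    received -- nodes that reported seeing the probe (assumed available)

    Returns the (last node that saw the probe, next node that did not),
    or None if no such boundary exists on the path.
    """
    for upstream, downstream in zip(path, path[1:]):
        if upstream in received and downstream not in received:
            return (upstream, downstream)
    return None


# Hypothetical four-node path where the probe was last seen at node "B".
segment = locate_failure(["A", "B", "C", "D"], {"A", "B"})
print(segment)
```

The result is only an estimate: it identifies the first hop pair at which the probe disappeared, not the underlying cause of the loss.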
While using probe packets in this way may provide useful network fault diagnostic information, current diagnostic tools generally provide information only at a macro level, about faults that require intervention (e.g., long-duration faults).
In a typical fault identification system, network probe packets are sent over a network path according to a pre-programmed cycle. In this type of system, probe packets may be sent, for example, every minute or every five minutes to a specific targeted device. The process then moves to the next targeted device until all targeted devices have been sent probe packets. The cycle then starts again with the first targeted device. In this type of system, an outage is observed only if a probe packet is traversing the affected path at the instant the outage occurs. In practice, this means that an outage must last longer than the cycle time of the probe packet cycle to be reliably detected. Often, the fault identification system misses transient outages consisting of very small but steady loss or a burst of errors so brief that the aggregate impact to the thousands or millions of flows carried over large-bandwidth circuits is negligible. While such outages may not impact service in a noticeable way (i.e., create out-of-service conditions), they may affect the quality of video and audio streams and may trigger network responses (e.g., cause a TCP stream to slow down because the transient packet loss is misidentified by network management systems as congestion).
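The relationship between probe cycle time and detectable outage duration can be illustrated with a minimal sketch. It assumes an idealized model in which a target is probed at a fixed offset within a fixed cycle and an outage is noticed only if a probe is sent while the outage is in progress. The function names and the millisecond values are hypothetical.

```python
def first_probe_at_or_after(t_ms: int, offset_ms: int, cycle_ms: int) -> int:
    """Time of the first probe sent to a target at or after time t_ms.

    Probes to the target are assumed to be sent at offset_ms,
    offset_ms + cycle_ms, offset_ms + 2 * cycle_ms, and so on.
    """
    if t_ms <= offset_ms:
        return offset_ms
    cycles = -(-(t_ms - offset_ms) // cycle_ms)  # integer ceiling division
    return offset_ms + cycles * cycle_ms


def outage_detected(start_ms: int, duration_ms: int, offset_ms: int, cycle_ms: int) -> bool:
    """True if some probe is sent while the outage is in progress."""
    probe_ms = first_probe_at_or_after(start_ms, offset_ms, cycle_ms)
    return probe_ms < start_ms + duration_ms


CYCLE_MS = 60_000  # assumed one-minute probe cycle

# A 200 ms outage that begins 10 s into the cycle falls between probes.
print(outage_detected(10_000, 200, 0, CYCLE_MS))
# An outage lasting a full cycle always overlaps a probe.
print(outage_detected(10_000, CYCLE_MS, 0, CYCLE_MS))
```

Under this model, a short outage is caught only when it happens to coincide with a probe, whereas any outage at least one cycle long is always caught.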
Systems for stress testing network configurations in lab environments typically utilize a testing device that operates as a probe packet source, sends probe packets to a network under test, and receives probe packets as a network endpoint from the network under test. The network configuration may be stressed, such as by introducing a failure in a primary path, and the network response is measured by the testing device. The closed loop allows the testing device to precisely establish a reference time for sent and received probe packets and to measure the time for the network under test to reroute traffic around the introduced failure (sometimes referred to herein as the “convergence time”).
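One way such a closed-loop measurement might be reduced to a convergence time estimate is sketched below. The sketch assumes that the testing device sends sequence-numbered probes at a fixed interval and records which sequence numbers return; the estimate is the longest run of consecutive lost probes multiplied by the send interval. This is an illustrative approach, not necessarily the method used by any particular testing device, and all names and values are hypothetical.

```python
from typing import Set


def longest_loss_run(sent_count: int, received: Set[int]) -> int:
    """Length of the longest run of consecutive probes that never returned."""
    longest = 0
    current = 0
    for seq in range(sent_count):
        if seq in received:
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest


def estimate_convergence_ms(sent_count: int, received: Set[int], interval_ms: int) -> int:
    """Approximate convergence time as (lost probes in a row) x (send interval)."""
    return longest_loss_run(sent_count, received) * interval_ms


# Hypothetical run: 10 probes sent every 10 ms; probes 3, 4 and 5 were lost
# while the network under test rerouted around the introduced failure.
print(estimate_convergence_ms(10, {0, 1, 2, 6, 7, 8, 9}, 10))
```

The resolution of an estimate of this kind is bounded by the send interval, which is one reason lab tools depend on tight control of probe timing.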
Observing the behavior of a network configuration in a lab environment provides useful data regarding selected protocols, selected devices and network topology. However, tools designed for use in laboratory environments are not effective for monitoring operational networks because such tools require strict control of network configurations, network traffic and probe timing that is not available in operational networks. Such tools are not useful for observing transient network phenomena.