Modern large-scale routers forward packets in hardware and are capable of extremely high switching rates. They typically support OC-48c and OC-192c link speeds with the ability to bundle several physical ports together to form even higher capacity logical links. Higher speed interfaces such as OC-768c are already available and will likely become a reality in a production backbone sometime this year.
One thing that is still common to all these high capacity connections is the need for some amount of output buffering to accommodate traffic fluctuations and bursts. In the context of this discussion, enabling QoS on an interface means that at least two queues are available for this purpose (FIG. 1). Typically, packets will be classified and placed into one of these queues based on the markings in the packet header (IP Precedence, DiffServ Code Point, MPLS EXP, etc.). When packets arrive in one of these queues at a faster rate than they can be serviced, they will accumulate. In its simplest form, this is the definition of congestion.
A QoS enabled interface, according to prior art, is shown in FIG. 1. The QoS enabled interface includes a minimum of two queues available for output buffering to accommodate traffic fluctuations and bursts to the network. The QoS enabled interface includes a plurality of queues 110, 120, 130 each connected to a network link 100 through a scheduler 180.
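The classification and queuing behavior described above can be illustrated with a minimal sketch. It is not taken from any particular router implementation: the marking-to-queue mapping, the strict-priority service order, the tick-based timing, and the names `classify` and `simulate` are all assumptions made for illustration only.

```python
from collections import deque

# Hypothetical mapping from DiffServ Code Point to queue index (assumption).
DSCP_TO_QUEUE = {46: 0, 26: 1}  # e.g. EF -> queue 0, AF31 -> queue 1
DEFAULT_QUEUE = 2               # all other markings -> last queue


def classify(dscp):
    """Pick an output queue from the packet's header marking."""
    return DSCP_TO_QUEUE.get(dscp, DEFAULT_QUEUE)


def simulate(arrivals, service_per_tick, num_queues=3):
    """Return the depth of every queue at the end of each tick.

    arrivals: one list of DSCP values per tick (the packets arriving then).
    service_per_tick: packets the link can transmit per tick.
    The scheduler here is strict priority (an assumption): lower-numbered
    queues are always drained first.
    """
    queues = [deque() for _ in range(num_queues)]
    depths = []
    for tick in arrivals:
        for dscp in tick:
            queues[classify(dscp)].append(dscp)
        budget = service_per_tick
        for q in queues:
            while budget and q:
                q.popleft()
                budget -= 1
        depths.append([len(q) for q in queues])
    return depths


# Four packets arrive in one tick on a link that can send two per tick:
# the backlog builds briefly and then drains.
history = simulate([[0, 0, 0, 46], [0], []], service_per_tick=2)
```

In this toy run the lowest-priority queue holds a backlog for two ticks and is empty by the third, which is the kind of short-lived accumulation the text defines as congestion.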
Packets that build up in a queue but are then quickly drained off experience a transient congestion condition that is referred to in this text as microcongestion. The time scale for this type of event may only be a few hundred microseconds. These short periods of congestion, at the higher link speeds listed above, will most likely go unnoticed by end users as well as the monitoring systems designed to watch the network. While this kind of transitory output buffering is considered normal behavior for a router, the ability to detect and quantify these events can help identify undesirable traffic trends before they reach levels where they are potentially problematic. Conversely, this type of measurement could also allow for more controlled use of links at higher utilizations.
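To give a sense of the time scale involved, the following sketch estimates how long a queued backlog takes to drain at full line rate. The backlog size is a hypothetical value chosen for illustration, and the calculation uses nominal SONET line rates without subtracting framing overhead, so the results are approximate.

```python
# Nominal SONET line rates in bits per second (framing overhead ignored).
OC48_BPS = 2_488_320_000
OC192_BPS = 9_953_280_000


def drain_time_us(backlog_bytes, line_rate_bps):
    """Time in microseconds to transmit a backlog at full line rate."""
    return backlog_bytes * 8 / line_rate_bps * 1_000_000


# Hypothetical backlog of 250,000 bytes (an assumed figure).
backlog = 250_000
oc192_us = drain_time_us(backlog, OC192_BPS)
oc48_us = drain_time_us(backlog, OC48_BPS)
print(f"OC-192c: {oc192_us:.0f} us, OC-48c: {oc48_us:.0f} us")
```

Under these assumptions the backlog clears in roughly 200 microseconds at OC-192c and roughly four times that at OC-48c, far below the resolution of typical monitoring intervals.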
Although traditional passive and active measurements provide a considerable amount of network performance data, there are still significant gaps that can be filled. More specifically, it would be highly advantageous to have data at a very early stage indicating when traffic patterns or levels are in fact, and not just statistically, exceeding the available capacity of a link. Although something similar can be done with conventional active measurements, where the network is sampled with ping-like probes, the problem is that these methods usually require that performance be degraded to an undesirable level before detection is possible. This is often well past the threshold where the end user can also detect the condition. On very-high-speed connections these types of noticeable congestion events can represent huge amounts of data and impact entire regions of a network.
Passive performance measurements, via Simple Network Management Protocol (SNMP) polling, are probably the most prevalent form of network monitoring in use today. However, there are numerous shortcomings with passive measurements that must also be dealt with. Relatively frequent measurements, at even 5 minute intervals, are still too sparse to show short duration bursts or sudden traffic shifts. As a result, and to compensate for this lack of granularity, links are often run at a much lower utilization level than they actually need to be.
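The loss of granularity can be seen with a small numerical sketch. The traffic profile below is invented for illustration: a link that runs at 30% utilization for five minutes except for a two-second burst at 100%. The function name `utilization_summary` is likewise hypothetical.

```python
def utilization_summary(per_second_util):
    """Return (average, peak) of a series of per-second utilization values."""
    average = sum(per_second_util) / len(per_second_util)
    return average, max(per_second_util)


# 300 one-second samples, i.e. one 5-minute polling interval (assumed data).
samples = [0.30] * 300
samples[120] = samples[121] = 1.0  # two-second burst at full line rate

average, peak = utilization_summary(samples)
print(f"5-minute average: {average:.1%}, peak second: {peak:.0%}")
```

The five-minute figure moves by well under one percentage point, so a poll at that interval reports a link at about 30% even though it was saturated for two full seconds.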
Another limitation of passive measurements is that in networks with diverse hardware and transport protocols they do not always provide a homogeneous view of true network load levels. Many large-scale backbone networks still utilize early generation hardware even in their newer higher-end platforms. Depending on what feature sets may be enabled, these components are very often not capable of line-rate performance. For passive measurements to be accurate they need to account for situations where a link utilization of less than 100% represents the upper limit of the link (e.g., a measured 85% may actually denote 100% of usable capacity). Complicating matters is the fact that as features are enabled or disabled these limits may change.
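The adjustment described above amounts to rescaling the measured utilization by the link's actual ceiling. A minimal sketch follows, assuming the ceiling is known for the current feature configuration; the function name and the sample values are hypothetical.

```python
def effective_utilization(measured_fraction, ceiling_fraction):
    """Rescale measured utilization against the link's real upper limit.

    ceiling_fraction is the highest utilization the hardware can sustain
    with its current feature set (an assumed, externally supplied value).
    """
    if not 0 < ceiling_fraction <= 1:
        raise ValueError("ceiling_fraction must be in the range (0, 1]")
    return measured_fraction / ceiling_fraction


# A link that tops out at 85% of line rate:
at_ceiling = effective_utilization(0.85, 0.85)  # fully loaded in practice
below = effective_utilization(0.68, 0.85)       # about 80% of usable capacity
```

Because the ceiling shifts whenever features are enabled or disabled, any such correction has to be maintained per device and per configuration, which is the practical difficulty the text points out.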
Given the limitations of traditional passive and active measurement techniques, the detection of microcongestion can significantly contribute to the overall task of performance management. The ability to monitor these types of events can act as an early warning system by detecting true layer-3 congestion at its earliest possible stage. Because it simply reflects the current state of the device's layer-3 queues, it is independent of any inherent hardware limitations or lower-level protocol overhead. It can also play a considerable role in the capacity management of links where it is always desirable to highly utilize what can be extremely expensive resources.
The primary benefit of this type of measurement is its ability to detect transient congestion on even very-high-speed network links. Because the time frames for these types of events can be minuscule at high speeds, this methodology provides a level of visibility that would otherwise be impossible outside of a lab environment. Also, since it does not rely on any specific features or capabilities of the networking equipment itself, it is completely vendor independent. The test bursts are processed through the device as any other traffic would be.
A need therefore remains for a cost-effective technique to detect and quantify extremely small congestion events on even large-capacity network links.