Large-scale Internet Protocol (IP) networks feature hundreds of routers and thousands of links. Network-wide state is usually monitored using the Simple Network Management Protocol (SNMP). Routers collect measurements and report them to a specific location at regular intervals, typically five minutes. The choice of interval may depend on the size of the network, since the interval has to be long enough to allow the entire network to be polled within a single interval. At the same time, the interval should not be so short that polling overloads the polled network elements.
The SNMP statistics that are usually collected over the five-minute intervals correspond to average activity of IP links and network elements for the duration of the interval. Within the five-minute interval, the system collects link utilization from every link inside a network. The utilization information is gathered and moved to a network management system in a central location.
Once a network operator decides on the time interval at which network elements are polled, the operator decides on the actual Management Information Base (MIB) metrics on which network elements will report. Usually these MIB metrics include link utilization, packet drops on a per link basis, the CPU utilization of the router itself, etc. In order to avoid router overload and limit the amount of SNMP traffic through the network, network operators usually select a small number of metrics to be polled by the Network Management Station and base their network provisioning decisions on those specific metrics.
Usually routers count how many bytes each link sends. The management station collects these byte counts and subtracts consecutive readings to determine how many bytes were sent over the five-minute interval. Because this yields only a single average per interval, the technique does not provide an adequate understanding of what happens within each interval.
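The subtraction of consecutive counter readings described above can be sketched as follows. This is a minimal illustration, not a description of any particular management station: the function name, the readings, the link capacity, and the handling of a single counter wrap (with an assumed counter width) are all assumptions introduced for the example.

```python
def link_utilization(prev_bytes, curr_bytes, interval_s, capacity_bps, counter_bits=64):
    """Average utilization (0..1) between two readings of a link's byte counter.

    counter_bits is an assumed counter width; the modulo tolerates one wrap
    of the counter between the two readings.
    """
    modulus = 1 << counter_bits
    delta_bytes = (curr_bytes - prev_bytes) % modulus
    return (delta_bytes * 8) / (interval_s * capacity_bps)


# Hypothetical readings taken 300 s apart on a 1 Gbit/s link.
util = link_utilization(1_000_000_000, 12_250_000_000, 300, 1_000_000_000)
print(f"{util:.1%}")  # prints 30.0%
```

The result is one number for the whole interval, so nothing about how the bytes were distributed within the 300 seconds can be recovered from it.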
It is not uncommon for a network operator to collect link utilization measurements and infer delay performance based on the collected information. In fact, each network provider usually aims for network-wide link utilizations that do not exceed the “acceptable utilization levels”. Those levels are specific to each network provider and are frequently set to around 50%. If every link in the network is utilized at less than 50%, each link is capable of carrying, in addition to its own traffic, the traffic of any equal-capacity link in its neighborhood should that link fail. Previous analytical work has shown that links having a utilization level of less than 50% introduce minimal queuing delays.
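The arithmetic behind the 50% figure can be illustrated with a short sketch. The function name and the utilization values are hypothetical, and the check assumes the failed link's entire load is rerouted onto a single neighbor of equal capacity.

```python
def survives_failover(own_util, failed_util):
    """True if a link can absorb all traffic of a failed equal-capacity neighbor.

    Both arguments are utilizations in the range 0..1 relative to the same capacity.
    """
    return own_util + failed_util <= 1.0


# Both links below 50%: the combined load still fits on one link.
print(survives_failover(0.45, 0.48))  # prints True
# Both links above 50%: the surviving link would be oversubscribed.
print(survives_failover(0.60, 0.55))  # prints False
```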
Network management systems typically use the above-identified measurements to determine if a link is overloaded. If the link is overloaded, it may be dropping customer packets or experiencing delays that are likely to violate service level agreements. However, conventional reports of traffic performance across periods of minutes can mask performance degradation due to short-lived events, such as micro-congestion episodes, that manifest themselves at smaller time scales. Cases may exist in which utilization over the full period appears low according to accepted standards, yet packets are dropped during smaller intervals. Within a five-minute interval, a link may at times be operating at full capacity. No technique is currently available for showing whether links experience short periods of high utilization that lead to increased delays and possibly packet drops within the five-minute intervals.
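The masking effect can be made concrete with a small sketch. The trace below is hypothetical: a link at 20% load with a 10-second burst at full capacity. One-second samples are used for readability, although real micro-congestion episodes may be much shorter. The function name and window sizes are assumptions chosen for the example.

```python
def utilization_by_window(samples, window):
    """Average consecutive groups of `window` utilization samples."""
    averages = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Hypothetical 300 s trace of per-second utilization.
trace = [0.2] * 300
trace[100:110] = [1.0] * 10  # 10 s burst at full capacity

five_min = utilization_by_window(trace, 300)
ten_sec = utilization_by_window(trace, 10)

print(f"five-minute average: {five_min[0]:.1%}")      # about 22.7%
print(f"peak ten-second average: {max(ten_sec):.1%}")  # 100.0%
```

The five-minute figure stays well below a 50% threshold even though the link was saturated for part of the interval, which is the situation the paragraph above describes.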
Accordingly, a system is needed for monitoring network performance over shorter durations within the currently measured intervals. The network management system may continue to gather data at the standard interval. Since the management system collects measurements from an entire network, the collection is asynchronous, and collection from different links occurs at different times. If the network management system were to collect data more frequently, the additional polling traffic would burden the network and the polled elements. Yet traffic counters collected every five minutes mask micro-congestion episodes that occur at time scales of milliseconds. Accordingly, a new solution is needed for identifying the time scales in various networks over which micro-congestion episodes occur and for taking the measurements in accordance with the identified time scales.