The present invention pertains to systems and methods for monitoring and determining the quality of service (QoS) in a network. More particularly, the present invention provides QoS metrics including internal and external packet loss, the detection of stalled periods, and path delay estimates.
Most current network monitoring and analysis methods can be categorized into two groups depending upon where the monitoring is performed. The first category involves monitoring the performance of the IP network at the network level, where the Internet Protocol (IP) is the protocol by which data is sent from one computer to another on the Internet. Network level monitoring is performed in public and enterprise networks. The second category, which involves monitoring the subscriber access performance, is characterized by Service Level Agreement (SLA) monitoring.
Network level monitoring is usually done by the network operator and typically includes simple statistics, e.g., event counters on router interfaces for the number of incoming and outgoing packets and bytes and the number of lost packets. One of the most important aims of network level monitoring is to identify badly performing network elements and network congestion. SLA monitoring, on the other hand, is usually performed by the subscriber to test whether the network service provider is keeping the SLA. SLA monitoring typically involves information about the amount of traffic passing the access link, the Grade of Service (GoS) of the access link, and the Quality of Service (QoS) of the access link, e.g., frame errors, bit error rate, and downtime. The access link is the link connecting the subscriber to the network of the service provider.
A recent trend among IP service providers is to offer "finer grained" services to subscribers, for example, different levels of TCP/IP service. The offered service can be loosely defined, as in the case of Differentiated Services Networks (DSN), which provide a protocol for specifying and controlling network traffic by class so that certain types of traffic get precedence. The different levels are differentiated by a combination of access data rate (either guaranteed or average), guaranteed maximum or average packet delay (e.g., less than 100 ms), and guaranteed maximum packet loss in the network (e.g., less than 1%). At present, only the so-called "best-effort" service is generally offered, which guarantees none of the above. But if, for example, the provider wants to enable voice or video (as in UMTS), there will be a need for these "better than best-effort" services; otherwise the quality would be unacceptable.
As an alternative to DSN, the offered service may be very rigid, such as in networks offering voice over IP (VOIP) or other interactive real-time services in which data delays are not tolerable. Due to developments such as these, the monitoring of subscriber perceived QoS, or user satisfaction, is gaining increasing importance for IP service providers.
Conventional monitoring methods used by network providers are not able to monitor the satisfaction of individual subscribers, because traditional methods perform tests on large traffic aggregates, which do not allow QoS to be estimated for individual applications, e.g., WWW, File Transfer Protocol (FTP), voice over IP, and streaming video or audio applications. Hence, it is not possible to accurately estimate the packet delay, delay variation, and loss rate of individual IP telephony conversations based on router interface statistics. On the other hand, different applications require different levels and types of packet service quality. Therefore, for some applications it may not be necessary to monitor an individual subscriber's satisfaction.
In conventional circuit switched networks, a simple network level measurement (e.g., the average number of occupied circuits within a circuit group, or the Call Blocking Probability) could be used to calculate and engineer the GoS for the subscribers cost-efficiently. No such analytic methods exist for IP networks. Currently, Internet service providers (ISPs) generally apply a simple engineering rule of thumb based on one or more aggregate network level QoS measurements. For example, one rule of thumb could be: if the load or packet loss on a given link exceeds a certain level (e.g., 70%) in the busy hour, then the subscriber perceived QoS has probably degraded below the acceptable level, and the link speed should be increased.
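The busy-hour rule of thumb can be expressed as a short check. This is only a sketch: it assumes the busy-hour measurements arrive as a list of load (or loss) ratios between 0 and 1 and that they are summarized by their mean; the text does not say how the busy-hour samples are aggregated, and the function name is hypothetical.

```python
# Hypothetical sketch of the busy-hour rule of thumb. The 70% level is the
# example from the text; averaging the samples is an assumption.
BUSY_HOUR_THRESHOLD = 0.70


def needs_capacity_upgrade(busy_hour_samples, threshold=BUSY_HOUR_THRESHOLD):
    """Return True if the mean busy-hour load (or loss) ratio exceeds the
    threshold, i.e. the link speed should probably be increased."""
    if not busy_hour_samples:
        raise ValueError("no busy-hour samples")
    mean_value = sum(busy_hour_samples) / len(busy_hour_samples)
    return mean_value > threshold


print(needs_capacity_upgrade([0.65, 0.72, 0.80, 0.78]))  # True (mean 0.7375)
print(needs_capacity_upgrade([0.40, 0.55, 0.60]))        # False
```

The check says nothing about which subscribers or applications are affected, which is exactly the limitation discussed in the following paragraphs.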
Such a rule-of-thumb approach can work well, and be economical, for large capacity links and in the case of best-effort services. In networks, however, where economic considerations limit the possibility of overprovisioning (e.g., IP based mobile access networks), or where better than best-effort services are offered (e.g., voice over IP, DiffServ), a better method for estimating the subscriber perceived QoS becomes desirable.
A number of conventional approaches have been used to obtain coarse estimates of user perceived QoS. Some examples of conventional approaches include NeTrueQOS, Concord, standards and drafts by the IP Performance Monitoring Working Group of the Internet Engineering Task Force (IPPM WG of the IETF), XIWT active network performance measurement architecture, and Ericsson Internet Network Monitor (INM).
A widely applied active method is based on ping delay measurements. This is done by sending special Internet Control Message Protocol (ICMP) ECHO REQUEST (ping) IP packets to a host. When the host receives such a packet, it promptly answers the sender with a response (ECHO REPLY) packet. By measuring the time it takes to receive the answer, the sending host can estimate the round-trip delay of the path between the two hosts. An advantage of ping is that the method is inexpensive to implement, since ping is available in all IP hosts and routers; only the monitoring device has to be installed. A related Ericsson product, INM, uses GPS synchronized clocks at network elements, which has the benefit that one-way delay can be measured.
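The ping measurement reduces to pairing each ECHO REQUEST with its ECHO REPLY and subtracting timestamps. The sketch below works on already recorded send and receive times (keyed by ICMP sequence number, an assumed input format) rather than opening a raw ICMP socket, which would normally require administrator privileges; the function name is hypothetical.

```python
from statistics import mean


def ping_summary(send_times, reply_times):
    """Summarize round-trip delays measured with ping.

    send_times maps each ICMP sequence number to the time (seconds) its
    ECHO REQUEST was sent; reply_times maps a sequence number to the arrival
    time of the matching ECHO REPLY. Requests with no reply count as lost.
    """
    rtts = [reply_times[seq] - sent
            for seq, sent in sorted(send_times.items())
            if seq in reply_times]
    summary = {"sent": len(send_times), "lost": len(send_times) - len(rtts)}
    if rtts:
        summary.update(min_ms=min(rtts) * 1000,
                       avg_ms=mean(rtts) * 1000,
                       max_ms=max(rtts) * 1000)
    return summary


sent = {1: 0.000, 2: 1.000, 3: 2.000}
replies = {1: 0.020, 3: 2.030}  # the reply to request 2 never arrived
print(ping_summary(sent, replies))
```

Because both timestamps are taken on the same host, no clock synchronization is needed; the price is that only the round-trip delay, not the one-way delay, is obtained.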
Active methods tend to be disadvantageous in that they add significant extra load to the network. The main problem is that active delay measurements require considerable time and resources. In order to achieve a low variance, an active delay measurement method would typically send hundreds of test packets. This drawback is exacerbated by the fact that operators tend to be most interested in delays during busy hours, when adding considerable extra load should be avoided. During low load periods the extra load is less of a concern, but there is also little interest in the delay during such periods.
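Why "hundreds of test packets" is typical can be seen from the usual sample-size estimate for a mean: to bring the half-width of a confidence interval for the mean delay down to a margin E, roughly n = (zσ/E)² samples are needed. The sketch assumes independent delay samples with known standard deviation σ and a normal approximation; the 20 ms and 2 ms figures are illustrative, not taken from the text.

```python
import math


def probes_needed(delay_std_ms, margin_ms, z=1.96):
    """Number of test packets needed so that a z-level confidence interval
    for the mean delay has half-width margin_ms, assuming independent samples
    with standard deviation delay_std_ms (normal approximation)."""
    if margin_ms <= 0:
        raise ValueError("margin must be positive")
    return math.ceil((z * delay_std_ms / margin_ms) ** 2)


# 20 ms delay spread, mean wanted within +/- 2 ms at ~95% confidence
print(probes_needed(20, 2))  # 385
```

Halving the margin quadruples the number of probes, which is why precise active measurements become costly precisely in the busy hours when they matter most.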
Another type of conventional approach involves active methods based on user emulation. Such methods use active tests (e.g., test file downloads between two hosts, as a real user would do) and measure the throughput, loss and delay. This method is advantageous in that it approximates user satisfaction more closely, since it emulates a user and the user's applications. Thus, the QoS of different applications can be estimated more accurately. One example of an active method based on user emulation is Micromuse/Netcool, which can generate active tests for a number of important applications (e.g., Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), Lightweight Directory Access Protocol (LDAP), Remote Authentication Dial-In User Service (RADIUS), etc.).
A disadvantage of active methods based on user emulation is that they require even more time than ping. Continuous use of active user emulation would result in considerable additional load on the network. Moreover, the monitored services may not be the same as the services most frequently used by subscribers.
FIG. 1 depicts a conventional system of passive performance monitoring in which packets passing a probe are observed by the probe. The architecture for implementing a passive probe typically includes a passive network interface and a packet decoding process. For example, LIBPCAP based tools (e.g., TCPDUMP) can be used to capture packets and decode protocol stacks on the fly. The conventional passive probe monitoring system then produces several simple protocol-dependent statistics, e.g., protocol distributions. Examples of conventional passive probe approaches include CORAL, NIKSUN, LIBPCAP, TCPDUMP, HP tools, network probes implementing IETF RMON 1-2, Sniffer, and RADCOM. Some of the conventional tools store the captured packets in a file and compute more complex statistics off-line (e.g., RADCOM, CORAL, Sniffer).
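A protocol distribution of the kind such probes report can be computed from nothing more than the protocol field of each captured IPv4 header. The sketch below uses the standard header layout (byte 9 carries the protocol number: 1 = ICMP, 6 = TCP, 17 = UDP) and builds its own demonstration headers instead of reading a live capture; the helper names are hypothetical.

```python
import struct
from collections import Counter

PROTOCOL_NAMES = {1: "ICMP", 6: "TCP", 17: "UDP"}


def ip_protocol(packet: bytes) -> str:
    """Decode the protocol field (byte 9) of an IPv4 header."""
    if len(packet) < 20 or packet[0] >> 4 != 4:
        raise ValueError("not a complete IPv4 header")
    number = packet[9]
    return PROTOCOL_NAMES.get(number, f"other({number})")


def protocol_distribution(packets):
    """Fraction of captured packets carrying each transport protocol."""
    counts = Counter(ip_protocol(p) for p in packets)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def make_ipv4_header(protocol: int) -> bytes:
    """Build a minimal 20-byte IPv4 header for demonstration purposes."""
    return struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20, 0, 0, 64, protocol, 0,
                       bytes([10, 0, 0, 1]), bytes([192, 0, 2, 1]))


captured = [make_ipv4_header(p) for p in (6, 6, 17, 1)]
print(protocol_distribution(captured))  # {'TCP': 0.5, 'UDP': 0.25, 'ICMP': 0.25}
```

Statistics of this kind are cheap per packet, which is why they scale, but they say nothing about the quality perceived by any particular subscriber.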
A number of U.S. patents involve conventional passive probe methods. For example, U.S. Pat. No. 5,867,483 to Ennis, Jr., et al. describes a method for monitoring access line throughput distributions over time, while displaying how the data throughput levels (e.g., 10%, 20%) evolve in time. U.S. Pat. No. 4,775,973 to Tomberlin, et al. pertains to a method for gathering the number of packets or bytes measured between end hosts in a matrix format. Other conventional data analysis methods are presented in U.S. Pat. No. 5,251,152 to Notess, and in U.S. Pat. No. 5,446,874 to Waclawsky, et al. The common disadvantage of these methods is that they do not offer explicit information about the user perceived quality.
The general problem of conventional passive methods is that they can provide only very limited QoS statistics because of scalability limitations. More accurate user perceived QoS measures may be obtained by active methods. Another disadvantage of conventional passive monitoring tools is the requirement of placing network probes on every network element.
As a network-wide monitoring system, conventional active monitoring methods would require N*N tests to be run periodically to gain end-to-end knowledge, where N is the number of network nodes (e.g., edge nodes) between which the end-to-end QoS measurement is made. Such an approach is not feasible for large networks having numerous routers and hosts. Due to this limitation, current active monitoring methods such as ping-based tools are generally only used for measurements between edge routers and a central host (monitoring host). This does not allow for precise end-to-end analysis from edge to edge.
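The scaling argument can be made concrete by counting tests. The text approximates the count as N*N; counting each ordered pair of distinct edge nodes once gives N(N-1), which the sketch below uses and compares with the N tests of a setup in which every edge node is probed only from one central monitoring host. Function names are illustrative.

```python
def full_mesh_tests(n_nodes: int) -> int:
    """Ordered pairs of distinct edge nodes: edge-to-edge tests per period,
    each direction measured separately."""
    return n_nodes * (n_nodes - 1)


def hub_tests(n_nodes: int) -> int:
    """Tests per period when each edge node is probed only from a central host."""
    return n_nodes


for n in (10, 100, 1000):
    print(n, full_mesh_tests(n), hub_tests(n))
```

For 100 edge nodes this gives 9,900 periodic tests versus 100, and the gap grows quadratically with network size.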
The real-time statistics available from passive packet capturing probes tend to be fairly simple because, on large links, it is not possible to compute statistics for each and every packet and user. For example, RADCOM can monitor very fast ATM links, but only at the virtual channel/virtual path (VC/VP) level.
Conventional systems are only able to compute more complex statistics off-line, on previously captured and stored packet traces. For example, the NIKSUN tool can measure the packet delay of a user-chosen connection between two NIKSUN probes. This is done off-line, after correlating the packet capture logs of the distant probes. Furthermore, the NIKSUN method is seriously limited in the size of network that can be handled (see WO 00/31963, published Jun. 2, 2000). Another tool, Packeteer, is a packet shaper and analysis tool in one. As a packet shaper, it has attributes of an active tool as well as of a passive analysis tool. The Packeteer tool classifies applications on the fly and reserves a service rate for mission critical flows. Passively collected statistics are available for these flows. However, due to scalability limitations, this tool works only in enterprise networks. Although both the NIKSUN and Packeteer tools offer flow related statistics, they do not offer user perceived and application dependent QoS measurements.
A disadvantage of current passive monitoring tools is the requirement for a network probe on every network element.
The present invention, which pertains to systems and methods for monitoring and determining the quality of service (QoS) in a network, overcomes the disadvantages of conventional systems, including, for example, the disadvantageous requirement for a network probe on every network element. The architecture of the present invention enables operation with as few as one or two devices at key points of the network. Later, if needed, further devices may be installed to refine or expand the system, in accordance with the present invention.
The present invention is advantageous in that it does not load the network since it involves a passive method. On the other hand, the present invention also advantageously delivers a similar quality and detail of statistics as could be achieved through use of an active method.
Instead of relying upon simple aggregate protocol statistics as conventional methods do, the present invention performs sophisticated service dependent analyses to gain a reliable picture of the QoS perceived by subscribers. By "service dependent analysis" it is meant that different applications delivering different services require specific measurements. For example, an FTP or WWW service is not sensitive to packet delays, but it is very sensitive to request-response times, aborted connections, stalled or congested periods, and Domain Name look-up delays. One embodiment of a service dependent analysis in accordance with the present invention is the TEA analysis, which is especially suited for FTP and WWW services. Another example of a service dependent analysis is RTP analysis. RTP is the protocol used for real-time conversations (e.g., voice). For traffic flows using RTP, it is important to know the delay and the delay variance, and also whether the packet loss is below the acceptable level.
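One of the FTP/WWW symptoms named above, stalled periods, can be approximated passively by looking for unusually long gaps between consecutive packets of a connection. This is a minimal sketch under stated assumptions: the 1 s threshold is an assumed value, and a fuller analysis would count a gap only while data is outstanding (sent but not yet acknowledged), so that user think time is not mistaken for a stall. The function name is hypothetical.

```python
def stalled_periods(timestamps, stall_threshold=1.0):
    """Return (start, end) pairs for gaps between consecutive packets of one
    connection that exceed stall_threshold seconds.

    The 1 s default is an assumed value, not taken from the text.
    """
    ts = sorted(timestamps)
    return [(a, b) for a, b in zip(ts, ts[1:]) if b - a > stall_threshold]


arrivals = [0.0, 0.1, 0.2, 2.5, 2.6, 5.0]
print(stalled_periods(arrivals))  # [(0.2, 2.5), (2.6, 5.0)]
```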
The method of the present invention provides QoS metrics for TCP based applications (e.g., packet losses, throughput efficiency). Analysis methods are presented to obtain measures of the true user perceived QoS. The measures also identify whether a problem originates on the inner or the outer side of the network. Instead of trying to capture each and every packet, a large, representative subset of subscribers (e.g., 10,000 subscribers at a time) is monitored. In this way, the present invention is able to maintain scalability at very high speeds.
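Locating a TCP loss on the inner or outer side of a single monitoring point can be illustrated with a common passive heuristic, shown here as an assumption rather than as the method of the invention: a sequence-number hole seen at the monitor means data was lost before reaching it (upstream of the monitor), while a retransmission of a segment that had already passed the monitor means it was lost after the monitor (downstream). With the monitor near the GGSN and data flowing from an Internet server to a mobile subscriber, "upstream" corresponds to the external network and "downstream" to the internal access network. The sketch assumes retransmissions reuse the original segment boundaries, ignores sequence-number wraparound, and counts spurious retransmissions as downstream losses.

```python
def classify_losses(segments):
    """Count loss events on each side of the monitor for one TCP direction.

    segments: (seq, length) pairs in the order observed at the monitor.
    Returns (upstream_losses, downstream_losses).
    """
    next_expected = None   # highest sequence number seen so far, plus one segment
    seen = set()           # start sequence numbers that already passed the monitor
    upstream = downstream = 0
    for seq, length in segments:
        if next_expected is None:
            next_expected = seq
        if seq in seen:
            # Already passed the monitor once: it was lost after the monitor.
            downstream += 1
        elif seq > next_expected:
            # A hole: earlier data never reached the monitor.
            upstream += 1
        # A segment filling an earlier hole is the upstream retransmission
        # and is not counted again.
        seen.add(seq)
        next_expected = max(next_expected, seq + length)
    return upstream, downstream


trace = [(0, 100), (100, 100), (300, 100),  # segment 200 missing: upstream loss
         (200, 100),                        # retransmission fills the hole
         (400, 100), (400, 100)]            # 400 seen twice: downstream loss
print(classify_losses(trace))  # (1, 1)
```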
The present method can be used efficiently in networks of hundreds or more routers with large subscriber populations, where placing monitors in all routers is not economical. One example of such networks is mobile Internet services (e.g., GPRS, UMTS). The present invention is best suited to cases where high aggregations of subscriber traffic are present and monitoring the user perceived QoS is important for the network operator. Examples include IP access networks such as IP based radio access networks (e.g., GPRS, UMTS, BSS-IP). An advantage of the proposed method is that it scales well: one device may be enough at start-up, and more devices can be installed as the network grows and more detailed information is needed.
In accordance with the exemplary embodiments of the present invention, subscribers who are currently using a particular service are sought out and focused upon in order to monitor the QoS of the service, instead of initiating conventional active measurements. Not all packets are monitored, since this would be impossible on large links. Rather, a representative subset of subscribers is chosen for monitoring. For these representative subscribers, sophisticated QoS analyses are done. In accordance with a further embodiment, the monitored subset gradually changes over time, so as to remain representative of the population of active subscribers which may change with time.
A passive monitoring architecture of the present invention enables the real-time analysis of large numbers of users in parallel, and in a scalable way. Because the architecture is scalable, monitors can be installed at relatively high aggregation points of the network. Thus, a large network of hundreds or more routers can be covered using a few devices or even a single device (e.g., placed near the GGSN in a General Packet Radio Service (GPRS) network).
Subscriber traffic is analyzed taking into account that a subscriber may in some instances use different applications simultaneously, and may therefore perceive different QoS for the different applications. Another factor considered is that applications running in parallel may disturb each other. The subscriber QoS is thus linked to the QoS of the individual applications that may be active at the same time.
The present invention is capable of identifying, for example, whether a degradation of QoS is caused by the subscriber having too many Web pages open, or whether the problem exists in the network. This is done by monitoring the traffic not only of individual applications, but also by maintaining a subscriber traffic record containing statistics of the aggregate traffic of a subscriber.
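A subscriber traffic record of this kind can be sketched as a per-subscriber aggregate that tracks the subscriber's currently active flows and total volume. The attribution rule below (many parallel flows point to self-inflicted congestion, otherwise suspect the network) and its threshold are hypothetical illustrations of how such a record could be used, not the decision logic of the invention; class and method names are likewise hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SubscriberTrafficRecord:
    """Aggregate traffic of one subscriber across all of its applications."""
    active_flows: set = field(default_factory=set)
    total_bytes: int = 0


class SubscriberTable:
    def __init__(self, max_parallel_flows: int = 8):
        # Hypothetical threshold: above this many parallel flows, poor
        # per-flow QoS is attributed to the subscriber's own load.
        self.max_parallel_flows = max_parallel_flows
        self.records = {}

    def _record(self, subscriber):
        return self.records.setdefault(subscriber, SubscriberTrafficRecord())

    def flow_started(self, subscriber, flow_id):
        self._record(subscriber).active_flows.add(flow_id)

    def flow_ended(self, subscriber, flow_id):
        self._record(subscriber).active_flows.discard(flow_id)

    def add_bytes(self, subscriber, n_bytes):
        self._record(subscriber).total_bytes += n_bytes

    def likely_degradation_source(self, subscriber):
        """Crude guess at where a QoS problem for this subscriber comes from."""
        if len(self._record(subscriber).active_flows) > self.max_parallel_flows:
            return "subscriber load"
        return "network"


table = SubscriberTable(max_parallel_flows=3)
for i in range(5):  # five Web connections open at once
    table.flow_started("10.0.0.7", ("10.0.0.7", "192.0.2.10", 40000 + i, 80))
print(table.likely_degradation_source("10.0.0.7"))  # subscriber load
```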
In accordance with an exemplary embodiment of the present invention, a method is provided for end-to-end QoS metrics for TCP connections based on the observation of packet flows at a single monitoring point. These QoS metrics include, for example, packet loss internally and externally to the monitoring point, detection of stalled periods and estimation of path delay.
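A path delay estimate from a single monitoring point can be illustrated with a common passive technique, offered here as an assumption rather than as the method of the invention: the TCP three-way handshake passes the monitor three times, so the time from SYN to SYN-ACK is the round trip between the monitor and the server, and the time from SYN-ACK to the final ACK is the round trip between the monitor and the subscriber. The sketch assumes a connection opened by the subscriber and no retransmitted handshake packets (a retransmitted SYN would inflate the first term); the function name is hypothetical.

```python
def split_handshake_rtt(t_syn, t_synack, t_ack):
    """Split the end-to-end round-trip time of a TCP connection at the monitor.

    Arguments are the times (seconds) at which the monitor observed the SYN
    (from the subscriber), the SYN-ACK (from the server) and the final ACK
    (from the subscriber). Returns (server_side_rtt, subscriber_side_rtt).
    """
    if not t_syn <= t_synack <= t_ack:
        raise ValueError("handshake timestamps out of order")
    server_side = t_synack - t_syn       # monitor -> server -> monitor
    subscriber_side = t_ack - t_synack   # monitor -> subscriber -> monitor
    return server_side, subscriber_side


server_rtt, subscriber_rtt = split_handshake_rtt(10.000, 10.080, 10.380)
print(f"server side {server_rtt * 1000:.0f} ms, "
      f"subscriber side {subscriber_rtt * 1000:.0f} ms")
```

The two parts add up to the end-to-end round-trip time, so a single monitor can tell whether delay is accumulating inside or outside the managed network.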
For streaming and real-time applications, delay variation and packet loss are estimated for the paths between the monitoring point and the end hosts. The results of the analyses may identify the source of a problem. In this way the present invention may answer, for example, whether the source of a problem is inside the managed network or outside it, in another ISP's area.
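Delay variation for an RTP stream can be estimated at the monitoring point with the interarrival jitter estimator standardized for RTP in RFC 3550 (section 6.4.1): for consecutive packets, D is the change in relative transit time (arrival time in RTP clock units minus the RTP timestamp), and the estimate is smoothed as J = J + (|D| - J)/16. It is shown as a standard technique, not necessarily the estimator used by the invention; the 8000 Hz clock rate is an assumption typical of narrowband voice, and the function name is hypothetical.

```python
def rtp_jitter(packets, clock_rate=8000):
    """Interarrival jitter in the style of RFC 3550, section 6.4.1.

    packets: (arrival_time_s, rtp_timestamp) pairs of one RTP stream in
    arrival order. clock_rate is the RTP timestamp clock in Hz (8000 Hz is
    an assumption). Returns the jitter estimate in milliseconds.
    """
    jitter = 0.0          # kept in RTP timestamp units, as in the RFC
    prev_transit = None
    for arrival, rtp_ts in packets:
        transit = arrival * clock_rate - rtp_ts   # relative transit time
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0         # smoothing with gain 1/16
        prev_transit = transit
    return jitter / clock_rate * 1000.0


# 20 ms voice packets; the second one arrives 10 ms late
print(rtp_jitter([(0.000, 0), (0.030, 160)]))
```

In this example D is 80 timestamp units, so the estimate becomes 5 units, or 0.625 ms; a perfectly paced stream keeps the estimate near zero.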
A method is presented for monitoring how efficiently the subscriber SLA is met. The method, called Throughput Efficiency Analysis (TEA), can be used to detect SLA problems far from the actual access point for thousands of subscribers in parallel. Graphical methods supporting the use of Throughput Efficiency Analysis are presented, including the distribution of subscriber TEA and the evaluation of internal/external network TEA.
Exemplary embodiments of the present invention are drawn to a method of monitoring subscriber QoS in a network. In accordance with one exemplary embodiment, a monitor is installed in the network so as to be in communication with inbound and outbound traffic. The monitor may be, for example, a probe in the network or, more specifically, a passive network interface. A representative subset of subscribers to be monitored is then selected, for example, by applying inbound and outbound traffic to a filtering function. Packet data received at the monitor is preprocessed to identify and store the accepted packets, that is, the packets from the subscribers being monitored, which belong to the representative subset. Finally, a microflow record may be provided which includes statistics corresponding to the subscriber QoS of the network. The microflow record may include values for a subscriber IP address, a destination IP address, a subscriber port, and a destination port.
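The preprocessing step and the microflow record can be sketched as a table keyed by the four values named above. In this sketch the statistics kept per microflow (packet and byte counts per direction, first and last packet times) are illustrative assumptions, the filtering function is passed in as a plain predicate on the subscriber address, and the caller is assumed to orient every key as (subscriber IP, destination IP, subscriber port, destination port) regardless of packet direction. Names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# (subscriber IP, destination IP, subscriber port, destination port)
FlowKey = Tuple[str, str, int, int]


@dataclass
class MicroflowRecord:
    packets_out: int = 0
    packets_in: int = 0
    bytes_out: int = 0
    bytes_in: int = 0
    first_seen: Optional[float] = None
    last_seen: Optional[float] = None


def account_packet(table: Dict[FlowKey, MicroflowRecord],
                   is_monitored: Callable[[str], bool],
                   key: FlowKey, length: int, timestamp: float,
                   outbound: bool) -> bool:
    """Update the microflow table with one packet.

    Returns False and stores nothing if the subscriber is not in the
    monitored subset. outbound=True means subscriber -> destination.
    """
    if not is_monitored(key[0]):
        return False
    rec = table.setdefault(key, MicroflowRecord())
    if outbound:
        rec.packets_out += 1
        rec.bytes_out += length
    else:
        rec.packets_in += 1
        rec.bytes_in += length
    if rec.first_seen is None:
        rec.first_seen = timestamp
    rec.last_seen = timestamp
    return True


flows: Dict[FlowKey, MicroflowRecord] = {}
monitored = {"10.0.0.1"}.__contains__   # stand-in for the filtering function
key = ("10.0.0.1", "192.0.2.10", 40000, 80)
account_packet(flows, monitored, key, 60, 0.00, outbound=True)
account_packet(flows, monitored, key, 1500, 0.05, outbound=False)
account_packet(flows, monitored, ("10.0.0.2", "192.0.2.10", 40001, 80),
               60, 0.06, outbound=True)  # not in the subset: dropped
print(flows[key])
```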
In accordance with one exemplary embodiment, the filtering function may be a mixing function in which a subscriber IP address is shifted to produce a shifted subscriber IP address. The shifted subscriber IP address is then compared with a value proportional to a tuning parameter.
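The mixing function can be sketched as follows. The text specifies only that the subscriber IP address is shifted and compared with a value proportional to a tuning parameter; this sketch assumes a circular left shift of the 32-bit address by 16 bits and a threshold of fraction × 2³², where fraction (between 0 and 1) plays the role of the tuning parameter. Both choices are assumptions, as is the function name.

```python
import ipaddress


def rotate_left32(value: int, shift: int) -> int:
    """Circular left shift of a 32-bit integer."""
    return ((value << shift) | (value >> (32 - shift))) & 0xFFFFFFFF


def is_monitored(subscriber_ip: str, fraction: float, shift: int = 16) -> bool:
    """Hypothetical mixing filter: accept the subscriber if its shifted
    address falls below a threshold proportional to the tuning parameter."""
    if not 0 < shift < 32:
        raise ValueError("shift must be between 1 and 31")
    address = int(ipaddress.IPv4Address(subscriber_ip))
    return rotate_left32(address, shift) < fraction * 2 ** 32


print(is_monitored("10.0.0.1", 0.5))    # host bits 0x0001 move to the top: True
print(is_monitored("10.0.255.1", 0.5))  # host bits 0xFF01 move to the top: False
```

Rotating moves the host part of the address into the high-order bits, so the decision depends mainly on the host number rather than on the network prefix, and about `fraction` of a fully populated subnet is selected. Because the accepted host numbers form a contiguous range, sequential address assignment could bias the sample; a stronger mixing function such as a hash could be substituted under the same comparison.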
In accordance with another exemplary aspect of the present invention, a subscriber traffic record of all applications running for a particular subscriber may be maintained. In this way, a source of QoS degradation for a particular subscriber may be determined based upon said subscriber traffic record.
ACK: acknowledgment packets
ATM: Asynchronous Transfer Mode
DNS: Domain Name System
DSN: Differentiated Services Networks
FIN: A bit indicating the last packet in a successful TCP connection
FTP: File Transfer Protocol
GGSN: Gateway GPRS Support Node; A router node in a GPRS network
GPRS: General Packet Radio Service
GPS: Global Positioning System
GoS: Grade of Service
HTTP: Hypertext Transfer Protocol
ICMP: Internet Control Message Protocol
IETF: Internet Engineering Task Force
IPPM WG: IP Performance Monitoring Working Group; An IETF working group developing standards for performance monitoring for the Internet.
INM: Internet Network Monitor
IP: Internet protocol
ISPs: Internet service providers
LAN: Local area network.
LDAP: Lightweight Directory Access Protocol
QoS: Quality of Service
RADIUS: Remote Authentication Dial-In User Service
RST: TCP Reset.
RTCP: Real Time Control Protocol
RTP: Real-time Transport Protocol
SLA: Service Level Agreement
TCP: Transmission Control Protocol
TCP/IP: Transmission Control Protocol/Internet Protocol
TEA: Throughput Efficiency Analysis
UDP: User Datagram Protocol
VC/VP: Virtual Channel/Virtual Path
VOIP: Voice Over IP
WWW: World Wide Web
XIWT: Cross Industry Working Team; One working group of XIWT addresses problems related to Internet performance analysis.