1. Field of the Invention
Example embodiments of the present invention generally relate to a method and apparatus for measuring voice quality in a VoIP network.
2. Description of Related Art
Distributed processing networks for live voice communications between network nodes use Voice over IP (VoIP) technology. In a VoIP system, after the speech is digitized, the digitized speech is divided into packets. Each packet includes a header and a data payload of one to several frames of encoded speech. Distributed processing networks for delivering the packets to desired endpoints are typically designed to provide a Best Effort (BE) single service model that does not discriminate in packet delivery between services and does not control service access or quality.
Quality of Service (QoS) architectures have been developed for BE environments to provide guaranteed transmission characteristics end-to-end, such as available bandwidth, maximum end-to-end delay, maximum end-to-end delay variation (jitter), and packet/cell loss levels, to provide continuous data streams suitable for real-time phone calls and video conferencing. These QoS architectures include protocols such as the Resource ReSerVation Protocol (RSVP) and the Real-time Transport Protocol (RTP).
RSVP is a signaling protocol that guarantees receivers a requested end-to-end QoS. RSVP serves as an Internet signaling protocol through the transmission of QoS parameters. Under RSVP, an endpoint negotiates with the network to allocate or reserve protected resources for traffic that the endpoint will generate or receive. The two messages that perform the reservation request and installation are the Path and Resv messages. Robustness is achieved through maintaining a soft state network by transmitting periodic refresh messages to maintain a reservation and path state along the reservation path. If the intermediate nodes do not receive the refresh message, the reservation will time out and be deleted.
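The soft-state behavior described above can be illustrated with a minimal sketch. It is not an RSVP implementation: the class name SoftStateTable, the 90-second lifetime, and the session identifiers are all hypothetical, chosen only to show how a reservation that stops being refreshed times out and is deleted.

```python
class SoftStateTable:
    """Reservation state that expires unless refreshed (hypothetical sketch)."""

    def __init__(self, lifetime_s):
        self.lifetime_s = lifetime_s      # assumed timeout, not an RSVP default
        self._last_refresh = {}           # session id -> time of last refresh

    def refresh(self, session_id, now):
        # Installing a reservation and refreshing it are the same operation.
        self._last_refresh[session_id] = now

    def expire(self, now):
        # Delete every reservation whose refresh is overdue and report them.
        stale = [s for s, t in self._last_refresh.items()
                 if now - t > self.lifetime_s]
        for s in stale:
            del self._last_refresh[s]
        return stale

    def active(self):
        return sorted(self._last_refresh)


table = SoftStateTable(lifetime_s=90.0)
table.refresh("call-1", now=0.0)
table.refresh("call-2", now=0.0)
table.refresh("call-1", now=60.0)   # call-1 keeps refreshing; call-2 does not
expired = table.expire(now=100.0)
print("expired:", expired, "active:", table.active())
```

In this sketch only the reservation that missed its refresh is removed, which mirrors the time-out behavior at an intermediate node.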
RTP is a voice bearer channel transfer protocol. RTP neither guarantees a QoS nor provides for resource reservations. RTP runs on the transport layer of the Open Systems Interconnection (OSI) model and defines a session by two components, namely its profile and payload format, where the payload is the data being transmitted. The payload format specifies the format of the data within the RTP packet such as encoding and compression schemes. RTP functions include loss detection for quality estimation and rate adaptation, sequencing of data, intra- and inter-media synchronization, session identification using a session id, source identification using a synchronization source id or SSRC, and basic membership information.
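As an illustration of the sequencing and source identification functions, the following sketch unpacks the 12-byte fixed RTP header, whose layout is defined in RFC 3550, and estimates loss from gaps in the sequence numbers. The function names parse_rtp_header and count_lost and the sample field values are assumptions made for this example, and the loss estimate deliberately ignores sequence-number wraparound.

```python
import struct


def parse_rtp_header(packet: bytes) -> dict:
    """Unpack the 12-byte fixed RTP header (layout per RFC 3550)."""
    if len(packet) < 12:
        raise ValueError("RTP fixed header is 12 bytes")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "csrc_count": b0 & 0x0F,
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,
        "sequence_number": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


def count_lost(sequence_numbers):
    """Estimate lost packets from gaps; assumes no 16-bit wraparound."""
    expected = max(sequence_numbers) - min(sequence_numbers) + 1
    return expected - len(set(sequence_numbers))


# Hypothetical packet: version 2, payload type 0, arbitrary SSRC.
sample = struct.pack("!BBHII", 0x80, 0, 1000, 160, 0x1234ABCD) + b"\x00" * 20
header = parse_rtp_header(sample)
print(header["sequence_number"], hex(header["ssrc"]))
print(count_lost([1000, 1001, 1003, 1004]))
```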
The RTP Control Protocol (RTCP), a companion protocol to RTP, is used by applications to monitor the delivery of RTP streams. Media packets are transmitted between endpoints during a session according to RTP while additional performance information governing the communication link (e.g., key statistics about the media packets being sent and received by each endpoint such as jitter, packet loss, round-trip time, etc.) is collected by the endpoints and transmitted to a session monitor according to RTCP. The session monitor can be, for example, VoIP Monitoring Manager™ or VMon™ by Avaya, Inc.
Under either the RSVP or RTP protocols, VoIP introduces a range of QoS problems which were not previously significant or, in some cases, even encountered in circuit-switched networks. Voice telephony depends upon reliable, low latency, real-time delivery of audio data. In VoIP, values for latency, packet loss, and jitter can increase substantially, particularly during periods of heavy network traffic. This can cause a user to experience a much poorer quality of communication (e.g., audio or video distortion, unacceptable levels of synchronization between audio and video streams, etc.) than would be experienced if the call were made by a traditional circuit-switched telephony network.
FIG. 1 illustrates building blocks of a conventional VoIP system 10. These building blocks are divided into three categories. The topmost block 15, known as the network layer, includes VoIP protocols and voice coder standards. These are standard components and are used by all VoIP manufacturers; thus, components in the network layer cannot be a differentiator for voice quality in a VoIP system. The network performance block 20 and audio processing block 25, however, are not standard in a typical VoIP system. The design and implementation of blocks 20 and 25 along with the platform could make a difference in the voice quality of a VoIP system.
FIG. 2 is a block diagram to illustrate the receive and transmit processing blocks of a conventional VoIP endpoint such as an IP-phone. Each block represents an algorithm or series of functions. The receive path includes algorithms related to the jitter buffer, packet loss concealment (PLC), comfort noise generation, automatic gain control (AGC) and audio equalization, each of which plays a role in VoIP receive voice quality. The transmit path includes acoustic echo cancellation (AEC), dynamic compression, mic equalization/expansion, voice activity detection (VAD), silence suppression (SS), double talk detection and noise reduction algorithms. Each of these algorithms plays a role in VoIP transmit voice quality.
Delivering and maintaining optimal voice quality is desired by all designers, network engineers, and manufacturers of components in VoIP systems. Hybrid systems, trans-coding and inherent delays in the VoIP system have introduced many challenges in attempting to maintain circuit-switch-type voice quality. For example, voice quality can become degraded due to a number of factors, including VoIP CODECS (inherently, there is at least some quality loss), distortion, network impairments such as packet loss and jitter, packet delay or latency, and background noise.
Distortion in either the analog or digital path can play a significant role in voice quality degradation. In the analog domain, distortion can introduce non-linearity in the echo path, which can cause clipping and poor echo performance. Distortion in the analog and digital paths could also affect speech coder performance.
Packet loss causes degradation in voice quality and typically occurs in bursts of 20-30% loss lasting 1-3 seconds. This may mean that the average packet loss rate for a call appears low, although the user reports call quality problems. Packet loss can occur for a variety of reasons including link failure, high levels of congestion that lead to buffer overflow in routers, Random Early Detection (RED), Ethernet problems, and the occasional misrouted packet. For example, packets can be lost when they encounter a queue in a router which is completely full, or when they are subject to policy-based discard, e.g., they are out-of-profile of their SLA.
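The gap between average and burst loss can be made concrete with a short sketch. The call length, packet rate (50 packets per second, i.e., 20 ms packets), burst position, and the function name loss_rates are all assumptions chosen for illustration; only the 25% burst level and 2-second duration are taken from the ranges given above.

```python
def loss_rates(received_flags, window):
    """Return (average loss, worst loss over consecutive fixed windows).

    received_flags holds one boolean per expected packet, in send order.
    """
    total = len(received_flags)
    average = received_flags.count(False) / total
    worst = 0.0
    for start in range(0, total, window):
        chunk = received_flags[start:start + window]
        worst = max(worst, chunk.count(False) / len(chunk))
    return average, worst


# Assumed scenario: 3-minute call at 50 packets/s = 9000 packets,
# with one 2-second burst (100 packets) losing every fourth packet.
flags = [True] * 9000
for i in range(4000, 4100, 4):
    flags[i] = False

average, worst = loss_rates(flags, window=100)
print(f"average loss: {average:.2%}, worst 2-second window: {worst:.0%}")
```

Under these assumptions the call-wide average stays well below 1% even though one 2-second window loses a quarter of its packets, which is why a per-call average can hide a problem the user clearly hears.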
Jitter refers to the variability of latency (end-to-end delay variation) in a network. Jitter greater than approximately 50 milliseconds can result in both increased latency and packet loss. Excessive jitter can result from congestion on LANs, access links, low bandwidth WAN links/trunks or the use of load sharing. For example, packets accumulate jitter when encountering varying router queue occupancies on their path through the network. As a result, these packets incur a different overall delay than that of their predecessors or successors. This skews the timing relationships between successive packets.
Jitter levels under 100 milliseconds may be acceptable if the jitter buffer size in the endpoint is increased. For jitter levels exceeding 100 milliseconds, increasing the jitter buffer size to avoid packet discards introduces significant delay and causes conversational problems.
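One common way to quantify this skew is the smoothed interarrival jitter estimator defined for RTP receivers in RFC 3550, sketched below. This is offered as an illustration, not as the measure used by any particular system discussed here. The function name and the sample timings are hypothetical, and times are given in milliseconds for readability, whereas RFC 3550 works in RTP timestamp units.

```python
def interarrival_jitter(send_times_ms, recv_times_ms):
    """Smoothed interarrival jitter, following the RFC 3550 estimator.

    For consecutive packets, D is the change in transit time:
        D = (R[i] - R[i-1]) - (S[i] - S[i-1])
    and the running estimate is updated as J += (|D| - J) / 16.
    """
    jitter = 0.0
    for i in range(1, len(send_times_ms)):
        d = ((recv_times_ms[i] - recv_times_ms[i - 1])
             - (send_times_ms[i] - send_times_ms[i - 1]))
        jitter += (abs(d) - jitter) / 16.0
    return jitter


send = [0, 20, 40, 60]                 # packets sent every 20 ms (assumed)
steady = interarrival_jitter(send, [50, 70, 90, 110])    # constant delay
queued = interarrival_jitter(send, [50, 70, 130, 130])   # one packet held up
print(steady, queued)
```

A constant network delay yields zero jitter regardless of how large the delay is; only the variation between successive packets contributes.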
A jitter buffer in the endpoint temporarily stores arriving packets in order to minimize delay variations. If packets arrive too late then they are discarded. A jitter buffer may be mis-configured and/or could be either too large or too small. If a jitter buffer is too small, then an excessive number of packets may be discarded, which can lead to call quality degradation. If a jitter buffer is too large, then the additional delay can lead to conversational difficulty. A typical jitter buffer configuration is 30 milliseconds to 50 milliseconds in size. In the case of an adaptive jitter buffer, the maximum buffer size can be set to 100-200 milliseconds.
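The trade-off between discards and added delay can be sketched with a simplified fixed-size buffer. The model below is an assumption made for illustration: playout starts a fixed buffer_ms after the first packet arrives, each later packet is due at the same offset from its send time, and any packet arriving after its deadline is discarded. The function name and the timings are hypothetical.

```python
def simulate_jitter_buffer(send_ms, recv_ms, buffer_ms):
    """Return (played, discarded) packet indices for a fixed-size buffer."""
    playout_start = recv_ms[0] + buffer_ms
    played, discarded = [], []
    for i, (sent, received) in enumerate(zip(send_ms, recv_ms)):
        # Each packet is due at the same offset from the first as when sent.
        deadline = playout_start + (sent - send_ms[0])
        if received <= deadline:
            played.append(i)
        else:
            discarded.append(i)
    return played, discarded


send = [0, 20, 40, 60, 80]        # 20 ms packets (assumed)
recv = [30, 55, 110, 95, 112]     # third packet is delayed in the network

small = simulate_jitter_buffer(send, recv, buffer_ms=30)
large = simulate_jitter_buffer(send, recv, buffer_ms=50)
print("30 ms buffer:", small)
print("50 ms buffer:", large)
```

With these timings the 30 ms buffer discards the delayed packet, while the 50 ms buffer plays every packet at the cost of 20 ms of additional delay on the whole stream.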
Latency, or packet delay, is a measure of the delay in a call. Both the round-trip delay between when information leaves point A and when a response is returned from point B, and the one-way delay between when something was spoken and when it was heard, are measured. The largest contributor to latency is typically network transmission delay. High levels of delay (generally over 200 milliseconds round trip) can cause problems with conversational interaction. This may be due to the routing of the IP stream, mis-configuration of the jitter buffer (i.e., too large) at either end of the connection or high levels of jitter which are causing an adaptive jitter buffer to grow excessively large. For example, packets are delayed when they are processed (e.g., through encryption gateways), when they encounter non-empty queues in routing devices, and in play-out buffers (jitter removal). For delay exceeding 300 milliseconds, users may experience annoying talk-over effects.
The transmission of background noise is a critical parameter to control for naturalness of a conversation. Therefore, it is desirable to have a proper signal to noise ratio (SNR) for intelligible conversation and echo control.
Accordingly, one concern in a VoIP system is to provide a high voice quality, which necessitates an accurate measurement of voice quality. Moreover, a new challenge facing designers and network engineers of VoIP systems is to determine an accurate way to measure voice quality during live calls. Conventional voice quality measuring techniques fall into four primary categories: subjective testing, P.862 (PESQ) testing, non-intrusive testing, and testing via the E-Model (ITU standard G.107).
Subjective testing is widely considered the most “authentic” method of measuring voice quality. However, subjective testing is a specialized and costly process. This approach is typically used by CODEC designers and equipment manufacturers to validate VoIP technology prior to deployment. A Mean Opinion Score (MOS) ranges from 1 for an unacceptable call to 5 for an excellent call. A typical range for VoIP would be from 3.5 to 4.2. A score of 5 is not obtainable as the VoIP codecs inherently introduce some amount of quality loss.
Perceptual Evaluation of Speech Quality (PESQ) is a mechanism for automated assessment of the speech quality enjoyed by the user of a telephony system. It is standardized as ITU-T recommendation P.862. P.862 (PESQ) is used to analyze the distortion that has occurred on test voice signals that have been transmitted through a VoIP network, and to produce an estimated MOS score. These algorithms are implemented in test equipment available from a number of companies. The advantage of the P.862 algorithm is that the algorithm measures the effects of many different impairments and their interactions. A disadvantage is that P.862 requires a call to be set up through the network for each test.
Non-intrusive or passive monitoring testing examines a stream of voice traffic and produces a transmission quality metric that can be used to estimate a MOS score. This has an advantage in that all calls in a network can be monitored without additional network overhead. However, a disadvantage of passive monitoring testing is that the effects of some impairments are not incorporated into the testing algorithm. Conventional non-intrusive or passive monitoring testing includes VQmon™, P.563 and E-model techniques.
VQmon™, a product of Telchemy Incorporated of Duluth, Ga., is a multi-platform, multi-vendor technology for measuring the quality of IP services and providing diagnostic data for problem resolution. VQmon/EP is designed for integration into endpoints such as IP Phones and IP Gateways. VQmon/SA is designed for use in stream analysis and is typically used as the core VoIP analysis software in analyzers, probes, routers, and service level agreement (SLA) monitors. VQmon provides passive monitoring through observation of the RTP stream and incorporates effects such as packet loss burstiness. This produces an R-factor, which is a metric that uses a formula to take into account both user perceptions and the cumulative effect of equipment impairments to arrive at a numeric expression of voice quality. The R-factor can be used to estimate a MOS score. VQmon can be embedded into VoIP Gateways and other end systems with virtually no impact on equipment cost or network traffic.
P.563 is a passive monitoring algorithm that analyses the voice stream in order to estimate call quality scores. As the P.563 algorithm is much more computationally complex (over 1000×) than VQmon, it is typically only used on a sampling basis. P.563 is one of a number of “analog” signal analysis tools used to measure signal distortion levels and identify problems affecting voice quality. P.563 produces inaccurate results for individual calls and is typically used for producing estimates of service quality when aggregated over many calls.
The E Model was originally developed within the European Telecommunications Standards Institute (ETSI) as a transmission planning tool, described in ETSI technical report ETR 250, and then standardized by the ITU as G.107. The objective of the model was to determine a quality rating that incorporated the “mouth to ear” characteristics of a speech path. The range of the R factor is nominally from 0-100; however, values below 50 are generally unacceptable, and typical telephone connections do not get above 94, for a typical range of 50-94. For wideband CODECs the R-factor may increase above 100, and is typically 110 for an unimpaired connection. The basic model is:
R = Ro − Is − Id − Ie + A + W  Equation (1)
In Equation (1), Ro is a base factor determined from noise levels, loudness, etc.; Is represents impairments occurring simultaneously with speech; Id represents impairments that are delayed with respect to speech; and Ie represents an “equipment impairment factor”. The variable A is an “advantage factor” and the variable W is a wideband correction factor.
The equipment impairment factor, Ie, is typically used to represent the effects of Voice over IP. For example, assuming default values for the other factors, the R-factor for an “ideal” G.729A connection with no packet loss, jitter or jitter buffer delay would be R=Ro−Ie=94−11=83. The advantage factor, A, is used to represent the convenience to the user of being able to make the phone call; e.g., a cell phone is convenient to use, and users are therefore more forgiving of quality.
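The G.729A example can be reproduced, and the resulting R-factor converted to an estimated MOS, with the following sketch. The function r_factor simply evaluates Equation (1) using the simplified default Ro=94 from the example above. The function r_to_mos uses the R-to-MOS mapping given in ITU-T G.107, which is not stated in this description and is included here as an assumption about how the estimate would typically be made; note that this mapping saturates at 4.5. Both function names are hypothetical.

```python
def r_factor(ro=94.0, i_s=0.0, i_d=0.0, i_e=0.0, a=0.0, w=0.0):
    """Equation (1): R = Ro - Is - Id - Ie + A + W.

    The default Ro of 94 follows the simplified example in the text.
    """
    return ro - i_s - i_d - i_e + a + w


def r_to_mos(r):
    """Estimated MOS from R, using the mapping given in ITU-T G.107."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


r = r_factor(i_e=11.0)      # "ideal" G.729A connection, Ie = 11
mos = r_to_mos(r)
print(f"R = {r:.0f}, estimated MOS = {mos:.2f}")
```

Under these assumptions R=83 maps to an estimated MOS of about 4.13, which falls inside the 3.5 to 4.2 range described earlier as typical for VoIP.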
The conventional voice quality measuring techniques do not measure voice quality during a live call. The conventional voice quality measurements are performed prior to installation, made as an estimate of aggregated calls, or are calculated by monitoring a stream of voice traffic to determine an R-factor that can be used to estimate a MOS score. This is due in part to the fact that measuring voice quality during live calls is inherently complicated. A desired voice quality measurement algorithm has to account for the dynamic nature of the speech signal, the dynamic nature of the acoustic environment, and the dynamic nature of the VoIP channel.