The present invention generally relates to voice communication over packet networks, and more specifically relates to a method and apparatus for improving voice quality in voice-over-packet networks.
A typical architecture of a voice-over-packet system (focusing only on the voice communication part) is illustrated in FIG. 1. The voice encoders/decoders 10 and 12 shown in FIG. 1 are those most commonly used under present ITU-T recommendations. However, such details may change over time, and are given in FIG. 1 for illustration purposes only. Many readily available sources provide a detailed description of the various components of a voice-over-packet system.
Due to the inherent nature of packet-based data communication networks, although the voice-over-packet communication device sends packets to the other end at equal time intervals, the packets do not arrive from the network at equal time intervals. Several phenomena cause packets to arrive at the receive side at irregular times. Network behavior can change the time interval between two packets that arrive at the receive side. The variation in packet arrival time caused by the instantaneous load and behavior of the network is called “jitter.” Sometimes, depending on the network protocol used and the network conditions, the packets may even arrive in a sequence that is different from the sequence in which they were sent.
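By way of illustration only, the irregular arrival described above can be expressed as the deviation of each inter-arrival gap from the nominal packet period. The following is a minimal sketch; the function name `interarrival_deviation` and the sample arrival times are hypothetical and are not taken from any particular system.

```python
def interarrival_deviation(arrival_times, packet_period):
    """Return, for each consecutive pair of packets, how far the
    observed arrival gap deviates from the nominal packet period.

    A value of 0 means the packet arrived exactly one period after
    its predecessor; positive means late, negative means early.
    """
    return [
        (later - earlier) - packet_period
        for earlier, later in zip(arrival_times, arrival_times[1:])
    ]


# Hypothetical arrival times in milliseconds for packets sent every 20 ms.
arrivals_ms = [0, 25, 38, 62, 80]
deviations = interarrival_deviation(arrivals_ms, 20)
print(deviations)
```

In this example the long-term spacing is still 20 ms on average, even though no individual gap equals 20 ms.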
Another phenomenon that affects packet arrival is the difference between the transmit side clock and the receive side clock. This difference may result in too many or too few packets being received by the receive side. Thus, a clock recovery mechanism is needed to resynchronize the transmit side with the receive side. This patent offers an improvement in solving the jitter problem and the clock recovery problem.
Sophisticated voice-over-network systems use voice activity detection to detect when no voice information is being sent, and curtail sending packets if only background noise exists. However, most systems alert the receive side that a period of no packets is coming by sending a special information packet, SID (Silence Indicator), that conveys the transmit background noise characteristics to the receive side.
Jitter Problem: As shown in FIG. 1, a voice-over packet system typically includes a network jitter compensator or jitter buffer 14. The network jitter compensator 14 temporarily holds the packets received from the network, and, if necessary, makes sure that they are in sequence. A typical architecture of a jitter compensator is shown in FIG. 2.
Although the long-term average packet arrival rate from a network is generally constant, over short periods of time, packets typically arrive from the network at random intervals. These packets are placed in a play-out queue 16 as shown in FIG. 2, and are scheduled to be decoded and played out after a pre-determined amount of delay. For example, the packets may be scheduled for play-out after a delay of exactly M packet periods. Once the play-out begins, every in-sequence packet is played out consecutively after the current packet. If this “nominal delay” parameter is set to a very short interval, it is possible that a packet will arrive very late from the network, and the voice decoder will run out of packets. In that case, even if the packet arrives later, it is still effectively lost, because the window of opportunity for play-out has been lost. Such a situation may cause annoying distortions and degradation of the voice quality. However, if the play-out queue introduces too much delay, it introduces a corresponding delay in the voice play-out that may be perceptible and annoying to the human participants in the voice conversation.
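The trade-off described above may be illustrated by a minimal sketch that counts how many packets miss their play-out slot for a given nominal delay of M packet periods. The sketch assumes that play-out of the first packet begins M packet periods after it arrives and that packet i is due i packet periods later; the function name and the arrival times are hypothetical.

```python
def count_late_packets(arrival_times, packet_period, nominal_delay_packets):
    """Count packets that arrive after their scheduled play-out time.

    arrival_times[i] is the arrival time of the packet with sequence
    number i. Play-out of packet 0 starts nominal_delay_packets periods
    after it arrives; packet i is due i periods after that. A packet
    arriving after its due time is treated as lost.
    """
    start = arrival_times[0] + nominal_delay_packets * packet_period
    late = 0
    for sequence_number, arrival in enumerate(arrival_times):
        deadline = start + sequence_number * packet_period
        if arrival > deadline:
            late += 1
    return late


# Hypothetical arrival times (ms) for packets sent every 20 ms.
# Packet 3 is badly delayed and arrives after packet 4.
arrivals_ms = [0, 25, 38, 95, 81, 100]

for m in (0, 1, 2):
    print(m, count_late_packets(arrivals_ms, 20, m))
```

With these hypothetical values, a larger M loses fewer packets at the cost of a longer play-out delay, which is the tension a jitter buffer algorithm has to balance.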
Since network conditions may change over time, the typical objective of a jitter buffer algorithm is to constantly monitor the network conditions, and to adjust the “nominal delay” to the minimum possible value, while ensuring that the packet loss due to network jitter is kept to a minimum. The algorithm that monitors the network characteristics adjusts the nominal delay from time to time.
When the nominal delay is adjusted, the nominal delay may either be increased or reduced from its previous value. If the nominal delay is to be increased from its previous value, a “lost” packet is introduced in the packet stream that is received by the voice decoder (see FIG. 1). As a result, the voice decoder either plays silence for that short period, or attempts to hide the effect of packet loss by artificially generating some voice samples. If the nominal delay is to be reduced, typically more than one packet in the play-out queue is played out at the same time, and one of these packets is discarded, thus causing a discontinuity in the voice play-out.
Therefore, every time the “nominal delay” of the jitter buffer is modified, a discontinuity is introduced into the voice waveform, thereby degrading the voice quality. Another disadvantage of prior art implementations is that, since adaptation of the “nominal delay” causes a degradation in the voice quality, the adaptation algorithms that are presently being used are very conservative. These algorithms tend to assign nominal delays that are longer, and change their values less frequently, thereby increasing the overall system voice delay.
Clock Recovery Problem: FIG. 3 illustrates a typical voice-over-network model, wherein a telephone 20 in system A establishes a link with a telephone 22 in system B, via a network. The standard sampling rate of telephony systems is 8000 samples per second. Each end of the link converts the analog signal to digital samples, and converts digital samples back to an analog signal, 8000 times per second as measured by its local crystal. However, these crystals may vary, so that the clock (which is derived from the crystal) in system A differs from the clock of system B. For example, if the difference between the two clocks is 125 ppm (parts per million) such that the clock in system A is faster than the clock in system B, then for every 8000 times that system A samples its analog signal, system B samples its analog signal only 7999 times (125:1000000=1:8000). Hence, every second, system A must play 8000 digital samples toward the analog TELCO, but receives only 7999 samples. Thus, after a long time, the receive buffer of system A will be empty. In other words, there is underflow.
System B experiences a similar, but opposite, phenomenon. Every second, as measured by the local clock of system B, 8001 samples will arrive from system A, but only 8000 samples will be converted into an analog signal toward its TELCO. Thus, after a long time, the receive buffer of system B will be full. In other words, there is overflow.
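The arithmetic of the foregoing example may be checked with a minimal sketch. The function names and the 160-sample buffer margin (20 ms of audio at 8000 samples per second) are hypothetical and are chosen only for illustration; exact fractions are used to avoid rounding error.

```python
from fractions import Fraction

NOMINAL_RATE = 8000  # telephony sampling rate, samples per second


def sample_drift_per_second(ppm, nominal_rate=NOMINAL_RATE):
    """Samples gained or lost per second for a clock offset given in ppm."""
    return Fraction(ppm, 1_000_000) * nominal_rate


def seconds_until_drift(buffer_samples, ppm, nominal_rate=NOMINAL_RATE):
    """Seconds for the drift to accumulate to buffer_samples samples."""
    return Fraction(buffer_samples) / sample_drift_per_second(ppm, nominal_rate)


# 125 ppm at 8000 samples/s is one sample per second (125:1000000 = 1:8000).
print(sample_drift_per_second(125))

# Hypothetical margin of 160 samples (20 ms of audio).
print(seconds_until_drift(160, 125))
```

Under these assumptions a 125 ppm offset consumes or fills a 20 ms margin in 160 seconds, so the underflow or overflow develops over minutes rather than over individual packets.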
Ignoring any other impairments between system A and system B, and assuming ideal processing, the clock difference between the two systems will cause a slight, but not audible, frequency shift. Frequency-sensitive applications, such as a high bit rate data modem, might experience degradation in the link quality due to the frequency shift. However, the discontinuity that is associated with inserting an additional sample when the receive buffer is empty, or deleting a sample when the receive buffer is full, causes degradation of the channel if the insertion or deletion is not performed properly. The present invention presents a simple method to overcome the discontinuity that is associated with the clock differences. This family of algorithms is often referred to as “Clock Recovery” algorithms.
Observation of the jitter buffer behavior over a long time can distinguish between two phenomena: delay jitter due to network impairments, and a change in the number of frames in the jitter buffer due to clock differences. Network delay jitter causes the number of frames or samples in the jitter buffer to vary with time, but the long-term average of the number of frames in the jitter buffer stays constant. A difference in clock rate, on the other hand, changes the long-term average of the number of samples in the jitter buffer. The average number of frames in the jitter buffer of the slower system will increase as time progresses, while the average number of frames in the jitter buffer of the faster system will decrease.
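The distinction drawn above may be illustrated by a minimal sketch that compares the average buffer occupancy over an early window with the average over a late window. This is only one simple way to estimate a long-term trend and is not asserted to be the method of the invention; the function names, the window length, and the threshold are hypothetical.

```python
from statistics import mean


def occupancy_trend(occupancy, window):
    """Mean of the last `window` observations minus the mean of the first."""
    if len(occupancy) < 2 * window:
        raise ValueError("need at least 2 * window observations")
    return mean(occupancy[-window:]) - mean(occupancy[:window])


def classify(occupancy, window, threshold):
    """Label an occupancy history using its long-term trend."""
    trend = occupancy_trend(occupancy, window)
    if trend > threshold:
        return "local clock slower"  # buffer fills over time
    if trend < -threshold:
        return "local clock faster"  # buffer drains over time
    return "jitter only"             # fluctuates around a constant mean


# Hypothetical frame counts observed in the jitter buffer.
jitter_only = [4, 5, 3, 4, 5, 3, 4, 5, 3, 4, 5, 3]
filling = [4, 5, 3, 5, 6, 4, 6, 7, 5, 7, 8, 6]

print(classify(jitter_only, window=6, threshold=1))
print(classify(filling, window=6, threshold=1))
```

Both hypothetical histories fluctuate from one observation to the next, but only the second shows a shift in its long-term average.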
Objects and Summary
A general object of an embodiment of the present invention is to provide a method that improves the play-out of a jitter buffer.
Another object of an embodiment of the present invention is to analyze incoming packets and adjust the nominal delay associated with a jitter buffer at an appropriate time.
Still another object of an embodiment of the present invention is to improve upon the performance of a jitter buffer operation by adding an additional stage to a jitter buffer algorithm that determines the appropriate moment to increase or decrease the nominal delay associated with the jitter buffer.
Briefly, and in accordance with at least one of the foregoing objects, an embodiment of the present invention provides a method and apparatus in which, in a voice-over-network system, incoming packets are analyzed and the appropriate moment to increase or decrease the nominal delay associated with a jitter buffer is determined. Hence, the nominal delay is adjusted at an appropriate moment based on network jitter characteristics. Preferably, the nominal delay is adjusted when voice activity is absent. The method and apparatus provide for improved play-out of the jitter buffer, and provide improved performance.
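The preference for adjusting the nominal delay when voice activity is absent may be illustrated by a minimal sketch in which a requested change is held pending until a frame without voice activity is seen. This is an illustrative assumption about how such deferral could be structured, not a description of the claimed apparatus; the class name `DelayAdjuster` and its methods are hypothetical.

```python
class DelayAdjuster:
    """Hold a requested nominal-delay change until a silent frame."""

    def __init__(self, nominal_delay):
        self.nominal_delay = nominal_delay  # current delay, in packet periods
        self.pending_delay = None           # requested delay, not yet applied

    def request(self, new_delay):
        """Record a delay change requested by the jitter monitoring stage."""
        self.pending_delay = new_delay

    def on_frame(self, voice_active):
        """Apply a pending change only when no voice activity is present.

        Returns the nominal delay in effect for this frame.
        """
        if self.pending_delay is not None and not voice_active:
            self.nominal_delay = self.pending_delay
            self.pending_delay = None
        return self.nominal_delay


adjuster = DelayAdjuster(nominal_delay=3)
adjuster.request(5)
# Hypothetical voice-activity flags for four consecutive frames.
for active in (True, True, False, True):
    print(active, adjuster.on_frame(active))
```

In this sketch the delay stays at its old value through the two talk frames and changes only at the silent frame, so the discontinuity falls where no voice is being played.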