The present invention relates generally to improving the quality of audio data sent over packet switched networks, and more particularly to preventing overflow/underflow at a jitter buffer to minimize undesirable audio effects.
Various methods have been developed to deliver real-time audio data from an audio source to a destination over packet-switched networks. These techniques, such as voice over Internet Protocol (VoIP) and voice over frame relay (VoFR), often provide a number of advantages over circuit-switched networks, such as a public switched telephone network (PSTN). For one, transporting audio data over a packet-switched network often costs less than transporting it over a circuit-switched network. Another advantage is that packet-switched networks often provide redundant paths, since there are typically numerous possible paths between the source and destination, thereby making packet-switched audio transmission techniques more resilient to node failures in a network.
However, in spite of the many advantages offered by transmitting audio data via packet-switched networks, a number of drawbacks exist. Referring now to FIG. 1, some of the limitations of known packet-switched audio transmission methods are illustrated. Two limitations typically exhibited by packet-switched networks are latency and jitter. Since the protocols often used to transport packet-switched voice data, such as IP, generally do not provide any type of quality of service (QoS), there is no guarantee that the packets containing the audio data will be transported to a destination either in order or within a maximum time period. Likewise, there typically is no guarantee that the packets will have the same latency, often resulting in jitter, where the term jitter commonly refers to the variation in latency between packets.
To illustrate, assume that system 100 includes an audio source 110 and a destination system 130. The audio source 110 and the destination system 130 can include any of a variety of data terminal devices, such as a personal computer, a laptop computer, a digital telephone, a video teleconferencing system, and the like. In this case, the audio source 110 can encode an audio source signal into digital audio data using an audio encoder 106, packetize the audio data into packets 102, 104, and transmit the packets 102, 104 via a network 120 to the destination system 130 for decoding and subsequent output as an audio signal by an audio decoder 134.
However, networks, such as the Internet, often have numerous possible paths between a source and a destination, as well as varying traffic and changes in the statuses of nodes. As a result, packet 102 could take a different path through network 120 than the path taken by packet 104, and there could be a significant difference between the transmission latency of packet 102 and the transmission latency of packet 104. For example, assume that the audio source 110 transmits packet 104 50 milliseconds (time 112) after packet 102. However, in this example, packet 102 is transmitted along path 122 of network 120 for a transmission time of 250 ms, and packet 104 is transmitted along path 144 of network 120 for a transmission time of 300 ms. As a result, packet 102 is received by the destination system 130 100 ms (time 114) before packet 104 is received, rather than the original 50 ms time difference (time 112). This latency can vary for subsequent packets and can even cause packets to arrive out of order at the destination system 130.
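The arithmetic of the preceding example can be sketched as follows. This is a minimal illustration only; the function name `arrival_gap_ms` and its parameters are hypothetical and are not part of any system described herein. It assumes the first packet is sent at time zero and that each packet experiences a single fixed transmission latency.

```python
def arrival_gap_ms(send_gap_ms, latency_first_ms, latency_second_ms):
    """Return how long after the first packet the second packet arrives (ms).

    A negative result means the second packet arrives before the first,
    i.e. the packets arrive out of order.
    """
    arrival_first = latency_first_ms                  # first packet sent at t = 0
    arrival_second = send_gap_ms + latency_second_ms  # second packet sent later
    return arrival_second - arrival_first


# Values from the example: sent 50 ms apart, latencies of 250 ms and 300 ms.
gap = arrival_gap_ms(50, 250, 300)
added_jitter = gap - 50  # change in spacing introduced by the network
print(gap, added_jitter)
```

With the values from the example, the spacing between the two packets grows from 50 ms at the source to 100 ms at the destination. If the second packet instead had the lower latency by more than the send gap, the result would be negative, corresponding to out-of-order arrival.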
As a result of the potential for jitter on packet-switched networks, many destination systems, such as destination system 130, typically implement a jitter buffer 132 to buffer the incoming packets to minimize the effects of the jitter introduced by the network 120. However, because of the varying latencies of packets transmitted to the destination system 130 and stored in the jitter buffer 132, the jitter buffer 132 can overflow or underflow, resulting in a significant delay before the jitter buffer 132 can pass audio data to the audio decoder 134 for output. In voice/audio applications, this delay often introduces undesirable audio components to the output, such as audible clicks, that degrade the quality of the audio output. For example, a jitter buffer overflow often results in dropped packets, thereby causing a discontinuity in the output. Underflow of a jitter buffer typically results in a shift in the time domain of the audio output while the jitter buffer replenishes, causing silence at the output during the replenishment period.
The potential for overflow/underflow of the jitter buffer 132 often is exacerbated when there is a difference between the sampling rate of the audio encoder 106 at the audio source 110 and the sampling rate of the audio decoder 134 at the destination system 130; this difference is referred to herein as clock skew. The clock skew can result from a difference between the nominal sampling rates of the audio encoder 106 and the audio decoder 134. For example, the audio encoder 106 could be set to encode an audio signal at a rate of 8,000 samples per second and the audio decoder 134 could be set to decode audio data at a rate of 7,900 samples per second, resulting in a clock skew of 100 samples per second. Alternatively, the clock skew can be a result of the variance of the clocks used by the audio encoder 106 and the audio decoder 134. For example, both the audio encoder 106 and the audio decoder 134 could be adapted to encode/decode at a rate of 8,000 samples per second. However, because of a 0.1% variance in each of the clocks of the encoder 106 and the decoder 134, there could be a total clock skew of between 0.0% and 0.2%, or 0 to 16 samples per second.
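The two sources of clock skew described above can be worked through numerically. The following is a sketch under stated assumptions: the function names are hypothetical, and the worst case is taken to be both clocks deviating by the full tolerance in opposite directions.

```python
def nominal_skew(encoder_rate_hz, decoder_rate_hz):
    """Skew (samples/second) caused by different nominal sampling rates."""
    return encoder_rate_hz - decoder_rate_hz


def worst_case_skew(nominal_rate_hz, tolerance):
    """Largest skew (samples/second) when both clocks share a nominal rate.

    Assumes each clock may deviate by up to `tolerance` (a fraction, e.g.
    0.001 for 0.1%) and that the two clocks deviate in opposite directions.
    """
    return 2 * tolerance * nominal_rate_hz


# First example: 8,000 samples/s encoder, 7,900 samples/s decoder.
print(nominal_skew(8000, 7900))

# Second example: both nominally 8,000 samples/s, each clock within 0.1%.
print(worst_case_skew(8000, 0.001))
```

The first call reproduces the skew of 100 samples per second, and the second reproduces the upper bound of about 16 samples per second. The lower bound of zero corresponds to both clocks deviating by the same amount in the same direction.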
In the event that the sampling rate of the audio decoder 134 is greater than the sampling rate of the audio encoder 106, the jitter buffer 132 generally would eventually underflow since packets are output from the jitter buffer 132 at a greater rate than packets are input. As a result, the jitter buffer 132 typically would run out of audio data to supply to the audio decoder 134, resulting in periods of silence until the jitter buffer 132 repopulates. Likewise, if the sampling rate of the audio decoder 134 is less than the sampling rate of the audio encoder 106, then the jitter buffer 132 would eventually overflow since packets are input to the jitter buffer 132 faster than previously received packets are removed, resulting in dropped packets and degraded output audio quality.
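The effect of a sustained clock skew on the fill level of a jitter buffer can be estimated with a simple model. This is an illustrative sketch only: the function name, the buffer capacity of 1,600 samples, and the starting fill of 800 samples are hypothetical values chosen for the example, and the model ignores network jitter and packet boundaries and treats the buffer as filling and draining at constant rates.

```python
def seconds_until_fault(encoder_rate_hz, decoder_rate_hz,
                        fill_samples, capacity_samples):
    """Estimate which fault occurs and after how many seconds.

    The buffer gains (encoder_rate_hz - decoder_rate_hz) samples per second.
    A positive net rate fills the buffer toward overflow; a negative net
    rate drains it toward underflow.
    """
    skew = encoder_rate_hz - decoder_rate_hz
    if skew > 0:
        return ("overflow", (capacity_samples - fill_samples) / skew)
    if skew < 0:
        return ("underflow", fill_samples / -skew)
    return ("stable", float("inf"))


# Decoder slower than encoder (7,900 vs 8,000 samples/s): buffer fills up.
print(seconds_until_fault(8000, 7900, fill_samples=800, capacity_samples=1600))

# Decoder faster than encoder (8,016 vs 8,000 samples/s): buffer drains.
print(seconds_until_fault(8000, 8016, fill_samples=800, capacity_samples=1600))
```

Under these assumed buffer sizes, a skew of 100 samples per second exhausts the remaining headroom in 8 seconds, while a skew of only 16 samples per second still empties the buffer in 50 seconds, which suggests that even a small uncorrected skew eventually causes a fault.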
Accordingly, a system and/or method to compensate for a clock skew between an audio encoder and an audio decoder would be advantageous.