It is well known in packet-based terminals and devices, such as wireless communications terminals (e.g., mobile and cellular telephones or personal communicators), PC-based terminals as well as IP telephony gateways, that an audio device requests data to be converted into audio at regular, fixed intervals. These intervals are not, however, synchronized to the reception of the data packets that contain the audio data. A given packet can contain one or more frames of data, where the length or duration of an audio signal contained within the frame is generally in the range of 20 ms to 30 ms (referred to herein generally as the “frame length”, although a temporal measure is intended, not a spatial measure). After reception, the audio data frame is typically stored into a jitter buffer to await its calculated playout time. The playout time is the time during which the frame of audio data is to be converted to an audio signal, such as by a digital-to-analog converter (DAC), then amplified and reproduced for a listener through a speaker or some other type of audio transducer. In the case of gateways and transcoders, the audio is typically sent to a sample-based circuit switched network. Because the audio device requests the frame data at times that are effectively random relative to the receipt of the audio packets, the data can be stored for a variable amount of time in the jitter buffer. The average storage time in the jitter buffer can be shown to be one half of the duration of the frame, in addition to the desired jitter buffer duration. For example, it can be demonstrated that if a packet resides in the jitter buffer first for a desired 10 ms, after which it is playable, the frame will nevertheless not be fetched until some time during the next 20 ms, resulting in an undesired average of 10 ms of additional storage time in the jitter buffer.
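The half-frame average noted above can be checked numerically. The following is a minimal sketch, not taken from any of the cited documents, which assumes that the audio device issues requests at integer multiples of the frame length and that the instant at which a frame becomes playable is uniformly distributed over one frame period. The function names `extra_wait_ms` and `mean_extra_wait_ms` are hypothetical.

```python
import math


def extra_wait_ms(playable_at_ms: float, frame_ms: float) -> float:
    """Time from when a frame becomes playable until the next device request.

    Assumption: the audio device requests a frame at every integer
    multiple of frame_ms (0, frame_ms, 2 * frame_ms, ...).
    """
    next_request_ms = math.ceil(playable_at_ms / frame_ms) * frame_ms
    return next_request_ms - playable_at_ms


def mean_extra_wait_ms(frame_ms: float = 20.0, steps: int = 2000) -> float:
    """Average the extra wait over evenly spaced phases within one frame period."""
    total = 0.0
    for i in range(steps):
        # Midpoint sampling of the phase, strictly inside (0, frame_ms).
        phase_ms = (i + 0.5) * frame_ms / steps
        total += extra_wait_ms(phase_ms, frame_ms)
    return total / steps
```

With a 20 ms frame the computed mean is approximately 10 ms, i.e. one half of the frame length, which is incurred in addition to the desired jitter buffer duration.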
A problem arises because in modern voice terminals and similar devices, such as IP telephony gateways, the audio device is synchronized to some local frequency source. The frequency source may be, for example, an oscillator or a telephone network clock signal. However, in packet-based terminals, the packets containing the voice data arrive at a rate that is independent of, and asynchronous to, the frequency source that drives the audio device. The difference between the rate of IP packet arrival and the rate at which the audio device requests frames of voice data can create an undesirable and variable end-to-end delay, also referred to as “synchronization delay”, which can be as great as a packet length in duration. Voice-over-IP (VoIP) applications can be especially detrimentally affected by synchronization delay-induced problems.
Furthermore, due to slight differences in clock rates, the difference between the rate of IP packet arrival and the rate at which the audio device requests frames of voice data can vary over time, thus constituting a continuous re-synchronization problem. Typically, transmitter and receiver clocks running at different frequencies repeatedly introduce an underflow or overflow situation in the jitter buffer of a VoIP receiver. Because even short gaps or discontinuities in the audio playback cannot be tolerated, the receiver needs to react to this condition. In practice, the receiver needs to perform re-synchronization, either by artificially generating a short segment of extra signal in the case of underflow, or by discarding some of the received signal in the case of overflow. However, the re-synchronization process should be performed with great care in order to avoid generating audible discontinuities in the reconstructed speech signal.
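To give a sense of scale for the clock-rate mismatch, the following minimal sketch estimates how long it takes for the transmitter and receiver clocks to slip by one full frame. The drift figure used in the example is an illustrative assumption rather than a value from this document, and the function name `seconds_until_one_frame_slip` is hypothetical.

```python
def seconds_until_one_frame_slip(frame_ms: float, drift_ppm: float) -> float:
    """Time for two clocks to diverge by one frame length.

    drift_ppm is the relative frequency offset between the transmitter
    and receiver clocks in parts per million (an assumed input). Only
    the magnitude matters for the slip time; the sign decides whether
    the jitter buffer tends toward overflow or underflow.
    """
    if drift_ppm == 0:
        return float("inf")  # perfectly matched clocks never slip
    return (frame_ms / 1000.0) / (abs(drift_ppm) * 1e-6)
```

For example, with a 20 ms frame and an assumed 50 ppm offset the clocks slip by one frame roughly every 400 seconds, so that a receiver that does nothing would encounter an underflow or overflow condition at about that interval.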
In EP 0 921 666 A2 Ward et al. are said to reduce degradation in packetized voice communications that are received by a non-synchronized entity from a packet network by adjusting a depth of storage of a jitter buffer in the receiver. Units of voice sample data are stored in the jitter buffer as they are received. From time to time the rate of extraction of the stored units from the jitter buffer is accelerated by extracting two units, but delivering only one, or is retarded by not extracting a unit, while delivering a substitute unit in its place. This technique is said to control the depth of storage in response to packet reception events such that the delay is minimized, while providing a sufficient amount of delay to smooth the variances between packet reception events.
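The extraction behavior attributed above to Ward et al. can be illustrated roughly as follows. This is a sketch based only on the summary given here, not on the referenced publication itself. The function name `deliver`, the mode names, the choice of which of the two extracted units is delivered, and the handling of an empty buffer are all assumptions made for illustration.

```python
from collections import deque


def deliver(buffer: deque, mode: str = "normal", substitute=b"\x00"):
    """Return the unit to be played for one extraction event.

    mode == "normal":     extract one unit and deliver it.
    mode == "accelerate": extract two units but deliver only the first,
                          so the depth of storage shrinks by one extra unit.
    mode == "retard":     extract nothing and deliver a substitute unit,
                          so the depth of storage is not reduced.
    An empty buffer is treated like "retard" (an assumption).
    """
    if mode == "retard" or not buffer:
        return substitute
    unit = buffer.popleft()
    if mode == "accelerate" and buffer:
        buffer.popleft()  # second unit is extracted but discarded
    return unit
```

In this sketch an accelerated extraction reduces the buffered delay by one unit, while a retarded extraction preserves the current depth as new units continue to arrive.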
In WO 01/11832 A1 Nakabayashi describes the use of a receive buffer that stores packets received from a network interface, and a reproduction controller that refers to the state of the receive buffer to carry out a sound reproduction operation. A decoder receives the stored data, and the decoded data is provided to a DAC that is clocked by a reproduce clock. The process is said to prevent the underflow and overflow of the receive buffer due to clock differences between the transmitter and the receiver, and to prevent packet jitter that results in sound dropouts.
In U.S. Pat. No. 6,181,712 B1 Rosengren describes transmitting packets from an input stream to an output stream. When multiplexing transport streams, packet jitter may be introduced to the extent that decoder buffers can underflow or overflow. To avoid this, a time window is associated with a data packet and position information is provided in the packet concerning the position of the packet within the window.
The foregoing prior art techniques do not provide an adequate solution to the synchronization delay problem in VoIP and other applications.
Commonly assigned U.S. patent application Ser. No. 09/946,066, filed Sep. 4, 2001, entitled “Method and Apparatus for Reducing Synchronization Delay in Packet-Based Voice Terminals”, by Jari Selin, describes a system and method wherein synchronization is performed at the start of a talk spurt, and not continuously.
Commonly assigned U.S. patent application Ser. No. 10/189,068, filed Jul. 2, 2002, entitled “Method and Apparatus for Reducing Synchronization Delay in Packet-Based Voice Terminals by Resynchronizing During Talk Spurts”, by Ari Lakaniemi, Jari Selin and Pasi Ojala, which is a continuation-in-part of the foregoing application, describes a method that operates, when a frame containing audio data is sent to a decoder, by measuring the synchronization delay, determining by how much the synchronization delay should be adjusted and adjusting the synchronization delay in a content-aware manner by adding or removing one or more audio samples in a selected current frame, or in a selected subsequent frame, so as not to significantly degrade the quality of the played back audio data. When the synchronization delay is adjusted by more than one audio sample, the adjustment can be made by all of the determined audio samples in one adjustment, or the adjustment can be made by less than all of the determined audio samples by using a plurality of adjustments. The adjusting operation selects, if possible, an unvoiced frame and discriminates against a transient frame. The determining operation can include measuring an average amount of time that a frame resides in the jitter buffer, and adjusting the synchronization delay so that the average duration approaches a desired jitter buffer residency duration.
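The steps summarized above (determining the adjustment from the average jitter buffer residency, optionally splitting it into several smaller adjustments, and preferring an unvoiced frame over a transient frame) can be sketched as follows. This is an illustrative sketch under stated assumptions and is not the implementation of the referenced application. The function names, the frame class labels "unvoiced", "voiced" and "transient", the sampling rate used in the example and the per-step limit are all hypothetical.

```python
def delay_error_samples(avg_residency_ms: float, desired_ms: float,
                        sample_rate_hz: int) -> int:
    """Difference between measured and desired residency, in samples.

    A positive result means frames stay in the jitter buffer longer
    than desired (sign convention assumed for this sketch).
    """
    return round((avg_residency_ms - desired_ms) * sample_rate_hz / 1000.0)


def split_adjustment(total_samples: int, max_per_step: int) -> list:
    """Split one adjustment into several steps of at most max_per_step samples."""
    if max_per_step <= 0:
        raise ValueError("max_per_step must be positive")
    sign = 1 if total_samples >= 0 else -1
    remaining = abs(total_samples)
    steps = []
    while remaining > 0:
        step = min(remaining, max_per_step)
        steps.append(sign * step)
        remaining -= step
    return steps


def pick_frame(frame_classes: list):
    """Return the index of the frame to adjust, or None if none is suitable.

    An unvoiced frame is preferred, a voiced frame is the fallback, and
    a transient frame is never selected (assumed class labels).
    """
    for wanted in ("unvoiced", "voiced"):
        for index, frame_class in enumerate(frame_classes):
            if frame_class == wanted:
                return index
    return None
```

For example, assuming 8 kHz sampling, a measured average residency of 20 ms against a desired 10 ms corresponds to 80 samples, which `split_adjustment(80, 32)` divides into steps of 32, 32 and 16 samples that could each be applied to a frame chosen by `pick_frame`.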
While the two foregoing approaches provide a significant advance over the prior art, it would be desirable to further enhance the ability of a packet-based terminal to overcome the problems related to synchronization delay, and thereby to further improve the quality of audio delivered to a user.