It is well known in packet-based terminals and devices, such as wireless communications terminals (e.g., mobile and cellular telephones or personal communicators), PC-based terminals and IP telephony gateways, that an audio device requests data to be converted into audio at regular, fixed intervals. These intervals are not, however, synchronized to the reception of the data packets that contain the audio data. A given packet can contain one or more frames of data, where the length or duration of the audio signal contained within a frame is generally in the range of 20 ms to 30 ms (referred to herein generally as the “frame length”, although a temporal measure is intended, not a spatial measure). After reception, the audio data frame is typically stored in a jitter buffer to await its calculated playout time. The playout time is the time during which the frame of audio data is to be converted to an audio signal, such as by a digital-to-analog converter (DAC), then amplified and reproduced for a listener through a speaker or some other type of audio transducer. In the case of gateways and transcoders, the audio is typically sent to a sample-based circuit switched network. Because the audio device requests the frame data at instants that are effectively random relative to the receipt of the audio packets, the data can be stored for a variable amount of time in the jitter buffer. The average storage time in the jitter buffer can be shown to be the desired jitter buffer duration plus one half of the duration of the frame. FIG. 2 demonstrates this: the packet first resides in the jitter buffer for the desired 10 ms, after which it is playable. The frame, however, will be fetched at some time during the next 20 ms, resulting in an undesired average of 10 ms of additional storage time in the jitter buffer.
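The arithmetic behind the 10 ms figure can be checked with a minimal sketch. It assumes, purely for illustration, that the fetch instant of the audio device is uniformly distributed over one frame period after the frame becomes playable; the function name `mean_storage_ms` and its parameters are hypothetical and do not come from the described system.

```python
def mean_storage_ms(jitter_target_ms: float, frame_ms: float, steps: int = 1000) -> float:
    """Average time a frame spends in the jitter buffer, in milliseconds.

    Assumption: the audio device fetches the frame at an instant that is
    uniformly distributed over one frame period after the frame has waited
    for the desired jitter buffer duration.
    """
    # Sample the fetch offset at evenly spaced midpoints across one frame period.
    offsets = [(i + 0.5) * frame_ms / steps for i in range(steps)]
    # The mean offset is the undesired extra wait (about frame_ms / 2).
    extra_wait_ms = sum(offsets) / steps
    return jitter_target_ms + extra_wait_ms


# Values from the FIG. 2 discussion: 10 ms desired buffering, 20 ms frames.
print(mean_storage_ms(10.0, 20.0))
```

With a 10 ms target and 20 ms frames the result is approximately 20 ms, that is, the desired 10 ms plus an average of 10 ms of additional storage.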
A problem arises because in modern voice terminals and similar devices, such as IP telephony gateways, the audio device is synchronized to some local frequency source. The frequency source may be, for example, an oscillator or a telephone network clock signal. However, in packet-based terminals, the packets containing the voice data arrive at a rate that is independent of and asynchronous to the frequency source that drives the audio device. The difference between the rate of IP packet arrival and the rate at which the audio device requests frames of voice data can create an undesirable and variable “synchronization delay”.
Furthermore, due to slight differences in clock rates, the difference between the rate of IP packet arrival and the rate at which the audio device requests frames of voice data can vary over time, thus constituting a continuous re-synchronization problem.
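How a small clock-rate difference turns into a slowly changing synchronization delay can be illustrated with a short sketch. All names and values here (`sync_delay_ms`, a 50 ppm rate difference, a 5 ms initial offset) are assumptions chosen for illustration, and network jitter is ignored so that only the clock effect is visible.

```python
import math


def sync_delay_ms(n: int, frame_ms: float = 20.0, drift_ppm: float = 50.0,
                  initial_offset_ms: float = 5.0) -> float:
    """Delay between the arrival of frame n and the next audio-device fetch.

    Assumptions: frames arrive exactly every frame_ms (no network jitter),
    and the audio device fetches every frame_ms * (1 + drift_ppm * 1e-6),
    starting initial_offset_ms after the first arrival.
    """
    arrival_ms = n * frame_ms
    fetch_period_ms = frame_ms * (1.0 + drift_ppm * 1e-6)
    # Index of the first fetch that occurs at or after the arrival.
    k = max(0, math.ceil((arrival_ms - initial_offset_ms) / fetch_period_ms))
    return initial_offset_ms + k * fetch_period_ms - arrival_ms


# The delay grows by roughly 1 ms over 1000 frames (20 s of audio).
for n in (0, 500, 1000):
    print(n, round(sync_delay_ms(n), 3))
```

Under these assumptions the delay creeps from about 5 ms to about 6 ms over 20 seconds, which is why a single synchronization at the start of a long talk spurt may not remain valid.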
The prior commonly assigned application referred to above, of which this application is a continuation-in-part, describes a system and method wherein synchronization is performed at the start of a talk spurt, and not continuously. However, with long talk spurts this may be a less than optimal approach if the synchronization cannot be performed in a timely manner. Furthermore, this is a problem that can be difficult to handle in a controlled way if the speech codec is used without silence compression.
In EP 0 921 666 A2 Ward et al. are said to reduce degradation in packetized voice communications that are received by a non-synchronized entity from a packet network by adjusting a depth of storage of a jitter buffer in the receiver. Units of voice sample data are stored in the jitter buffer as they are received. From time to time the rate of extraction of the stored units from the jitter buffer is accelerated by extracting two units, but delivering only one, or is retarded by not extracting a unit, while delivering a substitute unit in its place. This technique is said to control the depth of storage in response to packet reception events such that the delay is minimized, while providing a sufficient amount of delay to smooth the variances between packet reception events.
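The extraction behavior described above can be sketched as follows. This is only an illustration of the general idea, not the method of the cited reference: the function `deliver_unit`, the fixed `target_depth` threshold, the choice of which of the two extracted units is discarded, and the decision to adjust on every call rather than from time to time are all assumptions.

```python
from collections import deque


def deliver_unit(buffer: deque, target_depth: int, substitute):
    """Return one unit for playout while nudging the buffer depth toward target_depth.

    Assumed policy: accelerate when the buffer is deeper than the target,
    retard when it is shallower, and otherwise extract normally.
    """
    if len(buffer) > target_depth and len(buffer) >= 2:
        # Accelerate: extract two units but deliver only one.
        delivered = buffer.popleft()
        buffer.popleft()  # second unit is discarded (assumed choice)
        return delivered
    if len(buffer) < target_depth:
        # Retard: extract nothing and deliver a substitute unit instead.
        return substitute
    return buffer.popleft()


units = deque(["u1", "u2", "u3", "u4"])
print(deliver_unit(units, 2, "fill"), list(units))
print(deliver_unit(units, 2, "fill"), list(units))
print(deliver_unit(units, 2, "fill"), list(units))
```

In the three calls shown, the first accelerates (depth 4 drops to 2), the second extracts normally, and the third delivers the substitute because the depth has fallen below the target.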
In WO 01/11832 A1 Nakabayashi describes the use of a receive buffer that stores packets received from a network interface, and a reproduction controller that refers to the state of the receive buffer to carry out a sound reproduction operation. A decoder receives the stored data, and the decoded data is provided to a DAC that is clocked by a reproduce clock. The process is said to prevent the underflow and overflow of the receive buffer due to clock differences between the transmitter and the receiver, and to prevent packet jitter that results in sound dropouts.
In U.S. Pat. No. 6,181,712 B1 Rosengren describes transmitting packets from an input stream to an output stream. When multiplexing transport streams, packet jitter may be introduced to the extent that decoder buffers can underflow or overflow. To avoid this, a time window is associated with a data packet and position information is provided in the packet concerning the position of the packet within the window.