Network jitter and packet loss conditions can cause a degradation in quality for conversational audio service in packet switched networks, such as the Internet. The nature of packet switched communications typically introduces variation in the transmission times of the packets, known as jitter, which is seen by the receiver as packets transmitted at regular intervals arriving at irregular intervals. On the other hand, an audio playback device requires a constant, uninterrupted input in order to ensure good sound quality. Thus, if some packets arrive after they are needed for decoding and playback, the decoder is forced to consider those packets as lost and to perform error concealment.
Typically a jitter buffer is utilized to store incoming frames for a short period of time (e.g. by applying a predetermined buffering delay to the first packet of a stream) to hide the irregular arrival times and provide constant input to the decoder and audio playback device. However, jitter buffers typically store a number of received packets before the decoding process. This introduces an additional delay component and thereby increases the overall end-to-end delay in the communication chain. Such buffering can be characterized by the (average) buffering delay and the resulting proportion of delayed frames.
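The trade-off between buffering delay and the proportion of delayed frames can be illustrated with a fixed playout schedule: each frame must arrive before its scheduled decode time, and any frame that misses its slot is treated as lost. A minimal sketch, assuming 20 ms frames and hypothetical arrival times (all names and numbers here are illustrative, not from the text):

```python
FRAME_MS = 20  # nominal frame interval of the sender

def late_frames(arrival_ms, buffering_delay_ms):
    """Count frames that arrive after their scheduled playout time.

    arrival_ms[i] is the arrival time of frame i relative to the
    arrival of the first frame. Playout of frame i is scheduled at
    buffering_delay_ms + i * FRAME_MS after the first arrival.
    """
    late = 0
    for i, t in enumerate(arrival_ms):
        playout = buffering_delay_ms + i * FRAME_MS
        if t > playout:
            late += 1  # the decoder must conceal this frame
    return late

# Frames sent every 20 ms but arriving with jitter (and reordering):
arrivals = [0, 25, 38, 105, 80, 101, 140]
assert late_frames(arrivals, 40) == 1  # frame 3 misses its slot (105 > 100)
assert late_frames(arrivals, 60) == 0  # a larger delay hides the jitter
```

A larger fixed buffering delay reduces the late-frame count only at the cost of added end-to-end delay, which is exactly the compromise discussed below.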
A shortcoming of this basic approach is that a jitter buffer with fixed playback timing (i.e. a fixed buffer) is inevitably a compromise between a low enough buffering delay and a low enough number of delayed frames, and finding an optimal trade-off is not a trivial task. There can be special environments and applications where the amount of expected jitter can be estimated to remain within predetermined limits. In general, however, the network delay associated with jitter may vary from near zero to hundreds of milliseconds, even within the same session. Therefore, even in cases where the properties of the transmission channel are well known, it is generally not possible to set the initial buffering delay (applied to the first frame of a session) in such a way that the buffering performance throughout the session is optimised. If the initial buffering delay of a fixed buffer is set to a value large enough to cover the jitter in the expected worst-case scenario, the number of delayed packets remains under control, but at the same time there is a risk of introducing an end-to-end delay too long to enable natural conversation. Therefore, applying a fixed buffer is far from optimal in most audio transmission applications operating over a packet switched network. For example, FIG. 1a depicts a situation where a fixed buffer is applied. Since the initial buffering delay was selected to be too small, frames 101, 102 arrive after they are needed for further processing (e.g. playback or onward transmission) and are therefore lost due to late arrival.
An adaptive jitter buffer can be used to dynamically control the balance between short enough delay and low enough number of delayed frames. In this approach the entity controlling the buffer constantly monitors the incoming packet stream and adjusts the buffering delay according to observed changes in the delay behaviour. If the transmission delay appears to increase or the jitter condition deteriorates, the buffering can be increased to meet the prevailing network conditions. In the opposite situation, where the network jitter condition improves, the buffering delay can be reduced, and hence, the overall end-to-end delay can be minimised.
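One simple way such a controlling entity could derive its target buffering delay is to track a high percentile of recently observed network delays, so that the buffer covers most of the prevailing jitter without following every outlier. The window size and percentile below are illustrative assumptions, not values from the text:

```python
from collections import deque

class AdaptiveJitterEstimator:
    """Illustrative sketch: target delay tracks a high percentile
    of recently observed one-way network delays (in milliseconds)."""

    def __init__(self, window=100, percentile=0.95):
        self.delays = deque(maxlen=window)  # sliding observation window
        self.percentile = percentile

    def observe(self, network_delay_ms):
        self.delays.append(network_delay_ms)

    def target_delay_ms(self):
        """Buffering delay large enough to cover most observed jitter."""
        if not self.delays:
            return 0
        ranked = sorted(self.delays)
        idx = int(self.percentile * (len(ranked) - 1))
        return ranked[idx]

est = AdaptiveJitterEstimator(window=10)
for d in [20, 22, 25, 21, 60, 23, 24, 22, 26, 25]:
    est.observe(d)
# The target covers the bulk of the delays but not the single outlier:
assert est.target_delay_ms() == 26
```

As the window fills with higher (or lower) delays, the target rises (or falls), mirroring the buffer adaptation described above; the decision of when and how to apply the new target is a separate problem, addressed next.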
Since the audio playback device typically needs regular input, buffer adjustment is not a straightforward task. A problem arises from the fact that if the buffering is reduced, the audio signal given to the playback device needs to be shortened to compensate for the reduced buffering; conversely, if the buffering is increased, a segment of audio signal needs to be inserted.
Prior-art methods for modifying the signal when increasing or decreasing the buffer typically consist of repeating or discarding comfort noise frames within the buffer between active speech bursts. More advanced solutions may operate during active speech regions, utilizing time-domain signal time scaling either inside the decoder or as a post-processing stage after the decoder in order to change the length of the output audio. In this approach, the buffer size is reduced when frames are temporarily retrieved more frequently due to faster play-out; conversely, the buffer size is increased when frame play-out is temporarily slowed down.
The challenge in time scale modification during active signal content is to keep the perceived audio quality at a good enough level. Pitch-synchronous mechanisms, such as Pitch Synchronous Overlap-Add (PSOLA), are typically used to provide time scale modifications with good voice quality at relatively low complexity. In practice this usually means either repeating or removing full pitch periods of the signal and 'smoothing' the point of discontinuity to hide the possible quality effect caused by the time scale modifications, as exemplarily depicted in FIGS. 1b, 1c and 1d. Synchronous methods provide good results when used with monophonic and quasi-periodic signals, such as speech. An especially favourable approach to providing high-quality time scaling capability is to combine a pitch-synchronous scaling technique with the speech decoder. For example, such a combination used in conjunction with speech codecs such as Adaptive Multi-Rate (AMR) provides clear benefits in terms of low processing load.
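The removal of a full pitch period with smoothing at the splice point can be sketched as follows. This is an illustrative simplification in the spirit of PSOLA, not a production implementation: the pitch period is assumed to be already known (e.g. from the decoder), and the cross-fade length is an arbitrary choice.

```python
def remove_pitch_period(signal, period, fade=16):
    """Shorten a signal by one pitch period, cross-fading 'fade'
    samples across the splice to smooth the point of discontinuity."""
    assert len(signal) >= 2 * period + fade
    out = list(signal[:period])  # keep the first period up to the splice
    # Linearly cross-fade from the removed region into what follows it.
    for i in range(fade):
        w = i / fade
        out.append((1 - w) * signal[period + i] + w * signal[2 * period + i])
    out.extend(signal[2 * period + fade:])
    return out

x = [float(i % 8) for i in range(64)]   # quasi-periodic test signal, period 8
y = remove_pitch_period(x, period=8, fade=4)
assert len(y) == len(x) - 8             # exactly one pitch period shorter
assert y[:8] == x[:8]                   # signal before the splice is intact
```

For a perfectly periodic signal, as here, the cross-faded samples coincide with the original waveform, which is why the method works well on quasi-periodic content such as voiced speech; repeating (rather than removing) a period to lengthen the signal follows the same pattern.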
The time scaling method described above, enabling variable speech and audio playback for jitter buffer management, is mainly applicable to terminal and transcoder equipment. That is, to be able to perform the time scaling operation on a sample-by-sample basis in the time domain, the (decoded) time-domain signal needs to be available. However, if the IP transport terminates without transcoding, e.g. in transcoder free operation (TrFO) or tandem free operation (TFO) in a media gateway (MGW) or a terminal equipment, time scaling on a sample basis cannot be utilized. A MGW or a terminal without transcoding can only adapt on a frame basis, i.e. by inserting concealment frames or dropping frames, preferably during comfort noise periods. This results in decreased flexibility of the jitter buffer adaptation scheme, and time scaling during active speech is not possible without a severe risk of voice quality distortion.
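Such frame-based adaptation might be sketched as follows; frame contents are treated as opaque (no decoding is available without transcoding), and the helper and its names are hypothetical illustrations rather than an actual MGW interface:

```python
def adapt_frame_based(frames, is_speech, target_change):
    """Adapt buffer occupancy on a frame basis only.

    target_change is +1 (grow the buffer by one frame), -1 (shrink it
    by one frame), or 0. The adaptation is applied at the first
    non-speech frame to limit the audible impact, since the frames
    themselves cannot be time-scaled without a decoded signal.
    """
    out = list(frames)
    for i, speech in enumerate(is_speech):
        if not speech:
            if target_change > 0:
                out.insert(i, out[i])  # repeat a comfort-noise frame
            elif target_change < 0:
                del out[i]             # drop a comfort-noise frame
            break
    return out

frames = ["s0", "s1", "cn", "cn", "s2"]   # opaque speech/comfort-noise frames
speech = [True, True, False, False, True]
assert adapt_frame_based(frames, speech, +1) == ["s0", "s1", "cn", "cn", "cn", "s2"]
assert adapt_frame_based(frames, speech, -1) == ["s0", "s1", "cn", "s2"]
```

The granularity is a whole frame (e.g. 20 ms), rather than individual samples, which is the loss of flexibility noted above.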