The development of telecommunication networks and services is steadily leading to a situation in which all services are eventually provided over packet-switched networks. Transporting real-time services over a packet-switched network, especially over wireless links, is challenging because of variable transmission delay and packet losses. To enable, e.g., real-time bi-directional audio services, a buffering scheme is needed at the receiving side to mitigate the delay variations, i.e. network jitter.
Network jitter, caused by the variation in transmission times of the packets, is seen by the receiver as packets arriving at irregular intervals. An audio playback device, on the other hand, requires constant input to maintain good sound quality, and no interruptions can be allowed. Typically a jitter buffer is utilized to store incoming frames for a short period of time to hide the irregular arrival times and provide constant input to the decoder and audio playback device. The jitter buffer, however, adds a delay component that increases the end-to-end delay, since the received packets are stored before the decoding process. Furthermore, a jitter buffer with a fixed delay is inevitably a compromise between a short enough buffering delay and a low enough number of frames that arrive too late to be played out.
To alleviate the problems of a fixed delay jitter buffer, an adaptive jitter buffer can be used for dynamically controlling the balance between short enough delay and low enough number of delayed frames. Thus, the buffering delay is adjusted according to observed changes in the delay behavior. If the transmission delay seems to increase or the jitter condition is getting worse, the buffering is increased to meet the network conditions; if the network jitter condition is improving, the buffering can be reduced. As a result, the overall end-to-end delay is minimized.
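The adaptation described above can be illustrated with a minimal sketch. It assumes one common heuristic, not necessarily the one used in any particular system: the target buffering delay is a smoothed estimate of the transit delay plus a multiple of its smoothed variation. The class name `AdaptiveDelayEstimator`, the smoothing factor `alpha`, and the headroom factor `safety` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveDelayEstimator:
    """Tracks a smoothed transit delay and its variation (all values in ms)."""

    alpha: float = 0.9    # smoothing factor (assumed value)
    safety: float = 4.0   # multiples of the variation kept as headroom (assumed value)
    mean_delay: float = 0.0
    variation: float = 0.0
    initialized: bool = False

    def update(self, transit_delay_ms: float) -> float:
        """Feed one observed transit delay and return the new target delay."""
        if not self.initialized:
            # The first observation seeds the mean; no variation is known yet.
            self.mean_delay = transit_delay_ms
            self.initialized = True
        else:
            deviation = abs(transit_delay_ms - self.mean_delay)
            self.mean_delay = (self.alpha * self.mean_delay
                               + (1 - self.alpha) * transit_delay_ms)
            self.variation = (self.alpha * self.variation
                              + (1 - self.alpha) * deviation)
        return self.mean_delay + self.safety * self.variation


def target_delays(transit_delays_ms):
    """Return the sequence of target delays for a sequence of observations."""
    estimator = AdaptiveDelayEstimator()
    return [estimator.update(d) for d in transit_delays_ms]


steady = target_delays([50.0] * 50)
jittery = target_delays([50.0, 90.0] * 25)
recovered = target_delays([50.0, 90.0] * 25 + [50.0] * 100)
```

With these illustrative inputs, the target stays at the transit delay while the network is steady, grows when the delay alternates between 50 ms and 90 ms, and falls back once the jitter disappears.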
Since the audio playback device needs regular input, the buffer adjustment is anything but straightforward. If the buffering is reduced, the audio signal given to the playback device needs to be shortened to compensate for the shortened buffering; if the buffering is increased, a segment of audio signal needs to be inserted. An advanced solution to this problem is to utilize signal time scaling during active speech. In this approach, the buffer size is reduced by playing out the signal faster, so that frames are retrieved from the buffer more frequently, and increased by slowing down the playout, so that frames are retrieved less frequently.
The challenge in time scale modification during active signal content is to keep the perceived audio quality at a good enough level. Pitch-synchronous mechanisms, such as Pitch Synchronous Overlap-Add (PSOLA), are typically used to provide time scale modification with good voice quality at relatively low complexity. In practice this usually means either repeating or removing full pitch periods of the signal and ‘smoothing’ the point of discontinuity to hide the possible quality defects caused by the time scale modification. Pitch-synchronous methods provide good results when used with monophonic and quasi-periodic signals, such as speech.
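Removing or repeating one pitch period with smoothing at the joint can be sketched as follows. This is a simplified illustration rather than a full PSOLA implementation: the pitch period is assumed to be already known (pitch estimation is not shown), the smoothing is a plain linear cross-fade, and the function names `crossfade`, `remove_period`, and `repeat_period` are hypothetical.

```python
import math


def crossfade(fade_out, fade_in):
    """Linearly blend two equal-length segments, moving from the first to the second."""
    n = len(fade_out)
    return [(1 - i / n) * a + (i / n) * b
            for i, (a, b) in enumerate(zip(fade_out, fade_in))]


def _two_periods(signal, start, period):
    """Return the two consecutive pitch periods beginning at `start`."""
    if period <= 0 or start < 0 or start + 2 * period > len(signal):
        raise ValueError("need two full pitch periods inside the signal")
    first = signal[start:start + period]
    second = signal[start + period:start + 2 * period]
    return first, second


def remove_period(signal, start, period):
    """Shorten `signal` (a list of samples) by one pitch period."""
    first, second = _two_periods(signal, start, period)
    # Two periods are merged into one that starts like the first and ends like the second.
    return signal[:start] + crossfade(first, second) + signal[start + 2 * period:]


def repeat_period(signal, start, period):
    """Lengthen `signal` (a list of samples) by one pitch period."""
    first, second = _two_periods(signal, start, period)
    # An extra period is inserted that starts like the second and ends like the first.
    return signal[:start + period] + crossfade(second, first) + signal[start + period:]


PERIOD = 8  # samples per pitch period of the test tone (illustrative value)
tone = [math.sin(2 * math.pi * n / PERIOD) for n in range(64)]
shorter = remove_period(tone, 16, PERIOD)
longer = repeat_period(tone, 16, PERIOD)
```

For the strictly periodic test tone the two blended periods are identical, so the shortened and lengthened signals remain seamless copies of the same tone. Real speech is only quasi-periodic, which is why the cross-fade is needed to hide the joint.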
However, time scaling of multi-channel audio is problematic, since even a very small phase difference relative to other channels significantly affects the overall spatial image. Inter-channel time differences (ICTD) are crucial for perceptual spatial image reconstruction. Hence, all the channels need to be scaled in a synchronous manner. However, in general multi-channel conditions, e.g. in a teleconference having several speakers talking in different channels at the same time, no common pitch is available for all channels. Thus, time scaling a multi-channel teleconference audio signal on a channel-by-channel basis would lead to an unfortunate situation in which a listener would perceive an audio image where the voices of the conference participants jump from one place to another. In addition, the room effect with e.g. reflections makes the problem even more difficult. Thus, the lack of feasible time-scaling methods for multi-channel audio hinders the deployment of many spatially encoded audio applications.
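The effect of channel-by-channel scaling on the ICTD can be made concrete with a small sketch. It rests on several simplifying assumptions: pseudo-random noise stands in for audio, the right channel is simply the left channel delayed by three samples, the time scaling is a plain cut of eight samples from the left channel only, and the ICTD is estimated by brute-force cross-correlation. The function `estimate_lag` and all sample counts are hypothetical.

```python
import random


def estimate_lag(left, right, max_lag):
    """Return the lag of `right` relative to `left` that maximizes their
    cross-correlation; a positive value means `right` is delayed."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, sample in enumerate(left):
            m = n + lag
            if 0 <= m < len(right):
                score += sample * right[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


rng = random.Random(0)                     # fixed seed keeps the example repeatable
source = [rng.uniform(-1.0, 1.0) for _ in range(400)]

DELAY = 3                                  # original ICTD in samples (illustrative)
CUT_AT, CUT_LEN = 100, 8                   # where and how much the left channel is shortened

left = source
right = [0.0] * DELAY + source[:-DELAY]    # right channel is the left delayed by DELAY

# Time scaling applied to the left channel only, as in channel-by-channel processing.
left_scaled = left[:CUT_AT] + left[CUT_AT + CUT_LEN:]

ictd_before_cut = estimate_lag(left_scaled[:CUT_AT], right[:CUT_AT], 20)
ictd_after_cut = estimate_lag(left_scaled[200:], right[200:], 20)
```

Before the cut the estimated time difference equals the original delay; after the cut it grows by the number of removed samples, which a listener would perceive as the source moving in the spatial image.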