Time-scaling of an audio signal may be employed, for example, in an audio receiver that is suited to receive encoded audio signals in packets via a packet switched network, such as the Internet, to decode the encoded audio signals and to play back the decoded audio signal to a user.
The nature of packet switched communications typically introduces variations in the transmission times of the packets, known as jitter, which is seen by the receiver as packets arriving at irregular intervals. In addition to packet loss conditions, network jitter is a major hurdle especially for conversational speech services that are provided by means of packet switched networks.
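As an illustrative sketch (not taken from the text above), the jitter seen by a receiver can be quantified from per-packet transit times, for instance with the smoothed interarrival jitter estimate used in RTP (RFC 3550), J = J + (|D| - J)/16. The function names and the millisecond time base below are assumptions for illustration only.

```python
def update_jitter(jitter, transit_prev, transit_curr):
    """One RFC 3550 style smoothing step: J = J + (|D| - J) / 16."""
    d = abs(transit_curr - transit_prev)
    return jitter + (d - jitter) / 16.0

def interarrival_jitter(send_times, arrival_times):
    """Running jitter estimate over a packet stream (times in ms).

    Transit time = arrival time - send time; the estimate reacts to
    *variation* in transit times, not to the absolute delay.
    """
    jitter = 0.0
    transits = [a - s for s, a in zip(send_times, arrival_times)]
    for prev, curr in zip(transits, transits[1:]):
        jitter = update_jitter(jitter, prev, curr)
    return jitter
```

With a perfectly constant transit delay the estimate stays at zero; any arrival-time irregularity drives it above zero, which is what a receiver would monitor.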
More specifically, an audio playback component of an audio receiver operating in real-time requires a constant input to maintain a good sound quality. Even short interruptions should be prevented. Thus, if some packets comprising audio frames arrive only after the audio frames are needed for decoding and further processing, those packets and the included audio frames are considered as lost. The audio decoder will perform error concealment to compensate for the audio signal carried in the lost frames. Extensive error concealment will, however, also reduce the sound quality.
Typically, a jitter buffer is therefore utilized to hide the irregular packet arrival times and to provide a continuous input to the decoder and a subsequent audio playback component. To this end, the jitter buffer stores incoming audio frames for a predetermined amount of time. This time may be specified for instance upon reception of the first packet of a packet stream. A jitter buffer introduces, however, an additional delay component, since the received packets are stored before further processing. This increases the end-to-end delay. A jitter buffer can be characterized by the average buffering delay and the resulting proportion of delayed frames among all received frames.
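The two characterizing figures mentioned above can be sketched as follows, assuming each frame has a known arrival time and a playout deadline (the function and parameter names are hypothetical):

```python
def buffer_metrics(arrival_times, deadlines):
    """Characterize a jitter buffer over a stream of frames (times in ms).

    A frame that arrives after its playout deadline counts as delayed
    (effectively lost for playback); an on-time frame waits in the
    buffer for (deadline - arrival) ms.

    Returns (average buffering delay of on-time frames,
             proportion of delayed frames among all received frames).
    """
    waits = [d - a for a, d in zip(arrival_times, deadlines) if a <= d]
    late = sum(1 for a, d in zip(arrival_times, deadlines) if a > d)
    avg_wait = sum(waits) / len(waits) if waits else 0.0
    return avg_wait, late / len(arrival_times)
```

Increasing every deadline (a longer buffering delay) raises the average wait but lowers the delayed-frame proportion, which is exactly the tradeoff discussed next.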
A jitter buffer using a fixed delay is inevitably a compromise between a low end-to-end delay and a low number of delayed frames, and finding an optimal tradeoff is not an easy task. Although there can be special environments and applications where the amount of expected jitter can be estimated to remain within predetermined limits, in general the jitter can vary from zero to hundreds of milliseconds, even within the same session. Using a fixed delay that is set to a sufficiently large value to cover the jitter according to an expected worst case scenario would keep the number of delayed frames under control, but at the same time there is a risk of introducing an end-to-end delay that is too long to enable a natural conversation. Therefore, applying a fixed buffering is not the optimal choice in most audio transmission applications operating over a packet switched network.
An adaptive jitter buffer can be used for dynamically controlling the balance between a sufficiently short delay and a sufficiently low number of delayed frames. In this approach, the incoming packet stream is monitored constantly, and the buffering delay is adjusted according to observed changes in the delay behavior of the incoming packet stream. If the transmission delay appears to increase or the jitter worsens, the buffering delay is increased to meet the network conditions. In the opposite situation, the buffering delay can be reduced, and hence the overall end-to-end delay is minimized.
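One simple way such an adaptation is often realized, sketched here under assumptions of my own (the percentile rule and the names are illustrative, not from the text): set the buffering delay so that it covers, say, the 95th percentile of recently observed transit-delay variation, plus a small safety margin.

```python
def target_delay(recent_transits, percentile=0.95, margin_ms=5.0):
    """Pick a buffering delay (ms) from recently observed transit times.

    Delays are taken relative to the fastest recent packet, so the
    target covers the observed jitter, not the absolute network delay.
    """
    s = sorted(recent_transits)
    idx = min(int(percentile * len(s)), len(s) - 1)
    return (s[idx] - s[0]) + margin_ms
```

Recomputing this target over a sliding window makes the buffer grow when jitter worsens and shrink back toward the margin when the network calms down.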
Since the audio playback component needs a regular input, however, the buffer adjustment is not completely straightforward. A problem arises from the fact that if the buffering delay is reduced, the audio signal provided to the playback component needs to be shortened to compensate for the shortened buffering delay; conversely, if the buffering delay is increased, the audio signal has to be lengthened to compensate for the increased buffering delay.
For Voice over IP (VoIP) applications, it is known to modify the signal in case of an increasing or decreasing buffer delay by discarding or repeating a part of the comfort noise signal between periods of active speech when discontinuous transmission (DTX) is enabled. However, such an approach is not always possible. For example, the DTX functionality might not be employed, or the DTX might not switch to comfort noise due to challenging background noise conditions, such as an interfering talker in the background.
In a more advanced solution taking care of a changing buffer delay, a signal time scaling is employed to change the length of the output audio frames that are forwarded to the playback component. The signal time scaling can be realized either inside the decoder or in a post-processing unit after the decoder. In this approach, the frames in the jitter buffer are read more frequently by the decoder when decreasing the delay than during normal operation, while an increasing delay slows down the frame output rate from the jitter buffer.
In an audio receiver that is equipped with an adaptive jitter buffer and a time scaling functionality, the network status and the buffer status are monitored constantly. Based on the status of the buffer and the network, time scale modifications are performed on an audio signal, either by adding or by removing segment(s) of the audio signal, to compensate for any change in the buffer delay.
The challenge in performing time scale modifications in active parts of the audio signal is to keep the perceived audio quality at a sufficiently high level.