Time-scaling algorithms change the duration of an audio signal while retaining the signal's local frequency content. The overall effect is to speed up or slow down the perceived playback rate of a recorded audio signal without affecting its pitch or timbre. In other words, the duration of the original signal is increased or decreased, but its perceptually important features remain unchanged: for speech, the time-scaled signal sounds as if the original speaker had spoken at a quicker or slower rate; for music, it sounds as if the musicians had played at a different tempo. Time-scaling algorithms can be used for adaptive jitter buffer management (JBM) in VoIP applications, for audio/video broadcast, for audio/video post-production synchronization, and for multi-track audio recording and mixing.
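As an illustration of changing duration without resampling, the following is a minimal sketch of plain overlap-add (OLA) time-scaling. It is an assumption chosen for brevity, not a method described in this text: the function name `ola_time_scale`, the Hann window and the frame length are all hypothetical, and plain OLA does not align waveforms between frames, so it can produce audible artifacts that more refined algorithms avoid.

```python
import math


def ola_time_scale(x, alpha, frame_len=256):
    """Stretch (alpha > 1) or compress (alpha < 1) x by plain overlap-add.

    alpha is the ratio of output duration to input duration (hypothetical
    parameter name). Frames are read from the input every hop_in samples
    and written to the output every hop_out samples, so the local content
    of each frame is kept while the overall duration changes.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    hop_out = frame_len // 2      # synthesis hop: 50% overlap
    hop_in = hop_out / alpha      # analysis hop (may be fractional)
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_len)
              for n in range(frame_len)]
    out_len = int(round(len(x) * alpha))
    y = [0.0] * (out_len + frame_len)
    norm = [0.0] * (out_len + frame_len)
    k = 0
    while True:
        start_in = int(round(k * hop_in))
        start_out = k * hop_out
        if start_in + frame_len > len(x) or start_out >= out_len:
            break
        for n in range(frame_len):
            y[start_out + n] += window[n] * x[start_in + n]
            norm[start_out + n] += window[n]
        k += 1
    # Divide by the summed window; samples that no frame reached stay at zero.
    return [y[i] / norm[i] if norm[i] > 1e-9 else 0.0 for i in range(out_len)]
```

For example, `ola_time_scale(x, 1.5)` returns a signal 1.5 times as long as `x`, and `ola_time_scale(x, 0.5)` one half as long, with each output frame copied from the input at its original sample rate.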
In voice over IP applications, the speech signal is first compressed using a speech encoder. To maintain interoperability, voice over IP systems are usually built on top of open speech codecs. Such codecs can be standardized, for instance by ITU-T or 3GPP (several standardized speech codecs are used for VoIP: G.711, G.722, G.729, G.723.1, AMR-WB), or have a proprietary format (Speex, Silk, CELT). The encoded speech signal is then packetized and transmitted in IP packets.
In VoIP, packets encounter variable network delays and therefore arrive at irregular intervals. To smooth out this jitter, a jitter buffer management mechanism is usually required at the receiver: received packets are buffered for a while and played out sequentially at scheduled times. If the play-out time can be adjusted for each packet, time-scale modification may be required to ensure continuous play-out of voice data at the sound card.
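The buffering and scheduling described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a specified mechanism: the function names `schedule_playout` and `stretch_factors`, the 20 ms frame duration and the 60 ms buffering delay are all hypothetical, and packets are assumed to be indexed by sequence number with none lost.

```python
def schedule_playout(arrivals_ms, frame_ms=20, buffer_delay_ms=60):
    """Fixed play-out schedule for packets indexed by sequence number.

    arrivals_ms[seq] is the arrival time of packet seq. The first packet is
    held for buffer_delay_ms; each later packet is scheduled one frame
    after the previous one. Returns a list of (playout_ms, on_time) pairs,
    where on_time is False if the packet arrived after its scheduled slot.
    """
    first_playout = arrivals_ms[0] + buffer_delay_ms
    schedule = []
    for seq, arrival in enumerate(arrivals_ms):
        playout = first_playout + seq * frame_ms
        schedule.append((playout, arrival <= playout))
    return schedule


def stretch_factors(playout_times_ms, frame_ms=20):
    """Time-scale factor each frame needs when play-out times are adjusted.

    If consecutive play-out times are no longer frame_ms apart, each frame
    must be stretched (factor > 1) or compressed (factor < 1) to fill the
    gap until the next frame, so that play-out stays continuous.
    """
    return [(nxt - cur) / frame_ms
            for cur, nxt in zip(playout_times_ms, playout_times_ms[1:])]
```

With arrivals at 0, 25, 38, 130 and 82 ms, the fixed schedule places the packets at 60, 80, 100, 120 and 140 ms, so the fourth packet misses its slot. If the receiver instead adjusts the play-out times to 60, 80, 105 and 120 ms, the first three frames need factors of 1.0, 1.25 and 0.75, which is where time-scale modification comes in.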
Because the delay is not constant, time-scaling algorithms are used to stretch or compress the duration of a given received packet. In multi-channel VoIP applications that include a jitter buffer management mechanism, applying the time-scaling algorithm independently to each channel can degrade quality, especially that of the spatial sound image, because independent time-scaling does not guarantee that the spatial cues are preserved. This is particularly the case when the multi-channel audio codec is based on a mono codec operating in dual/multi mono mode, i.e. one mono encoder/decoder is used for each channel. In audio/video broadcast and post-production applications, time-scaling each channel separately may maintain the synchronization between video and audio, but cannot guarantee that the spatial cues are the same as the original ones. The most important cues for spatial perception are the energy differences between channels, the time or phase differences between channels, and the coherence or correlation between channels. Because time-scaling algorithms stretch and compress the audio signal, the energy differences, delays and coherence between the time-scaled channels may differ from the original ones.
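To make the three spatial cues concrete, the following sketch estimates them for one pair of channels. It is only an illustration under stated assumptions: the function name `spatial_cues`, the full-band time-domain formulation and the `max_lag` search range are hypothetical choices, whereas practical systems often compute such cues per frequency band.

```python
import math


def spatial_cues(left, right, max_lag=32):
    """Estimate three inter-channel cues for one block of two channels.

    Returns (level_diff_db, time_diff_samples, coherence):
      level_diff_db     energy of left relative to right, in dB
      time_diff_samples lag maximizing the cross-correlation; a positive
                        value means right is delayed relative to left
      coherence         magnitude of the normalized cross-correlation at
                        that lag, between 0 and 1
    """
    n = min(len(left), len(right))
    e_l = sum(v * v for v in left[:n])
    e_r = sum(v * v for v in right[:n])
    if e_l == 0.0 or e_r == 0.0:
        raise ValueError("both channels must have non-zero energy")
    level_diff_db = 10.0 * math.log10(e_l / e_r)
    best_lag, best_corr = 0, -1.0
    for lag in range(-max_lag, max_lag + 1):
        acc = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                acc += left[i] * right[j]
        corr = abs(acc) / math.sqrt(e_l * e_r)
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return level_diff_db, best_lag, best_corr
```

Computing these values on the channels before and after time-scaling gives a simple check of whether the spatial image has been preserved: if each channel is scaled independently, the estimated time difference and coherence can drift from their original values.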