Several ways are known in the art to encode audio and video content. Generally, of course, the aim is to encode the content in a bit-saving manner without degrading the reconstruction quality of the signal.
Recently, new approaches to encode audio and video content have been developed, amongst which transform-based perceptual audio coding achieves the largest coding gain for stationary signals, that is, when large transform sizes can be applied (see for example T. Painter and A. Spanias: “Perceptual coding of digital audio”, Proceedings of the IEEE, Vol. 88, No. 4, April 2000, pages 451-513). Stationary parts of audio are often well modelled by a fixed finite number of stationary sinusoids. Once the transform size is large enough to resolve those components, a fixed number of bits is required for a given distortion target. By further increasing the transform size, larger and larger segments of the audio signal will be described without increasing the bit demand. For non-stationary signals, however, it becomes necessary to reduce the transform size and thus the coding gain will decrease rapidly. To overcome this problem, for abrupt changes and transient events, transform size switching can be applied without significantly increasing the mean coding cost. That is, when a transient event is detected, the block size (frame size) of the samples to be encoded together is decreased. For more persistently transient signals, the bit rate will of course increase dramatically.
A particularly interesting example of persistent transient behaviour is the pitch variation of locally harmonic signals, which is encountered mainly in the voiced parts of speech and singing, but can also originate from the vibratos and glissandos of some musical instruments. For a harmonic signal, i.e. a signal having signal peaks distributed with equal spacing along the time axis, the term pitch describes the inverse of the time between adjacent peaks of the signal. Such a signal therefore has a perfect harmonic spectrum, consisting of a base frequency equal to the pitch and higher order harmonics. In more general terms, pitch can be defined as the inverse of the time between two neighbouring corresponding signal portions within a locally harmonic signal. However, if the pitch and thus the base frequency varies with time, as is the case in voiced sounds, the spectrum will become more and more complex and thus more inefficient to encode.
A parameter closely related to the pitch of a signal is the warp of the signal. Assuming that the signal at time t has pitch equal to p(t) and that this pitch value varies smoothly over time, the warp of the signal at time t is defined by the logarithmic derivative
$$a(t) = \frac{p'(t)}{p(t)}.$$
For a harmonic signal, this definition of warp is insensitive to the particular choice of the harmonic component and to systematic errors in terms of multiples or fractions of the pitch. The warp measures a change of frequency in the logarithmic domain. The natural unit for warp is Hertz [Hz], but in musical terms, a signal with constant warp a(t)=a0 is a sweep with a sweep rate of a0/log 2 octaves per second [oct/s]. Speech signals exhibit warps of up to 10 oct/s and a mean warp of around 2 oct/s.
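The definition and the unit conversion can be illustrated numerically. The following is a minimal sketch, not part of any cited method: the function names and the pitch contour (100 Hz rising at 2 oct/s) are hypothetical, and the derivative is approximated by a central difference of log p(t).

```python
import math

def warp(pitch, t, dt=1e-6):
    """Estimate a(t) = p'(t) / p(t) as a central difference of log p(t)."""
    return (math.log(pitch(t + dt)) - math.log(pitch(t - dt))) / (2.0 * dt)

def octaves_per_second(a):
    """Convert a warp value to a sweep rate in oct/s (a / log 2)."""
    return a / math.log(2.0)

def sweep(t, p0=100.0, rate=2.0):
    """Hypothetical pitch contour: p0 Hz at t = 0, rising `rate` oct/s."""
    return p0 * 2.0 ** (rate * t)

def third_harmonic(t):
    """The same contour, mistaken for its third harmonic."""
    return 3.0 * sweep(t)

a0 = warp(sweep, 0.5)
print(a0)                                             # close to 2 * log 2
print(octaves_per_second(a0))                         # close to 2 oct/s
print(octaves_per_second(warp(third_harmonic, 0.5)))  # same rate again
```

Because the constant factor 3 cancels in p'(t)/p(t), the third harmonic yields the same warp as the fundamental, which is the insensitivity to harmonic choice noted above.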
Because the typical frame length (block length) of transform coders is so large that the relative pitch change within a frame is significant, warps or pitch variations of that size lead to a scrambling of the frequency analysis of those coders. Since, for a required constant bit rate, this can only be overcome by increasing the coarseness of quantization, this effect leads to the introduction of quantization noise, which is often perceived as reverberation.
One possible technique to overcome this problem is time warping. The concept of time-warped coding is best explained by imagining a tape recorder with variable speed. When recording the audio signal, the speed is adjusted dynamically so as to achieve constant pitch over all voiced segments. The resulting locally stationary audio signal is encoded together with the applied tape speed changes. In the decoder, playback is then performed with the opposite speed changes. However, applying the simple time warping as described above has some significant drawbacks. First of all, the absolute tape speed ends up being uncontrollable, leading to a violation of the duration of the entire encoded signal and of bandwidth limitations. Secondly, for reconstruction, additional side information on the tape speed (or equivalently on the signal pitch) has to be transmitted, introducing a substantial bit-rate overhead, especially at low bit-rates.
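The tape recorder picture and its duration problem can be sketched for one analytically tractable case. The example below is illustrative only: it assumes a pitch contour known in closed form, p(t) = P0 * exp(A * t), and all names and constants are hypothetical. The tape speed is taken to be p(t) / P_REF, so that the recorded signal has constant pitch P_REF.

```python
import math

P0 = 100.0                 # assumed start pitch in Hz
A = 2.0 * math.log(2.0)    # assumed constant warp, equal to 2 oct/s
P_REF = 100.0              # assumed target pitch after warping, in Hz
DURATION = 1.0             # length of the original signal in seconds

def cycles(t):
    """Cycles elapsed up to time t for pitch p(t) = P0 * exp(A * t)."""
    return P0 * (math.exp(A * t) - 1.0) / A

def signal(t):
    """Original sweep with time-varying pitch."""
    return math.sin(2.0 * math.pi * cycles(t))

def record_position(t):
    """Tape position tau(t); the tape speed d(tau)/dt is p(t) / P_REF."""
    return cycles(t) / P_REF

def playback_time(tau):
    """Inverse map t(tau), applied by the decoder to undo the warp."""
    return math.log(1.0 + A * P_REF * tau / P0) / A

def warped_signal(tau):
    """Signal as stored on the tape: a sinusoid of constant pitch P_REF."""
    return signal(playback_time(tau))

warped_duration = record_position(DURATION)
print(warped_duration)     # exceeds DURATION: the duration is not preserved
```

In this sketch the warped signal is a pure tone at P_REF, but one second of input occupies more than two seconds of tape, and the decoder needs the map `playback_time` (that is, the pitch contour) as side information.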
The common approach of prior art methods to overcome the problem of uncontrollable duration of time-warped signals is to process consecutive non-overlapping segments, i.e. individual frames, of the signal independently by a time warp, such that the duration of each segment is preserved. This approach is for example described in Yang et al., “Pitch synchronous modulated lapped transform of the linear prediction residual of speech”, Proceedings of ICSP '98, pages 591-594. A great disadvantage of such a procedure is that although the processed signal is stationary within segments, the pitch will exhibit jumps at each segment boundary. Those jumps will evidently lead to a loss of coding efficiency of the subsequent audio coder, and audible discontinuities are introduced in the decoded signal.
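The origin of these pitch jumps can be shown with a small calculation. The sketch below is a simplified illustration, not the method of the cited reference: it assumes a sweep with pitch p(t) = P0 * exp(A * t) and hypothetical constants. A time warp only remaps time, so the number of cycles in a segment is unchanged; if the segment duration is also preserved, the constant pitch after warping must equal the cycles in the segment divided by its length.

```python
import math

P0 = 100.0               # assumed start pitch in Hz
A = 2.0 * math.log(2.0)  # assumed constant warp, equal to 2 oct/s
FRAME = 0.25             # assumed segment length in seconds

def cycles(t):
    """Cycles elapsed up to time t for pitch p(t) = P0 * exp(A * t)."""
    return P0 * (math.exp(A * t) - 1.0) / A

def segment_pitch(k):
    """Constant pitch of segment k after a duration-preserving warp."""
    start, end = k * FRAME, (k + 1) * FRAME
    return (cycles(end) - cycles(start)) / FRAME

pitches = [segment_pitch(k) for k in range(4)]
jumps = [b / a for a, b in zip(pitches, pitches[1:])]
print(pitches)           # one constant pitch per segment
print(jumps)             # pitch ratio across each segment boundary
```

Under these assumptions every boundary shows a pitch ratio of exp(A * FRAME), i.e. a jump of half an octave, even though the original pitch varies smoothly.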
Time warping is also implemented in several other coding schemes. For example, US-2002/0120445 describes a scheme in which signal segments are subject to slight modifications in duration prior to block-based transform coding. This is done to avoid large signal components at the boundaries of the blocks, at the cost of slight variations in the duration of individual segments.
Another technique making use of time warping is described in U.S. Pat. No. 6,169,970, where time warping is applied in order to boost the performance of the long-term predictor of a speech encoder. Along the same lines, US 2005/0131681 describes a pre-processing unit for CELP coding of speech signals which applies a piecewise linear warp between non-overlapping intervals, each containing one whitened pitch pulse. Finally, R. J. Sluijter and A. J. E. M. Janssen, “A time warper for speech signals”, IEEE workshop on Speech Coding'99, June 1999, pages 150-152, describe how to improve speech pitch estimation by applying a quadratic time warping function to a speech frame.
Summarizing, prior art warping techniques share the problems of introducing discontinuities at frame borders and of requiring a significant amount of additional bit rate for the transmission of the parameters describing the pitch variation of the signal.