The present invention is related to audio encoding and decoding and specifically for encoding/decoding of audio signal having a harmonic or speech content, which can be subjected to a time warp processing.
In the following, a brief introduction will be given into the field of time warped audio encoding, concepts of which can be applied in conjunction with some of the embodiments of the invention.
In the recent years, techniques have been developed to transform an audio signal into a frequency domain representation, and to efficiently encode this frequency domain representation, for example taking into account perceptual masking thresholds. This concept of audio signal encoding is particularly efficient if the block length, for which a set of encoded spectral coefficients are transmitted, are long, and if only a comparatively small number of spectral coefficients are well above the global masking threshold while a large number of spectral coefficients are nearby or below the global masking threshold and can thus be neglected (or coded with minimum code length).
For example, cosine-based or sine-based modulated lapped transforms are often used in applications for source coding due to their energy compaction properties. That is, for harmonic tones with constant fundamental frequencies (pitch), they concentrate the signal energy to a low number of spectral components (sub-bands), which leads to an efficient signal representation.
Generally, the (fundamental) pitch of a signal shall be understood to be the lowest dominant frequency distinguishable from the spectrum of the signal. In the common speech model, the pitch is the frequency of the excitation signal modulated by the human throat. If only one single fundamental frequency would be present, the spectrum would be extremely simple, comprising the fundamental frequency and the overtones only. Such a spectrum could be encoded highly efficiently. For signals with varying pitch, however, the energy corresponding to each harmonic component is spread over several transform coefficients, thus leading to a reduction of coding efficiency.
In order to overcome this reduction of coding efficiency, the audio signal to be encoded is effectively resampled on a non-uniform temporal grid. In the subsequent processing, the sample positions obtained by the non-uniform resampling are processed as if they would represent values on a uniform temporal grid. This operation is commonly denoted by the phrase ‘time warping’. The sample times may be advantageously chosen in dependence on the temporal variation of the pitch, such that a pitch variation in the time warped version of the audio signal is smaller than a pitch variation in the original version of the audio signal (before time warping). This pitch variation may also be denoted with the phrase “time warp contour”. After time warping of the audio signal, the time warped version of the audio signal is converted into the frequency domain. The pitch-dependent time warping has the effect that the frequency domain representation of the time warped audio signal typically exhibits an energy compaction into a much smaller number of spectral components than a frequency domain representation of the original (non time warped) audio signal.
At the decoder side, the frequency-domain representation of the time warped audio signal is converted back to the time domain, such that a time-domain representation of the time warped audio signal is available at the decoder side. However, in the time-domain representation of the decoder-sided reconstructed time warped audio signal, the original pitch variations of the encoder-sided input audio signal are not included. Accordingly, yet another time warping by resampling of the decoder-sided reconstructed time domain representation of the time warped audio signal is applied. In order to obtain a good reconstruction of the encoder-sided input audio signal at the decoder, it is desirable that the decoder-sided time warping is at least approximately the inverse operation with respect to the encoder-sided time warping. In order to obtain an appropriate time warping, it is desirable to have an information available at the decoder which allows for an adjustment of the decoder-sided time warping.
As it is typically needed to transfer such an information from the audio signal encoder to the audio signal decoder, it is desirable to keep a bit rate needed for this transmission small while still allowing for a reliable reconstruction of the needed time warp information at the decoder side.
In view of the above discussion, there is a desire to create a concept which allows for a bitrate efficient application of the time warp concept in an audio encoder.