Embodiments according to the invention are related to an audio signal decoder. Further embodiments according to the invention are related to an audio signal encoder. Further embodiments according to the invention are related to an encoded multi-channel audio signal representation. Further embodiments according to the invention are related to a method for providing a decoded multi-channel audio signal representation, to a method for providing an encoded representation of a multi-channel audio signal, and to a computer program for implementing said methods.
Some embodiments according to the invention are related to methods for a time warped MDCT transform coder.
In the following, a brief introduction will be given into the field of time warped audio encoding, concepts of which can be applied in conjunction with some of the embodiments of the invention.
In the recent years, techniques have been developed to transform an audio signal into a frequency domain representation, and to efficiently encode this frequency domain representation, for example taking into account perceptual masking thresholds. This concept of audio signal encoding is particularly efficient if the block lengths, for which a set of encoded spectral coefficients are transmitted, are long, and if only a comparatively small number of spectral coefficients are well above the global masking threshold while a large number of spectral coefficients are nearby or below the global masking threshold and can thus be neglected (or coded with minimum code length).
For example, cosine-based or sine-based modulated lapped transforms are often used in applications for source coding due to their energy compaction properties. That is, for harmonic tones with constant fundamental frequencies (pitch), they concentrate the signal energy to a low number of spectral components (sub-bands), which leads to an efficient signal representation.
Generally, the (fundamental) pitch of a signal shall be understood to be the lowest dominant frequency distinguishable from the spectrum of the signal. In the common speech model, the pitch is the frequency of the excitation signal modulated by the human throat. If only one single fundamental frequency would be present, the spectrum would be extremely simple, comprising the fundamental frequency and the overtones only. Such a spectrum could be encoded highly efficiently. For signals with varying pitch, however, the energy corresponding to each harmonic component is spread over several transform coefficients, thus leading to a reduction of coding efficiency.
In order to overcome this reduction of the coding efficiency, the audio signal to be encoded is effectively resampled on a non-uniform temporal grid. In the subsequent processing, the sample positions obtained by the non-uniform resampling are processed as if they would represent values on a uniform temporal grid. This operation is commonly denoted by the phrase “time warping”. The sample times may be advantageously chosen in dependence on the temporal variation of the pitch, such that a pitch variation in the time warped version of the audio signal is smaller than a pitch variation in the original version of the audio signal (before time warping). After time warping of the audio signal, the time warped version of the audio signal is converted into the frequency domain. The pitch-dependent time warping has the effect that the frequency domain representation of the time warped audio signal is typically concentrated into a much smaller number of spectral components than a frequency domain representation of the original (non time warped) audio signal.
At the decoder side, the frequency-domain representation of the time warped audio signal is converted back to the time domain, such that a time-domain representation of the time warped audio signal is available at the decoder side. However, in the time-domain representation of the decoder-sided reconstructed time warped audio signal, the original pitch variations of the encoder-sided input audio signal are not included. Accordingly, yet another time warping by resampling of the decoder-sided reconstructed time domain representation of the time warped audio signal is applied. In order to obtain a good reconstruction of the encoder-sided input audio signal at the decoder, it is desirable that the decoder-sided time warping is at least approximately the inverse operation with respect to the encoder-sided time warping. In order to obtain an appropriate time warping, it is desirable to have an information available at the decoder which allows for an adjustment of the decoder-sided time warping.
As it is typically necessitated to transfer such an information from the audio signal encoder to the audio signal decoder, it is desirable to keep a bit rate needed for this transmission small while still allowing for a reliable reconstruction of the necessitated time warp information at the decoder side.
In view of the above discussion, there is a desire to have a concept which allows for a bit-rate-efficient storage and/or transmission of a multi-channel audio signal.