The present invention is in the field of audio encoding/decoding, especially of audio coding concepts utilizing multiple encoding domains.
In the art, frequency domain coding schemes such as MP3 or AAC are known. These frequency-domain encoders are based on a time-domain/frequency-domain conversion, a subsequent quantization stage, in which the quantization error is controlled using information from a psychoacoustic module, and an encoding stage, in which the quantized spectral coefficients and corresponding side information are entropy-encoded using code tables.
On the other hand, there are encoders that are very well suited to speech processing, such as the AMR-WB+ as described in 3GPP TS 26.290. Such speech coding schemes perform an LP (LP=Linear Predictive) filtering of a time-domain signal. The LP filter is derived from a linear prediction analysis of the input time-domain signal. The resulting LP filter coefficients are then quantized/coded and transmitted as side information. The process is known as LPC (LPC=Linear Prediction Coding). At the output of the filter, the prediction residual signal or prediction error signal, which is also known as the excitation signal, is encoded using the analysis-by-synthesis stages of the ACELP encoder or, alternatively, is encoded using a transform encoder, which uses a Fourier transform with an overlap. The decision between the ACELP coding and the Transform Coded eXcitation coding, which is also called TCX coding, is made using a closed loop or an open loop algorithm.
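The linear prediction analysis and the computation of the prediction residual described above may be illustrated by the following minimal sketch. It is not taken from 3GPP TS 26.290. The autocorrelation method with the Levinson-Durbin recursion is assumed here as one common way of deriving the LP filter, and the function names autocorrelation, levinson_durbin and lp_residual are hypothetical. Quantization of the coefficients and the ACELP/TCX encoding of the excitation are omitted.

```python
import math


def autocorrelation(x, max_lag):
    """Return r[0..max_lag] with r[k] = sum over n of x[n] * x[n - k]."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations by the Levinson-Durbin recursion.

    The predictor is x_hat[n] = sum over k of a[k] * x[n - 1 - k].
    Returns (a, error_energy). Assumes r[0] > 0 (non-silent input).
    """
    a = [0.0] * (order + 1)          # a[0] is unused, a[1..order] are taps
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient of stage i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)         # remaining prediction error energy
    return a[1:], err


def lp_residual(x, coeffs):
    """Prediction error e[n] = x[n] - x_hat[n], with zero initial history."""
    p = len(coeffs)
    out = []
    for n in range(len(x)):
        pred = sum(coeffs[k] * x[n - 1 - k]
                   for k in range(p) if n - 1 - k >= 0)
        out.append(x[n] - pred)
    return out


if __name__ == "__main__":
    # Illustrative input only: two superimposed sinusoids.
    signal = [math.sin(0.2 * n) + 0.5 * math.sin(0.5 * n) for n in range(200)]
    lp_coeffs, _ = levinson_durbin(autocorrelation(signal, 4), 4)
    excitation = lp_residual(signal, lp_coeffs)
    print("signal energy  :", sum(v * v for v in signal))
    print("residual energy:", sum(v * v for v in excitation))
```

In this sketch the residual returned by lp_residual plays the role of the excitation signal, and the coefficients returned by levinson_durbin correspond to the side information that would be quantized and transmitted.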
Frequency-domain audio coding schemes such as the High-Efficiency AAC encoding scheme, which combines an AAC coding scheme and a spectral band replication technique, can also be combined with a joint stereo or a multi-channel coding tool which is known under the term “MPEG Surround”.
On the other hand, speech encoders such as the AMR-WB+ also have a high frequency enhancement stage and a stereo functionality.
Frequency-domain coding schemes are advantageous in that they show a high quality at low bitrates for music signals. Problematic, however, is the quality of speech signals at low bitrates. Speech coding schemes show a high quality for speech signals even at low bitrates, but show a poor quality for music signals at low bitrates.
Frequency-domain coding schemes often make use of the so-called MDCT (MDCT=Modified Discrete Cosine Transform). The MDCT was first described in J. Princen, A. Bradley, “Analysis/Synthesis Filter Bank Design Based on Time Domain Aliasing Cancellation”, IEEE Trans. ASSP, ASSP-34(5):1153-1161, 1986. The MDCT or MDCT filter bank is widely used in modern and efficient audio coders. This kind of signal processing provides the following advantages:
Smooth cross-fade between processing blocks: Even if the signal in each processing block is altered differently (e.g. due to quantization of spectral coefficients), no blocking artifacts due to abrupt transitions from block to block occur because of the windowed overlap/add operation.
Critical sampling: The number of spectral values at the output of the filter bank is equal to the number of time domain input values at its input, and no additional overhead values have to be transmitted.
The MDCT filter bank provides a high frequency selectivity and coding gain.
These properties are achieved by utilizing the technique of time domain aliasing cancellation. The time domain aliasing cancellation is done at the synthesis by overlap-adding two adjacent windowed signals. If no quantization is applied between the analysis and the synthesis stages of the MDCT, a perfect reconstruction of the original signal is obtained. However, the MDCT is used for coding schemes that are specifically adapted for music signals. Such frequency-domain coding schemes have, as stated before, reduced quality at low bit rates for speech signals, while specifically adapted speech coders have a higher quality at comparable bit rates or even have significantly lower bit rates for the same quality compared to frequency-domain coding schemes.
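The time domain aliasing cancellation and the critical sampling property described above may be illustrated by the following minimal sketch. Several choices in it are assumptions made only for illustration and are not prescribed by the text: a sine window satisfying the Princen-Bradley condition, a frame advance of N samples with blocks of 2N samples, a scaling factor of 2/N in the inverse transform, and the hypothetical function names. No quantization is applied between analysis and synthesis.

```python
import math


def sine_window(N):
    """Sine window of length 2N; w[n]^2 + w[n + N]^2 = 1 (Princen-Bradley)."""
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]


def mdct(block, window):
    """Windowed MDCT: 2N time-domain samples in, N spectral values out."""
    N = len(block) // 2
    n0 = 0.5 + N / 2
    return [sum(window[n] * block[n]
                * math.cos(math.pi / N * (n + n0) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]


def imdct(coeffs, window):
    """Windowed inverse MDCT: N spectral values in, 2N aliased samples out."""
    N = len(coeffs)
    n0 = 0.5 + N / 2
    return [2.0 / N * window[n]
            * sum(coeffs[k] * math.cos(math.pi / N * (n + n0) * (k + 0.5))
                  for k in range(N))
            for n in range(2 * N)]


def analyze_synthesize(signal, N):
    """Run MDCT analysis and overlap-add synthesis without quantization.

    Returns the overlap-added output and the total number of spectral
    values produced. Only samples covered by two overlapping blocks are
    free of aliasing; the first and last N samples are not.
    """
    window = sine_window(N)
    out = [0.0] * len(signal)
    n_coeffs = 0
    for start in range(0, len(signal) - 2 * N + 1, N):
        coeffs = mdct(signal[start:start + 2 * N], window)
        n_coeffs += len(coeffs)
        for n, value in enumerate(imdct(coeffs, window)):
            out[start + n] += value   # aliasing cancels in the overlap
    return out, n_coeffs


if __name__ == "__main__":
    N = 8
    x = [math.sin(0.3 * n) + 0.1 * n for n in range(5 * N)]
    y, count = analyze_synthesize(x, N)
    worst = max(abs(y[n] - x[n]) for n in range(N, 4 * N))
    print("spectral values per frame advance of N samples:", count // 4)
    print("largest reconstruction error in the overlapped region:", worst)
```

Each individual inverse transform output contains time domain aliasing. The aliasing terms of two adjacent blocks have opposite signs and cancel in the overlap-add, and each frame advance of N input samples produces exactly N spectral values.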
Speech coding techniques such as the AMR-WB+ (AMR-WB+=Adaptive Multi-Rate WideBand extended) codec as defined in “Extended Adaptive Multi-Rate-Wideband (AMR-WB+) codec”, 3GPP TS 26.290 V6.3.0, 2005-06, Technical Specification, do not apply the MDCT and, therefore, cannot take any advantage of the excellent properties of the MDCT which, specifically, reside in a critically sampled processing on the one hand and a crossover from one block to the other on the other hand. Therefore, the crossover from one block to the other, which the MDCT provides without any penalty with respect to bit rate owing to its critical sampling property, has not yet been obtained in speech coders.
When speech coders and audio coders are combined within a single hybrid coding scheme, there remains the problem of how to obtain a switch-over from one coding mode to the other coding mode at a low bit rate and a high quality.
Conventional audio coding concepts are usually designed to be started at the beginning of an audio file or of a communication. Using these conventional concepts, filter structures, as for example prediction filters, reach a steady state at a certain time after the beginning of the encoding or decoding procedure. For a switched audio coding system, however, using for example transform based coding on the one hand, and speech coding according to a previous analysis of the input on the other hand, the respective filter structures are not actively and continuously updated. For example, speech coders may have to be restarted frequently within a short period of time. Once restarted, the start-up period begins again and the internal states are reset to zero. The duration needed by, for example, a speech coder to reach a steady state can be critical, especially for the quality of the transitions.
Conventional concepts as for example the AMR-WB+, cf. “Extended Adaptive Multi-Rate-Wideband (AMR-WB+) codec”, 3GPP TS 26.290 V6.3.0, 2005-06, Technical Specification, use a total reset of the speech coder when transitioning or switching between the transform based coder and the speech coder.
The AMR-WB+ is optimized under the condition that it starts only one time when the signal is faded in, supposing that there are no intermediate stops or resets. Hence, all the memories of the coder can be updated on a frame by frame basis. In case the AMR-WB+ is used in the middle of a signal, a reset has to be called, and all memories used on the encoding or decoding side are set to zero. Therefore, conventional concepts have the problem that an excessively long time elapses before the speech coder reaches a steady state, and that strong distortions are introduced in the non-steady phases.
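The effect of setting the filter memories to zero in the middle of a signal may be illustrated by the following minimal sketch. It is not the AMR-WB+ implementation. A generic all-pole synthesis filter is assumed, and the filter coefficients, the excitation and the function name synthesis_filter are hypothetical values chosen only for illustration.

```python
def synthesis_filter(excitation, coeffs, memory=None):
    """All-pole filter y[n] = e[n] + sum over k of coeffs[k] * y[n - 1 - k].

    `memory` holds the most recent past outputs, newest first.
    Passing None models a total reset: all memories are set to zero.
    Returns the output samples and the final memory.
    """
    p = len(coeffs)
    state = list(memory) if memory is not None else [0.0] * p
    out = []
    for e in excitation:
        y = e + sum(c * s for c, s in zip(coeffs, state))
        out.append(y)
        state = [y] + state[:-1]      # shift the newest output into memory
    return out, state


if __name__ == "__main__":
    coeffs = [1.6, -0.8]              # hypothetical stable second-order filter
    excitation = [1.0 if n % 10 == 0 else 0.0 for n in range(80)]

    # Reference: the filter runs continuously over the whole signal.
    reference, _ = synthesis_filter(excitation, coeffs)

    # Switched operation: the filter is entered in the middle of the signal.
    _, kept = synthesis_filter(excitation[:40], coeffs)
    updated, _ = synthesis_filter(excitation[40:], coeffs, memory=kept)
    restarted, _ = synthesis_filter(excitation[40:], coeffs)  # total reset

    for n in range(0, 40, 8):
        print(n,
              abs(updated[n] - reference[40 + n]),
              abs(restarted[n] - reference[40 + n]))
```

With the memories carried over, the second part of the output coincides with the continuously running reference. After a total reset, the output deviates from the reference, and the deviation only decays as the filter approaches its steady state again.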
Another disadvantage of conventional concepts is that they utilize long overlapping segments when switching coding domains, which introduces overhead and disadvantageously affects coding efficiency.