The present invention is in the field of coding in which different characteristics of the data to be encoded are exploited to reduce the coding rate, as for example in video and audio coding.
State of the art coding strategies can make use of characteristics of a data stream to be encoded. For example, in audio coding, perception models are used in order to compress source data with almost no noticeable degradation in quality when replayed. Modern perceptual audio coding schemes, such as, for example, MPEG-2/4 AAC (MPEG=Moving Pictures Expert Group, AAC=Advanced Audio Coding), cf. Generic Coding of Moving Pictures and Associated Audio: Advanced Audio Coding, International Standard 13818-7, ISO/IEC JTC1/SC29/WG11 Moving Pictures Expert Group, 1997, may use filter banks, such as, for example, the Modified Discrete Cosine Transform (MDCT), for representing the audio signal in the frequency domain.
In the frequency domain, quantization of frequency coefficients can be carried out according to a perceptual model. Such coders can provide excellent perceptual audio quality for general types of audio signals as, for example, music. On the other hand, modern speech coders, such as, for example, ACELP (ACELP=Algebraic Code Excited Linear Prediction), use a predictive approach, and in this way may represent the audio/speech signal in the time domain. Such speech coders can model the characteristics of the human speech production process, i.e. the human vocal tract, and consequently achieve excellent performance for speech signals at low bit rates. Conversely, perceptual audio coders do not achieve the level of performance offered by speech coders for speech signals coded at low bit rates, and using speech coders to represent general audio signals/music results in significant quality impairments.
Conventional concepts provide a layered combination in which all partial coders, i.e. time-domain and frequency-domain encoders, are active, and the final output signal is calculated by combining the contributions of the partial coders for a given processed time frame. A popular example of layered coding is MPEG-4 scalable speech/audio coding with a speech coder as the base layer and a filterbank-based enhancement layer, cf. Bernhard Grill, Karlheinz Brandenburg, “A Two- or Three-Stage Bit-Rate Scalable Audio Coding System”, Preprint Number 4132, 99th Convention of the AES (September 1995).
Conventional frequency-domain encoders can make use of MDCT filterbanks. The MDCT has become a dominant filterbank for conventional perceptual audio coders because of its advantageous properties. For example, it can provide a smooth cross-fade between processing blocks. Even if a signal in each processing block is altered differently, for example due to quantization of spectral coefficients, no blocking artifacts due to abrupt transitions from block to block occur because of the windowed overlap/add operations.
The MDCT uses the concept of time-domain aliasing cancellation (TDAC).
The MDCT is a Fourier-related transform based on the type-IV discrete cosine transform, with the additional property of being lapped. It is designed to be performed in consecutive blocks of a larger data set, where subsequent blocks are overlapped so that the last half of one block coincides with the first half of the next block. This overlapping, in addition to an energy-compaction quality of the DCT, makes the MDCT especially attractive for signal compression applications, since it helps to avoid said artifacts stemming from the block boundaries. As a lapped transform, the MDCT is a bit unusual compared to other Fourier-related transforms in that it has half as many outputs as inputs, instead of the same number. In particular, 2N real numbers are transformed into N real numbers, where N is a positive integer.
The inverse MDCT is also known as IMDCT. Because there are different numbers of inputs and outputs, at first glance it might seem that the MDCT should not be invertible. However, perfect invertibility is achieved by adding the overlapped IMDCTs of subsequent overlapping blocks, causing the errors to cancel and the original data to be retrieved, i.e. achieving TDAC.
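The mapping of 2N inputs to N outputs and the recovery of the original data by overlap-add can be illustrated with a minimal Python sketch. It assumes the common textbook MDCT definition, a sine window satisfying the Princen-Bradley condition and a 2/N scaling of the IMDCT. The function names (`sine_window`, `mdct`, `imdct`, `analysis_synthesis`) and the test signal are hypothetical and are not taken from any of the cited standards or codecs.

```python
import math


def sine_window(n_half):
    """Sine window of length 2N with w[n]^2 + w[n+N]^2 = 1 (assumed window choice)."""
    return [math.sin(math.pi * (n + 0.5) / (2 * n_half)) for n in range(2 * n_half)]


def mdct(block):
    """Map 2N real inputs to N real coefficients (direct O(N^2) evaluation)."""
    n_half = len(block) // 2
    return [
        sum(block[n] * math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
            for n in range(2 * n_half))
        for k in range(n_half)
    ]


def imdct(coeffs):
    """Map N coefficients back to 2N time-aliased samples (2/N scaling assumed)."""
    n_half = len(coeffs)
    return [
        2.0 / n_half * sum(
            coeffs[k] * math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
            for k in range(n_half))
        for n in range(2 * n_half)
    ]


def analysis_synthesis(signal, n_half):
    """Window, transform, inverse-transform and overlap-add blocks with hop N."""
    window = sine_window(n_half)
    pad = (-len(signal)) % n_half  # zero-pad to a whole number of hops
    padded = [0.0] * n_half + list(signal) + [0.0] * (pad + n_half)
    output = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * n_half + 1, n_half):
        block = [padded[start + n] * window[n] for n in range(2 * n_half)]
        rec = imdct(mdct(block))  # each block alone still contains aliasing
        for n in range(2 * n_half):
            output[start + n] += rec[n] * window[n]  # overlap-add cancels it
    return output[n_half:n_half + len(signal)]


# Arbitrary test signal (illustrative only).
signal = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(40)]
reconstructed = analysis_synthesis(signal, 8)
max_error = max(abs(a - b) for a, b in zip(signal, reconstructed))
```

In this sketch each block of 16 windowed samples yields only 8 coefficients, yet `max_error` stays at the level of floating-point rounding, because every sample is covered by two overlapping blocks whose aliasing terms cancel.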
As a result, the number of spectral values at the output of the filterbank is equal to the number of new time-domain input values at its input, a property also referred to as critical sampling.
An MDCT filterbank provides high frequency selectivity and enables a high coding gain. The properties of overlapping of blocks and critical sampling can be achieved by utilizing the technique of time-domain aliasing cancellation, cf. J. Princen, A. Bradley, “Analysis/Synthesis Filter Bank Design Based on Time Domain Aliasing Cancellation”, IEEE Trans. ASSP, ASSP-34(5):1153-1161, 1986. FIG. 4 illustrates these effects of an MDCT. FIG. 4 shows an MDCT input signal in the form of an impulse along a time axis 400 at the top. The input signal 400 is then transformed by two consecutive windowing and MDCT blocks, where the windows 410 are illustrated underneath the input signal 400 in FIG. 4. The back-transformed individual windowed signals are displayed in FIG. 4 by the time lines 420 and 425.
After the inverse MDCT, the first block produces an aliasing component with positive sign 420, and the second block produces an aliasing component with the same magnitude and a negative sign 425. The aliasing components cancel each other after addition of the two output signals 420 and 425, as shown in the final output 430 at the bottom of FIG. 4.
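The cancellation described for FIG. 4 can be reproduced numerically with the following Python sketch. It is only an illustration under stated assumptions: the block size N = 8, the impulse position, the sine window, the 2/N IMDCT scaling and all names (`basis`, `windowed_round_trip`, `alias_index`) are hypothetical choices and do not come from the figure. The signs of the two aliasing components depend on the phase convention of the transform; with the convention assumed here they happen to match the description above.

```python
import math

N = 8  # hop size: each block has 2N = 16 samples (illustrative value)
WINDOW = [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]


def basis(n, k):
    """MDCT cosine kernel for time index n and frequency index k."""
    return math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))


def windowed_round_trip(block):
    """Analysis window -> MDCT -> IMDCT (2/N scaling) -> synthesis window."""
    coeffs = [sum(WINDOW[n] * block[n] * basis(n, k) for n in range(2 * N))
              for k in range(N)]
    return [WINDOW[n] * 2.0 / N * sum(coeffs[k] * basis(n, k) for k in range(N))
            for n in range(2 * N)]


# Impulse at index 10, inside the region where the two blocks overlap (8..15).
signal = [0.0] * (3 * N)
signal[10] = 1.0

first = windowed_round_trip(signal[0:2 * N])   # block covering samples 0..15
second = windowed_round_trip(signal[N:3 * N])  # block covering samples 8..23

# Place both block outputs on the common time axis and add them.
out_first = first + [0.0] * N
out_second = [0.0] * N + second
total = [a + b for a, b in zip(out_first, out_second)]

alias_index = 13  # mirror image of index 10 within the overlap region
```

Here `out_first[alias_index]` and `out_second[alias_index]` have equal magnitude and opposite sign, so `total` contains only the original impulse at index 10.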
The AMR-WB+ (AMR-WB=Adaptive Multi-Rate Wideband) codec is specified in “Extended Adaptive Multi-Rate-Wideband (AMR-WB+) codec”, 3GPP TS 26.290V6.3.0, June 2005, Technical Specification. According to section 5.2, the encoding algorithm at the core of the AMR-WB+ codec is based on a hybrid ACELP/TCX (TCX=Transform Coded Excitation) model. For every block of an input signal, the encoder decides, either in an open loop or a closed loop mode, which encoding model, i.e. ACELP or TCX, is best. The ACELP model is a time-domain, predictive encoder, best suited for speech and transient signals. The AMR-WB encoder is used in ACELP modes. Alternatively, the TCX model is a transform-based encoder, and is more appropriate for typical music samples.
Specifically, the AMR-WB+ uses a discrete Fourier transform (DFT) for the transform coding mode TCX. In order to allow a smooth transition between adjacent blocks, windowing and overlap are used. This windowing and overlap is useful both for transitions between different coding modes (TCX/ACELP) and for consecutive TCX frames. Thus, the DFT together with the windowing and overlap represents a filterbank that is not critically sampled. The filterbank produces more frequency values than the number of new input samples, cf. FIG. 4 in 3GPP TS 26.290V6.3.0 (3GPP=Third Generation Partnership Project, TS=Technical Specification). Each TCX frame utilizes an overlap of ⅛ of the frame length, where the frame length equals the number of new input samples. Consequently, the corresponding length of the DFT is 9/8 of the frame length.
Considering the non-critically sampled DFT filterbank in the TCX, i.e. the number of spectral values at the output of the filterbank is larger than the number of time-domain input values at its input, this frequency domain coding mode is different from audio codecs such as AAC (AAC=Advanced Audio Coding), which utilize an MDCT, a critically sampled lapped transform.
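The difference between the two filterbanks can be expressed as a ratio of spectral values to new input samples. The short Python sketch below works through this arithmetic; the frame length of 1024 samples is an illustrative assumption, the helper names (`tcx_dft_length`, `sampling_ratio`) are hypothetical, and the DFT length is counted as the number of spectral values, following the description above.

```python
from fractions import Fraction


def tcx_dft_length(frame_length, overlap_fraction=Fraction(1, 8)):
    """DFT length for a TCX frame: the new samples plus the overlap region."""
    # Assumes frame_length is divisible by the overlap denominator.
    return int(frame_length * (1 + overlap_fraction))


def sampling_ratio(spectral_values, new_input_samples):
    """Spectral values produced per new time-domain input sample."""
    return Fraction(spectral_values, new_input_samples)


frame_length = 1024  # illustrative frame length (assumption)

# TCX: DFT over the frame plus 1/8 overlap, so more values than new samples.
dft_ratio = sampling_ratio(tcx_dft_length(frame_length), frame_length)  # 9/8

# MDCT: N coefficients per N new samples, i.e. critical sampling.
mdct_ratio = sampling_ratio(frame_length, frame_length)  # 1
```

A ratio of exactly 1 corresponds to critical sampling, whereas the ratio of 9/8 means that the TCX filterbank has to represent 12.5% more values than there are new input samples.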
The Dolby E codec is described in Fielder, Louis D.; Todd, Craig C., “The Design of a Video Friendly Audio Coding System for Distribution Applications”, Paper Number 17-008, The AES 17th International Conference: High-Quality Audio Coding (August 1999) and Fielder, Louis D.; Davidson, Grant A., “Audio Coding Tools for Digital Television Distribution”, Preprint Number 5104, 108th Convention of the AES (January 2000). The Dolby E codec utilizes the MDCT filterbank. In the design of this codec, special focus was put on the possibility to perform editing in the coding domain. To achieve this, special alias-free windows are used. At the boundaries of these windows, a smooth cross-fade or splicing of different signal portions is possible. The above-referenced documents outline, for example, cf. section 3 of “The Design of a Video Friendly Audio Coding System for Distribution Applications”, that this would not be possible by simply using the usual MDCT windows, which introduce time-domain aliasing. However, it is also described that the removal of aliasing comes at the cost of an increased number of transform coefficients, indicating that the resulting filterbank no longer has the property of critical sampling.