Audio encoders/decoders have been known for a long time. In particular, audio encoders/decoders operating according to the standard ISO/IEC 11172-3, also known as the MP3 standard, are referred to as transform encoders. Such an MP3 encoder receives a sequence of time samples as an input signal, which is subjected to a windowing. The windowing leads to sequential blocks of time samples which are then converted into a spectral representation block by block. According to the MP3 standard, this conversion is performed with a so-called hybrid filter bank. The first stage of the hybrid filter bank is a filter bank having 32 channels in order to generate 32 subband signals. The subband filters of this first stage comprise overlapping passbands, which is why this filtering is prone to aliasing. The second stage is an MDCT stage which divides the 32 subband signals into 576 spectral values. The spectral values are then quantized under consideration of the psychoacoustic model and subsequently Huffman encoded in order to finally obtain a sequence of bits including a stream of Huffman code words and side information for decoding.
On the decoder side, the Huffman code words are then calculated back into quantization indices. A requantization leads to spectral values which are then fed into a hybrid synthesis filter bank which is implemented analogously to the analysis filter bank, in order to again obtain blocks of time samples of the encoded and again decoded audio signal. All steps on the encoder side and on the decoder side are presented in the MP3 standard. With regard to the terminology, it is noted that in the following, reference is also made to an “inverse quantization”. Although a quantization is not invertible, as it involves an irretrievable data loss, the expression inverse quantization is often used and denotes the requantization described above.
Also an audio encoder/decoder algorithm called AAC (AAC=Advanced Audio Coding) is known in the art. Such an encoder, standardized in the international standard ISO/IEC 13818-7, again operates on the basis of time samples of an audio signal. The time samples of the audio signal are again subjected to a windowing in order to obtain sequential blocks of windowed time samples. In contrast to the MP3 encoder, in which a hybrid filter bank is used, in the AAC encoder one single MDCT transformation is performed in order to obtain a sequence of blocks of MDCT spectral values. These MDCT spectral values are then again quantized on the basis of a psychoacoustic model, and the quantized spectral values are finally Huffman encoded. On the decoder side, processing proceeds correspondingly. The Huffman code words are decoded, and the quantization indices or quantized spectral values, respectively, obtained therefrom are then requantized or inversely quantized, respectively, to finally obtain spectral values that may be supplied to an MDCT synthesis filter bank in order to finally obtain encoded/decoded time samples again.
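The windowing, MDCT analysis and MDCT synthesis chain described above may be illustrated by the following minimal sketch. It is not taken from either standard: the sine window, the block length of 8 spectral values, the zero padding at the signal borders and all function names are assumptions chosen for illustration, and quantization and Huffman coding are left out, so the round trip is lossless up to floating-point error.

```python
import math

def sine_window(two_n):
    # Assumed window; it satisfies w[n]^2 + w[n + N]^2 = 1 for 50 percent overlap.
    return [math.sin(math.pi * (n + 0.5) / two_n) for n in range(two_n)]

def mdct(block):
    """Transform 2N windowed time samples into N spectral values."""
    two_n = len(block)
    n_half = two_n // 2
    n0 = 0.5 + n_half / 2
    return [sum(block[n] * math.cos(math.pi / n_half * (n + n0) * (k + 0.5))
                for n in range(two_n))
            for k in range(n_half)]

def imdct(spec):
    """Transform N spectral values into 2N time-aliased samples."""
    n_half = len(spec)
    n0 = 0.5 + n_half / 2
    return [(2.0 / n_half) * sum(spec[k] * math.cos(math.pi / n_half * (n + n0) * (k + 0.5))
                                 for k in range(n_half))
            for n in range(2 * n_half)]

def analysis(signal, n_half):
    """Windowing with 50 percent overlap; len(signal) must be a multiple of n_half."""
    w = sine_window(2 * n_half)
    padded = [0.0] * n_half + list(signal) + [0.0] * n_half
    return [mdct([w[i] * padded[start + i] for i in range(2 * n_half)])
            for start in range(0, len(padded) - n_half, n_half)]

def synthesis(spectra, n_half):
    """Inverse MDCT, windowing and overlap-add; the aliasing terms cancel."""
    w = sine_window(2 * n_half)
    out = [0.0] * ((len(spectra) + 1) * n_half)
    for b, spec in enumerate(spectra):
        y = imdct(spec)
        for i in range(2 * n_half):
            out[b * n_half + i] += w[i] * y[i]
    return out[n_half:-n_half]

signal = [math.sin(0.3 * n) + 0.05 * n for n in range(32)]
spectra = analysis(signal, 8)
restored = synthesis(spectra, 8)
print(max(abs(a - b) for a, b in zip(signal, restored)))
```

Each block of 16 windowed time samples yields only 8 spectral values, yet the overlap-add of neighboring blocks restores the input. This is the property that a quantizer placed between analysis and synthesis then disturbs.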
Both methods operate with overlapping blocks and adaptive window functions as described in the experts publication “Codierung von Audiosignalen mit überlappender Transformation und adaptiven Fensterfunktionen”, Bernd Edler, Frequenz, vol. 43, 1989, pp. 252-256.
In particular, when transient areas are detected in the audio signal, a switch is performed from long window functions to short window functions in order to obtain a better time resolution at the expense of a reduced frequency resolution. A sequence of short windows is introduced by a start window and terminated by a stop window. Thereby, a gapless transition from overlapping long window functions to overlapping short window functions may be achieved. Depending on the implementation, the overlapping area with short windows is smaller than the overlapping area with long windows. This is reasonable in view of the fact that transient signal portions are present in the audio signal, but it does not necessarily have to be the case. Thus, sequences of short windows as well as sequences of long windows may be implemented with an overlap of 50 percent. In particular with short windows, however, a reduced overlap width, for example only 10 percent or even less instead of 50 percent, may be selected to improve the encoding of transient signal portions.
Both in the MP3 standard and in the AAC standard, windowing with long and short windows is provided, and the start windows or stop windows, respectively, are scaled such that in general the same block raster may be maintained. For the MP3 standard this means that for each long block 576 spectral values are generated and that three short blocks correspond to one long block. This means that one short block generates 192 spectral values. With an overlap of 50 percent, a window length of 1152 time samples is thus used for windowing, as due to the overlap-and-add principle of a 50 percent overlap, two blocks of time samples always lead to one block of spectral values.
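The block raster figures quoted for the MP3 case follow from simple arithmetic, which the following sketch reproduces. The function name and the dictionary keys are hypothetical. The short window length is merely derived by the same 50 percent overlap reasoning and is not a figure stated in the text.

```python
def block_raster(long_values=576, subbands=32, short_per_long=3):
    """Derive block sizes from the figures quoted above (50 percent overlap assumed)."""
    return {
        # 576 spectral values spread over the 32 subbands of the first stage
        "lines_per_subband": long_values // subbands,
        # three short blocks correspond to one long block
        "short_values": long_values // short_per_long,
        # with 50 percent overlap, a window spans two blocks of new samples
        "long_window": 2 * long_values,
        "short_window": 2 * (long_values // short_per_long),
    }

raster = block_raster()
print(raster)
```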
Both with MP3 encoders and with AAC encoders, a lossy compression takes place. The losses are introduced by the quantization of the spectral values. In particular, the spectral values are quantized so that the distortions introduced by the quantization, also referred to as quantization noise, have an energy which is below the psychoacoustic masking threshold.
The coarser an audio signal is quantized, i.e. the greater the quantizer step size, the higher the quantization noise. On the other hand, however, a coarser quantization has a smaller set of quantizer output values to be considered, so that more coarsely quantized values may be entropy encoded using fewer bits. This means that a coarser quantization leads to a higher data compression, but simultaneously to higher signal losses.
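The trade-off between step size, quantization noise and entropy-coded bit demand may be illustrated numerically. The sketch below makes several assumptions that are not part of the text: a uniform quantizer is used instead of the quantizers of the standards, the "spectrum" is a synthetic Gaussian sequence, and the empirical first-order entropy of the indices stands in for the bit demand of a Huffman coder.

```python
import math
import random
from collections import Counter

def quantize(values, step):
    # Uniform quantizer (assumption): map each value to an integer index.
    return [round(v / step) for v in values]

def noise_energy(values, indices, step):
    # Mean squared error between the original and the requantized values.
    return sum((v - i * step) ** 2 for v, i in zip(values, indices)) / len(values)

def entropy_bits(indices):
    # Empirical entropy in bits per value, a proxy for the entropy coder output.
    counts = Counter(indices)
    total = len(indices)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

rng = random.Random(0)  # fixed seed so that the run is repeatable
spectrum = [rng.gauss(0.0, 10.0) for _ in range(2000)]  # synthetic stand-in data

def report(step):
    idx = quantize(spectrum, step)
    return noise_energy(spectrum, idx, step), entropy_bits(idx)

for step in (1.0, 4.0, 8.0):
    noise, bits = report(step)
    print(f"step={step:4.1f}  noise energy={noise:7.3f}  entropy={bits:5.2f} bits/value")
```

A larger step size yields fewer distinct indices and hence a lower entropy, while the noise energy grows.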
These signal losses are unproblematic if they are below the masking threshold. Even if the psychoacoustic masking threshold is exceeded only slightly, this may not yet lead to audible interference for unskilled listeners. In any case, however, an information loss takes place which may be undesired, for example due to artifacts which may be audible in certain situations.
In particular with broadband data connections, or when the data rate is not the decisive parameter, or when both broadband and narrowband data networks are available, it may be desirable to have not a lossy but a lossless or almost lossless compressed representation of an audio signal.
Such a scalable encoder schematically illustrated in FIG. 7 and an associated decoder schematically illustrated in FIG. 8 are known from the experts publication “INTMDCT—A Link Between Perceptual And Lossless Audio Coding”, Ralf Geiger, Jürgen Herre, Jürgen Koller, Karlheinz Brandenburg, Int. Conference on Acoustics Speech and Signal Processing (ICASSP), 13-17 May, 2002, Orlando, Fla. A similar technology is described in the European Patent EP 1 495 464 B1. The elements 71, 72, 73, 74 illustrate an AAC encoder in order to generate a lossy encoded bit stream referred to as “perceptually coded bitstream” in FIG. 7. This bit stream represents the base layer. In particular, block 71 in FIG. 7 designates the analysis filter bank including the windowing with long and short windows according to the AAC standard. Block 73 represents the quantization/encoding according to the AAC standard and block 74 represents the bit stream generation so that the bit stream on the output side not only includes Huffman code words of quantized spectral values but also the side information, like for example scale factors, etc., so that a decoding may be performed. The lossy quantization in block 73 is here controlled by the psychoacoustic model designated as the “perceptual model” 72 in FIG. 7.
As already indicated, the output signal of block 74 is a base scaling layer which necessitates relatively few bits but is only a lossy representation of the original audio signal and may comprise encoder artifacts. The blocks 75, 76, 77, 78 represent the additional elements which are needed to generate an extension bit stream which is lossless or virtually lossless, as indicated in FIG. 7. In particular, the original audio signal at the input 70 is subjected to an integer MDCT (IntMDCT), as illustrated by block 75. Further, the quantized spectral values generated by block 73, into which encoder losses are already introduced, are subjected to an inverse quantization and to a subsequent rounding in order to obtain rounded spectral values. These are supplied to a difference former 77 forming a spectral-value-wise difference which is then subjected to an entropy coding in block 78 in order to generate a lossless enhancement bit stream of the scaling scheme in FIG. 7. The spectrum of differential values at the output of block 77 thus represents the distortion introduced by the psychoacoustic quantization in block 73.
On the decoder side, the lossy coded bit stream or the perceptually coded bit stream is supplied to a bit stream decoder 81. On the output side, block 81 provides a sequence of blocks of quantized spectral values which are then subjected to an inverse quantization in a block 82. At the output of block 82, inversely quantized spectral values are thus present which, in contrast to the values at the input of block 82, no longer represent quantizer indices but are, so to speak, “correct” spectral values which, however, differ from the spectral values before the encoding in block 73 of FIG. 7 due to the lossy quantization. These inversely quantized spectral values are now supplied to a synthesis filter bank or an inverse MDCT transformation (inverse MDCT), respectively, in block 83 to obtain a psychoacoustically encoded and again decoded audio signal (perceptual audio) which differs from the original audio signal at the input 70 of FIG. 7 due to the encoding errors introduced by the encoder of FIG. 7. In order to obtain not only a lossy but also a lossless compression, the output signal of block 82 is supplied to a rounding in a block 84. In an adder 85, the rounded, inversely quantized spectral values are then added to the differential values which were generated by the difference former 77, wherein in a block 86 an entropy decoding is performed beforehand to decode the entropy code words contained in the extension bit stream containing the lossless or virtually lossless information, respectively.
At the output of block 85, IntMDCT spectral values are thus present which are in the optimum case identical to the IntMDCT spectral values at the output of block 75 of the encoder of FIG. 7. These are then subjected to an inverse integer MDCT (inverse IntMDCT) to obtain a losslessly or virtually losslessly coded audio signal (lossless audio) at the output of block 87.
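The interplay of the base layer and the extension layer in FIG. 7 and FIG. 8 may be summarized by the following sketch. It is a strong simplification under stated assumptions: a uniform quantizer with a fixed step size replaces the psychoacoustically controlled quantizer, entropy coding and bit stream formatting are omitted, and the two input spectra are hand-picked example values rather than outputs of a real MDCT or IntMDCT. All function names are hypothetical.

```python
def quantize(values, step):
    # Stand-in for the lossy quantizer of block 73 (uniform quantizer assumed).
    return [round(v / step) for v in values]

def predict(indices, step):
    # Inverse quantization followed by rounding to integers.
    return [round(i * step) for i in indices]

def encode(mdct_spectrum, int_spectrum, step):
    """Return (base layer indices, extension layer residual)."""
    indices = quantize(mdct_spectrum, step)
    predicted = predict(indices, step)
    # Difference former 77: spectral-value-wise difference to the IntMDCT values.
    residual = [c - p for c, p in zip(int_spectrum, predicted)]
    return indices, residual

def decode(indices, residual, step):
    """Rebuild the IntMDCT values from both layers (blocks 82, 84 and adder 85)."""
    predicted = predict(indices, step)
    return [p + r for p, r in zip(predicted, residual)]

mdct_spectrum = [103.4, -57.8, 12.1, 0.6, -3.3]  # example MDCT values (assumed)
int_spectrum = [103, -58, 12, 1, -3]             # example IntMDCT values (assumed)
step = 8.0

indices, residual = encode(mdct_spectrum, int_spectrum, step)
print(indices, residual, decode(indices, residual, step))
```

Because the decoder repeats exactly the same inverse quantization and rounding as the encoder, adding the residual restores the integer spectrum without loss. The residual stays small only as long as the two spectra are similar.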
The integer MDCT (IntMDCT) is an approximation of the MDCT which, however, generates integer output values. It is derived from the MDCT using the lifting scheme. This works in particular when the MDCT is divided into so-called Givens rotations. Then, a two-stage algorithm with Givens rotations and a subsequent DCT-IV results as the integer MDCT on the encoder side, and with a DCT-IV and a downstream number of Givens rotations on the decoder side. In the scheme of FIG. 7 and FIG. 8, the quantized MDCT spectrum generated in the AAC encoder is thus used to predict the integer MDCT spectrum. In general, the integer MDCT is thus an example of an integer transformation generating integer spectral values and again time samples from the integer spectral values, without losses being introduced by rounding errors. Other integer transformations exist apart from the integer MDCT.
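The way the lifting scheme turns a Givens rotation into an exactly invertible integer operation may be sketched as follows. The rotation is factored into three shear (lifting) steps, and the update term of each step is rounded. The function names are hypothetical, the angle is assumed not to be a multiple of pi, and the sketch covers a single rotation only, not the complete IntMDCT.

```python
import math

def lifting_coeffs(angle):
    # Rotation = shear(t) * shear(s) * shear(t) with t = (cos a - 1) / sin a.
    s = math.sin(angle)
    t = (math.cos(angle) - 1.0) / s  # assumes sin(angle) != 0
    return t, s

def int_rotate(x, y, angle):
    """Integer approximation of (x cos a - y sin a, x sin a + y cos a)."""
    t, s = lifting_coeffs(angle)
    x = x + round(t * y)  # first lifting step, update term rounded
    y = y + round(s * x)  # second lifting step
    x = x + round(t * y)  # third lifting step
    return x, y

def int_rotate_inverse(x, y, angle):
    """Undo int_rotate exactly by subtracting the same rounded terms in reverse order."""
    t, s = lifting_coeffs(angle)
    x = x - round(t * y)
    y = y - round(s * x)
    x = x - round(t * y)
    return x, y

angle = 0.3
rotated = int_rotate(100, -37, angle)
print(rotated, int_rotate_inverse(*rotated, angle))
```

Each lifting step changes only one of the two values by a rounded function of the other, so the inverse can recompute the identical rounded term and subtract it. The rounding therefore causes a small deviation from the real-valued rotation but no loss of information.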
The scaling scheme indicated in FIGS. 7 and 8 is only sufficiently efficient when the differences at the output of the difference former 77 are small. In the scheme illustrated in FIG. 7 this is the case, as the MDCT and the integer MDCT are similar and as the IntMDCT in block 75 is derived from the MDCT in block 71, respectively. If this were not the case, the scheme illustrated there would not be suitable, as the differential values would then in many cases be greater than the original MDCT values or even greater than the original IntMDCT values. The scaling scheme in FIG. 7 would then lose its value, as the extension scaling layer output by block 78 would have a high redundancy regarding the base scaling layer.
A scalability scheme is optimal when the sum of the number of bits in the base layer and the number of bits in the extension layer is equal to the number of bits which would be obtained if the base layer already were a lossless encoding. This optimum case is never achieved in practical scalability schemes, as additional signaling bits are necessitated for the extension layer. This optimum is, however, aimed at as far as possible. As the transformations in blocks 71 and 75 are relatively similar in FIG. 7, the concept illustrated in FIG. 7 is close to optimum.
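The optimality condition stated above may be written compactly as follows. The symbols are not part of the original text and are introduced here only for illustration: R_base and R_ext denote the number of bits of the base layer and of the extension layer, R_lossless denotes the number of bits of a single-layer lossless encoding, and R_overhead denotes the additional bits, including at least the signaling bits, which a practical scheme necessitates.

```latex
\begin{aligned}
\text{optimum case:}   \quad & R_{\text{base}} + R_{\text{ext}} = R_{\text{lossless}} \\
\text{practical case:} \quad & R_{\text{base}} + R_{\text{ext}} = R_{\text{lossless}} + R_{\text{overhead}},
\qquad R_{\text{overhead}} > 0
\end{aligned}
```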
This simple scalability concept may not, however, be applied directly to the output signal of an MP3 encoder. As was illustrated, the MP3 encoder does not comprise a pure MDCT filter bank but the hybrid filter bank having a first filter bank stage for generating different subband signals and a downstream MDCT for further breaking down the subband signals, wherein in addition, as is also indicated in the MP3 standard, an additional aliasing cancellation stage of the hybrid filter bank is implemented. As the integer MDCT in block 75 of FIG. 7 has little similarity with the hybrid filter bank according to the MP3 standard, a direct application of the concept shown in FIG. 7 to an MP3 output signal would lead to very high differential values at the output of the difference former 77. This would result in an extremely inefficient scalability concept, as the extension layer would necessitate far too many bits in order to reasonably encode the differential values at the output of the difference former 77.
A possibility for generating the extension bit stream for an MP3 output signal is illustrated in FIG. 9 for the encoder and in FIG. 10 for the decoder. An MP3 encoder 90 encodes an audio signal and provides a base layer 91 on the output side. The MP3 encoded audio signal is then supplied to an MP3 decoder 92 providing a lossy audio signal in the time domain. This signal is then supplied to an IntMDCT block which may in principle be set up just like block 75 in FIG. 7. This block 75 then provides IntMDCT spectral values on the output side which are supplied to a difference former 77, which also receives, as further input values, IntMDCT spectral values which were, however, generated not from the MP3 decoded audio signal but from the original audio signal which was supplied to the MP3 encoder 90.
On the decoder side, the base layer is again supplied to an MP3 decoder 92 to provide a lossy decoded audio signal at an output 100 which would correspond to the signal at the output of block 83 of FIG. 8. This signal would then have to be subjected to an integer MDCT 75 in order to then be combined with the extension layer 93 which was generated at the output of the difference former 77. The lossless spectrum would then be present at an output 101 of the adder 102 and would only have to be converted into the time domain by means of an inverse IntMDCT 103 in order to obtain a losslessly decoded audio signal which would correspond to the “lossless audio” at the output of block 87 of FIG. 8.
The concept illustrated in FIG. 9 and in FIG. 10, which provides a relatively efficiently encoded extension layer just like the concept illustrated in FIGS. 7 and 8, is expensive both on the encoder side (FIG. 9) and on the decoder side (FIG. 10). In contrast to the concept in FIG. 7, a complete MP3 decoder 92 and an additional IntMDCT 75 are necessitated.
Another disadvantage of this scheme is that a bit-accurate MP3 decoder would have to be defined. This is not intended, however, as the MP3 standard does not represent a bit-accurate specification but only has to be fulfilled by a decoder within the scope of a “conformance”.
On the decoder side, a complete additional IntMDCT stage 75 is further necessitated. Both additional elements cause computational overhead and are disadvantageous in particular for use in mobile devices, with regard to chip area and power consumption as well as with regard to the associated delay.
In summary, the advantages of the concept illustrated in FIG. 7 and FIG. 8 are that, compared to time domain methods, no complete decoding of the perceptually encoded signal is necessitated, and that an efficient encoding is obtained by representing the quantization error to be encoded additionally in the frequency domain. Thus, the method standardized by ISO/IEC MPEG-4 Scalable Lossless Coding (SLS) uses this approach, as described in R. Geiger, R. Yu, J. Herre, S. Rahardja, S. Kim, X. Lin, M. Schmidt, “ISO/IEC MPEG-4 High-Definition Scalable Advanced Audio Coding”, 120th AES meeting, May 20-23, 2006, Paris, France, Preprint 6791. Thus, a backward compatible, lossless extension of audio encoding methods which use the MDCT as a filter bank, for example MPEG-2/4 AAC, is obtained.
This approach may not, however, be applied directly to the widely used method MPEG-1/2 Layer 3 (MP3), as the hybrid filter bank used in this method, in contrast to the MDCT, is not compatible with the IntMDCT or another integer transformation. Thus, a difference formation between the decoded spectral values and the corresponding IntMDCT values in general does not lead to small differential values and thus not to an efficient encoding of the differential values. The core of the problem here is the time shifts between the corresponding modulation functions of the IntMDCT and the MP3 hybrid filter bank. These lead to phase shifts which in unfavorable cases even cause the differential values to be greater than the IntMDCT values. An application of the principles underlying the IntMDCT, like for example the lifting scheme, to the hybrid filter bank of MP3 is also problematic, as the hybrid filter bank, in contrast to the MDCT, is by its basic approach a filter bank which provides no perfect reconstruction.
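The effect of a time shift between modulation functions may be illustrated with a deliberately simple toy example. It does not model the IntMDCT or the MP3 hybrid filter bank: a stationary cosine is merely projected onto one cosine modulation function and onto a time-shifted copy of it. The signal length, the frequency, the shift values and the function name are assumptions chosen so that the result is easy to verify.

```python
import math

N = 64
OMEGA = 2.0 * math.pi * 8 / N  # eight full periods within the block (assumed)

def projection(signal, omega, shift):
    """Coefficient obtained with a modulation function shifted by `shift` samples."""
    return sum(x * math.cos(omega * (n + shift)) for n, x in enumerate(signal))

signal = [math.cos(OMEGA * n) for n in range(N)]

reference = projection(signal, OMEGA, 0)  # coefficient of the first transform
for shift in (0, 2, 4):
    shifted = projection(signal, OMEGA, shift)  # coefficient of the shifted transform
    print(f"shift={shift}  reference={reference:7.2f}  shifted={shifted:7.2f}  "
          f"difference={reference - shifted:7.2f}")
```

A shift of four samples corresponds to a phase shift of pi at this frequency, so the two coefficients have opposite signs and their difference is twice as large as the coefficient itself. In such a case, encoding the difference would be more expensive than encoding the value directly.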