Modern audio-encoding methods, such as MPEG Layer 3 (MP3) or MPEG AAC, use transforms, for example the so-called modified discrete cosine transform (MDCT), so as to obtain a block-wise frequency representation of an audio signal. Such an audio encoder usually receives a stream of time-discrete audio sampled values. This stream of audio sampled values is windowed so as to obtain a windowed block of, for example, 1024 or 2048 windowed audio sampled values. For windowing, various window functions are used, such as, for example, a sine window.
The windowed time-discrete audio sampled values are then converted into a spectral representation by means of a filter bank. In principle, a Fourier transform or, for special reasons, a variant of the Fourier transform, such as an FFT or, as mentioned above, an MDCT, may be used. The block of audio spectral values at the output of the filter bank may then be subjected to further processing as required. In the above-specified audio encoders, the audio spectral values are subsequently quantized, with the quantizer step sizes typically being selected such that the quantizing noise introduced by the quantizing lies below the psycho-acoustic masking threshold, i.e. is “masked away”. Quantizing is a lossy encoding. In order to reduce the amount of data further, the quantized spectral values are then subjected to entropy encoding by means of Huffman encoding. By adding side information, such as scale factors, a bit stream multiplexer forms from the entropy-encoded quantized spectral values a bit stream, which may be stored or transmitted.
In the audio decoder, the bit stream is split into encoded quantized spectral values and side information by means of a bit stream demultiplexer. The entropy-encoded quantized spectral values are first entropy-decoded so as to obtain the quantized spectral values. The quantized spectral values are then inversely quantized so as to obtain decoded spectral values, which comprise quantizing noise that, however, lies below the psycho-acoustic masking threshold and will therefore not be heard. These spectral values are then converted into a time representation by means of a synthesis filter bank so as to obtain time-discrete decoded audio sampled values. In the synthesis filter bank, a transform algorithm inverse to the transform algorithm of the encoder has to be employed. Moreover, after the frequency-time retransform, the windowing has to be cancelled.
In order to obtain a good frequency selectivity, modern audio encoders typically use block overlapping. Such a case is represented in FIG. 10a. At first, for example, 2048 time-discrete audio sampled values are taken and windowed by a means 402. The window embodied by the means 402 has a window length of 2N sampled values and provides a block of 2N windowed sampled values at its output. In order to obtain a window overlap, a second block of 2N windowed sampled values is formed by a means 404, which, just for the sake of clarity, is represented separately from the means 402 in FIG. 10a. The 2048 sampled values fed into the means 404, however, are not the time-discrete audio sampled values immediately following the first window, but include the second half of the sampled values windowed by the means 402 and, in addition, only 1024 new sampled values. In FIG. 10a, the overlapping is symbolically represented by a means 406, which causes a degree of overlapping of 50%. Both the 2N windowed sampled values output by the means 402 and the 2N windowed sampled values output by the means 404 are then subjected to the MDCT algorithm by a means 408 and a means 410, respectively. The means 408 provides N spectral values for the first window in accordance with the prior art MDCT algorithm, while the means 410 also provides N spectral values, however for the second window, with an overlap of 50% existing between the first window and the second window.
In the decoder, the N spectral values of the first window, as is shown in FIG. 10b, are fed to a means 412, which carries out an inverse modified discrete cosine transform. The same applies to the N spectral values of the second window, which are fed to a means 414, which also carries out an inverse modified discrete cosine transform. The means 412 provides 2N sampled values for the first window, and the means 414 provides 2N sampled values for the second window.
A means 416, which is referred to as TDAC (TDAC=time domain aliasing cancellation) in FIG. 10b, considers the fact that the two windows are overlapping. In particular, a sampled value y1 of the second half of the first window, i.e. with an index N+k, is summed with a sampled value y2 from the first half of the second window, i.e. with an index k, such that, at the output-side, i.e. in the decoder, N decoded time sampled values will result.
It should be appreciated that the function of the means 416, which may also be referred to as an add function, automatically takes into account the windowing carried out in the encoder schematically represented by FIG. 10a, such that no explicit “inverse windowing” has to take place in the decoder represented by FIG. 10b.
If the window function implemented by the means 402 or 404 is designated w(k), with the index k representing the time index, the condition has to be fulfilled that the squared window weight w(k) added to the squared window weight w(N+k) results in 1, i.e. w(k)² + w(N+k)² = 1, with k ranging from 0 to N−1. If a sine window is used, the window weightings of which follow the first half wave of the sine function, this condition is always fulfilled, since the square of the sine and the square of the cosine always add up to 1 for each angle.
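The interplay of windowing, MDCT, inverse MDCT and the add function of FIG. 10b can be illustrated by the following minimal sketch. It is not taken from the cited prior art. It assumes a sine window w(k) = sin(π(k + 0.5)/(2N)), a direct (slow) evaluation of the MDCT sums, a scaling of 2/N in the inverse transform, and the common convention of applying the same window once more after the inverse transform, which is what makes the squared-window condition above sufficient. All function names are chosen for illustration only.

```python
import math

def sine_window(N):
    # Assumed sine window of length 2N: first half wave of the sine function.
    return [math.sin(math.pi * (k + 0.5) / (2 * N)) for k in range(2 * N)]

def mdct(block):
    # 2N windowed sampled values in, N spectral values out (direct evaluation).
    N = len(block) // 2
    return [sum(block[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(spectrum):
    # N spectral values in, 2N time-aliased sampled values out.
    N = len(spectrum)
    return [2.0 / N * sum(spectrum[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                          for k in range(N))
            for n in range(2 * N)]

def encode_decode(samples, N):
    # Blocks of 2N sampled values with 50% overlap; the add step cancels the aliasing.
    w = sine_window(N)
    out = [0.0] * len(samples)
    for start in range(0, len(samples) - 2 * N + 1, N):
        block = [samples[start + n] * w[n] for n in range(2 * N)]
        y = imdct(mdct(block))
        for n in range(2 * N):
            out[start + n] += w[n] * y[n]   # second half of one block + first half of the next
    return out

N = 8
samples = [((7 * i) % 23) - 11 for i in range(4 * N)]   # arbitrary integer test signal
decoded = encode_decode(samples, N)
```

Only the sampled values covered by two overlapping windows (here the indices N to 3N−1) are reconstructed; the first and last N values of the signal lack an overlapping partner in this sketch.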
A disadvantage of the windowing method described with reference to FIG. 10a, with the subsequent MDCT function, is the fact that the windowing is achieved by multiplying each time-discrete sampled value by a floating-point number, when considering for example a sine window, since the sine of an angle between 0 and 180 degrees, apart from the angle of 90 degrees, does not result in an integer. Even if integer time-discrete sampled values are windowed, floating-point numbers will result after windowing.
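The effect can be seen in a small numerical sketch. The window length, the sine window formula and the sampled values below are assumptions chosen purely for illustration.

```python
import math

N = 4
# Assumed sine window of length 2N.
window = [math.sin(math.pi * (k + 0.5) / (2 * N)) for k in range(2 * N)]

# Integer time-discrete sampled values (arbitrary illustrative data).
samples = [100, -50, 25, 7, 0, -3, 12, 64]

# Windowing multiplies every integer sampled value by a non-integer weight.
windowed = [s * w for s, w in zip(samples, window)]

# Collect the windowed values that are no longer integers.
non_integer = [v for v in windowed if v != round(v)]
```

With these values, only the sampled value 0 remains an integer after windowing; all other products are floating-point numbers.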
Therefore, even if no psycho-acoustic encoder is used, i.e. if lossless encoding is to be achieved, quantizing is necessary at the output of the means 408 and 410 so as to be able to carry out a reasonably manageable entropy-encoding process.
If, therefore, known transforms, as described with reference to FIG. 10a, are to be employed for lossless audio encoding, either a very fine quantizing has to be employed so that the error resulting from the rounding of the floating-point numbers can be neglected, or the error signal has to be additionally encoded, for example in the time domain.
Concepts of the first kind, that is, concepts in which the quantization is so fine that the error resulting from the rounding of the floating-point numbers is negligible, are disclosed, for example, in the German patent DE 197 42 201 C1. Here, an audio signal is transferred into its spectral representation and quantized so as to obtain quantized spectral values. The quantized spectral values are then inversely quantized, transferred into the time domain, and compared to the original audio signal. If the error, meaning the error between the original audio signal and the quantized/inversely quantized audio signal, lies above an error threshold, the quantizer is set more finely in a feedback-like manner, and the comparison is carried out anew. The iteration is finished when the error falls below the error threshold. The residual signal that may still exist is encoded with a time domain encoder and written into a bit stream, which, in addition to the time domain-encoded residual signal, also includes the encoded spectral values that have been quantized in accordance with the quantizer settings in effect when the iteration was terminated. It should be appreciated that the quantizer used does not have to be controlled by a psycho-acoustic model, so that the encoded spectral values are typically quantized more precisely than would be required on the basis of the psycho-acoustic model.
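The feedback loop described above can be sketched as follows. This is only an illustration under stated assumptions, not the method of the cited patent: a pairwise orthonormal Haar step stands in for the actual time-frequency transform, the quantizer is a uniform quantizer whose step size is halved in each iteration, and the error measure is the maximum absolute deviation. All names are hypothetical.

```python
import math

def haar_forward(x):
    # Hypothetical stand-in for the time-frequency transform (len(x) must be even).
    s = math.sqrt(0.5)
    sums = [s * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    diffs = [s * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return sums + diffs

def haar_inverse(c):
    # Inverse of haar_forward.
    s = math.sqrt(0.5)
    half = len(c) // 2
    out = []
    for a, d in zip(c[:half], c[half:]):
        out += [s * (a + d), s * (a - d)]
    return out

def refine_quantizer(signal, threshold, step=64.0):
    # Quantize, inversely quantize, compare in the time domain, refine until below threshold.
    spectrum = haar_forward(signal)
    while True:
        indices = [round(v / step) for v in spectrum]           # quantizing
        recon = haar_inverse([i * step for i in indices])       # inverse quantizing + retransform
        error = max(abs(a - b) for a, b in zip(signal, recon))  # comparison with the original
        if error < threshold:
            break
        step /= 2.0                                             # set the quantizer more finely
    # Residual that would be passed to a time domain encoder (integer by construction).
    residual = [a - round(b) for a, b in zip(signal, recon)]
    return indices, step, residual

signal = [12, -7, 30, 5, -18, 2, 9, -1]   # arbitrary integer sampled values
indices, step, residual = refine_quantizer(signal, threshold=2.0)
```

A decoder that knows `indices`, `step` and `residual` can restore the integer signal exactly by rounding the retransformed values and adding the residual.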
In the technical publication “A Design of Lossy and Lossless Scalable Audio Coding”, T. Moriya et al., Proc. ICASSP, 2000, a scalable encoder is described which comprises, as a first lossy data compression module, an MPEG encoder, for example, which has a block-wise digital waveform as an input signal and which generates the compressed bit code. In a local decoder, which is also present, the encoding is reversed, and an encoded/decoded signal is generated. This signal is compared to the original input signal by subtracting the encoded/decoded signal from the original input signal. The error signal is then fed into a second module, where a lossless bit conversion is used. This conversion has two steps. The first step consists of converting a two's complement format into a sign-magnitude format. The second step consists of converting a vertical magnitude sequence into a horizontal bit sequence in a processing block. The lossless data conversion is carried out so as to maximize the number of signals or to maximize the number of succeeding zeroes in a sequence, so as to achieve the best possible compression of the time error signal, which is available in the form of digital numbers. This principle is based on a Bit Slice Arithmetic Coding scheme (BSAC scheme), which is presented in the technical publication “Multi-Layer Bit Sliced Bit Rate Scalable Audio Coder”, 103rd AES Convention, pre-print No. 4520, 1997.
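The two conversion steps can be illustrated by a short sketch. It is not taken from the cited publication: the word width of 8 bits, the error values and the reading of the "vertical to horizontal" conversion as a transposition of the bit matrix are assumptions made for illustration.

```python
def twos_complement_bits(value, width):
    # Two's complement bit string of the given width.
    return format(value & ((1 << width) - 1), "0{}b".format(width))

def sign_magnitude_bits(value, width):
    # One sign bit followed by width-1 magnitude bits.
    sign = "1" if value < 0 else "0"
    return sign + format(abs(value), "0{}b".format(width - 1))

def horizontal_slices(words):
    # Transpose: slice i collects bit i of every word (assumed meaning of the second step).
    return ["".join(column) for column in zip(*words)]

errors = [-1, 2, 0, -3]   # small time-domain error values (illustrative)
width = 8

twos = [twos_complement_bits(v, width) for v in errors]
sign_mag = [sign_magnitude_bits(v, width) for v in errors]
slices = horizontal_slices(sign_mag)

zeros_twos = sum(word.count("0") for word in twos)
zeros_sign_mag = sum(word.count("0") for word in sign_mag)
```

For small negative error values, the two's complement words consist mostly of ones, whereas the sign-magnitude words consist mostly of zeroes. After slicing, the upper magnitude slices are all zero, which favours the subsequent compression.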
The above-mentioned BSAC publication discloses an encoder similar to the one represented in FIG. 8. A time signal is fed into a block 80, which is labelled “Windows” and carries out windowing and a time-frequency translation. Typically, an MDCT (MDCT=modified discrete cosine transform) is used in block 80. The MDCT spectral values generated by the block 80 are then quantized in a block 82 so as to obtain quantized spectral values in binary form. The quantizing in block 82 is controlled by a means 84, which calculates a masking threshold using a psycho-acoustic model, with the quantizing in block 82 being carried out such that the quantizing noise remains below the psycho-acoustic masking threshold. In block 85, the quantized spectral values are then arranged on a bit-wise basis, such that the bits of equal order of the quantized spectral values are arranged in one column. In block 86, scaling layers are then formed, with one scaling layer corresponding to one column. A scaling layer therefore comprises the bits of equal order of all quantized spectral values. Subsequently, each scaling layer is successively subjected to arithmetic encoding (block 87), and the scaling layers output by block 87, in their arithmetically encoded form, are fed to a bit stream formation means 88, which provides the scaled/encoded signal at its output, which, apart from the individual scaling layers, also includes side information, as is known.
Generally speaking, the prior art scalable BSAC encoder takes the highest-order bits of all spectral values quantized in accordance with psycho-acoustic aspects, subjects them to arithmetic encoding and then writes them into the bit stream as a first scaling layer. Since there will typically be only very few very large spectral values, very few quantized spectral values will have a highest-order bit equal to “1”.
For generating the second scaling layer, the bits of the second-highest order of all spectral values are taken, subjected to arithmetic encoding and then written into the bit stream as a second scaling layer. This procedure is repeated until the bits of the lowest order of all quantized spectral values have been arithmetically encoded and written into the bit stream as a last scaling layer.
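The formation of the scaling layers can be sketched as follows. The sketch assumes non-negative quantized spectral values with a fixed word length and omits the sign handling, the arithmetic encoding and the side information. The function name and the example values are chosen for illustration only.

```python
def to_scaling_layers(quantized, num_bits):
    # One scaling layer per bit order, highest order first.
    # Each layer holds exactly one bit of every quantized spectral value.
    return [[(value >> order) & 1 for value in quantized]
            for order in range(num_bits - 1, -1, -1)]

quantized = [5, 0, 1, 12]          # illustrative quantized spectral values (4 bits each)
layers = to_scaling_layers(quantized, num_bits=4)
# layers[0] is the first scaling layer (highest-order bits),
# layers[-1] is the last scaling layer (lowest-order bits).
```

In this example only the value 12 contributes a “1” to the first scaling layer, which reflects the observation that few spectral values are large enough to occupy the highest-order bit.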
FIG. 9 shows a scalable decoder for decoding scaled/encoded signals generated by the scalable encoder shown in FIG. 8. The scalable decoder includes a bit stream deformatting means 90, a scaling layer extraction/decoding means 91, an inverse quantizing means 92 as well as a frequency domain/time domain translation means 93 so as to obtain a decoded signal, the quality of which depends on the number of scaling layers selected by the means 91.
In detail, the bit stream deformatting means 90 unpacks the bit stream and provides the various scaling layers in addition to the side information. First, the means 91 arithmetically decodes and stores the first scaling layer. Then, the second scaling layer is arithmetically decoded and stored. This procedure is repeated either until all scaling layers contained in the scaled/encoded signal have been arithmetically decoded and stored, or until the number of scaling layers requested via a control input 94 has been decoded and stored. Thus, the binary patterns for each individual quantized spectral line are successively generated, with these quantized spectral values, which are represented in binary form, being subjected to the inverse quantization 92, taking into account a scale factor etc., so as to obtain inversely quantized spectral values, which have to be translated into the time domain by the means 93 so as to obtain the decoded signal.
When decoding, one bit for each spectral value is thus obtained with each scaling layer. The bits available for each spectral line after decoding five scaling layers are the uppermost five bits. It should be appreciated that, in the case of very small spectral values, the most significant bit of which lies below the uppermost five bit positions, the MSB (MSB=most significant bit) of this spectral line will not be available after decoding five scaling layers, so that further scaling layers have to be processed for a more precise representation of this spectral line.
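The effect of decoding only some of the scaling layers can be illustrated by the following sketch. It assumes already arithmetically decoded layers of non-negative 4-bit values, written out literally here, with two layers decoded instead of five for brevity. The function name and the data are hypothetical.

```python
def from_scaling_layers(layers, num_bits):
    # Rebuild the binary patterns from the decoded layers, highest order first.
    # Bit positions of layers that were not decoded remain zero.
    values = [0] * len(layers[0])
    for index, layer in enumerate(layers):
        shift = num_bits - 1 - index
        values = [v | (bit << shift) for v, bit in zip(values, layer)]
    return values

# Scaling layers of the quantized spectral values 5, 0, 1 and 12 (4 bits each).
all_layers = [[0, 0, 0, 1], [1, 0, 0, 1], [0, 0, 0, 0], [1, 0, 1, 0]]

partial = from_scaling_layers(all_layers[:2], num_bits=4)   # only two layers requested
complete = from_scaling_layers(all_layers, num_bits=4)      # all layers decoded
```

After two layers, the large value 12 is already exact and the value 5 is approximated by 4, whereas the small value 1 is still decoded as 0, because its most significant bit lies in a layer that has not yet been processed.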
The binary representation of spectral values means that, with the MDCT spectral values being for example amplitude values, each additional bit stands for a precision gain of 6 dB for the spectral line.
Thus each additional scaling layer results in an increase in precision of all spectral values by 6 dB.
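The figure of 6 dB follows from the fact that each additional bit doubles the resolution of an amplitude value, and a factor of two in amplitude corresponds to 20·log10(2), i.e. approximately 6.02 dB. A minimal sketch of this arithmetic, with a function name chosen for illustration:

```python
import math

def precision_gain_db(extra_bits):
    # Each bit doubles the amplitude resolution; amplitudes are converted with 20*log10.
    return 20.0 * math.log10(2.0 ** extra_bits)

gain_one_layer = precision_gain_db(1)     # one additional scaling layer
gain_five_layers = precision_gain_db(5)   # five scaling layers
```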
Considering that, at least for noisy signals, the masking threshold of hearing lies only approximately 6 dB below the signal, it becomes apparent that the bit-wise scaling provided by the prior art encoder/decoder concept is problematic in terms of precision. This applies in particular to an efficient encoding of the signal portions which are only just audible, that is, for example, to the lower bits of the spectral values quantized in accordance with psycho-acoustic aspects.
If, for example, due to a bottleneck in the transmission channel, the lowest scaling layer of the scaled/encoded signal output by block 88 of FIG. 8 is not transmitted, this results in a precision loss of 6 dB, which, in an unfavourable constellation, will lead to clearly audible interferences in the decoded signal.