1. Field of the Invention
The present invention relates to audio coding/decoding, and in particular to scalable coding/decoding algorithms with a psychoacoustic first scaling layer and a second scaling layer including ancillary audio data for lossless decoding.
2. Description of the Related Art
Modern audio coding methods, such as MPEG Layer3 (MP3) or MPEG AAC, use transforms, such as the so-called modified discrete cosine transform (MDCT), to obtain a block-wise frequency representation of an audio signal. Such an audio coder usually receives a stream of time-discrete audio samples. The stream of audio samples is windowed to obtain a windowed block of, for example, 1,024 or 2,048 windowed audio samples. For the windowing, various window functions are employed, such as a sine window, etc.
The windowed time-discrete audio samples are then converted to a spectral representation by means of a filter bank. In principle, a Fourier transform or, for special reasons, a variant of the Fourier transform, such as an FFT or, as has been set forth, an MDCT, may be employed for this. The block of audio spectral values at the output of the filter bank may then be processed further as required. In the above-referenced audio coders, a quantization of the audio spectral values follows, wherein the quantization steps are typically chosen so that the quantization noise introduced by the quantizing lies below the psychoacoustic masking threshold, i.e. is “masked away”. The quantization is a lossy coding. In order to obtain a further reduction of the data amount, the quantized spectral values are then entropy coded, for example by means of Huffman coding. By adding side information, such as scale factors etc., a bit stream, which may be stored or transmitted, is formed from the entropy-coded quantized spectral values by means of a bit stream multiplexer.
In the audio decoder, the bit stream is split up into coded quantized spectral values and side information by means of a bit stream de-multiplexer. The entropy-coded quantized spectral values are first entropy decoded to obtain the quantized spectral values. The quantized spectral values are then inversely quantized to obtain decoded spectral values comprising quantization noise, which, however, lies below the psychoacoustic masking threshold and will thus be inaudible. These spectral values are then converted into a temporal representation by means of a synthesis filter bank to obtain time-discrete decoded audio samples. In the synthesis filter bank, a transform algorithm inverse to the transform algorithm used in the coder has to be employed. Moreover, the windowing has to be cancelled after the frequency-time inverse or backward transform.
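The lossy nature of the quantization and inverse quantization steps described above can be illustrated with a minimal sketch. It assumes a plain uniform quantizer with a single step size; actual MP3 or AAC coders use non-uniform quantizers controlled by scale factors, so the function names, the step size and the spectral values below are purely illustrative assumptions.

```python
def quantize(spectral_values, step):
    """Coder side: map each spectral value to an integer index (lossy)."""
    return [round(v / step) for v in spectral_values]


def dequantize(indices, step):
    """Decoder side: inverse quantization back to spectral values."""
    return [i * step for i in indices]


# Made-up spectral values and step size, for illustration only.
spectral_values = [12.7, -3.2, 0.4, 101.9, -57.35]
step = 2.0

indices = quantize(spectral_values, step)      # what would be entropy coded
decoded = dequantize(indices, step)            # what the decoder recovers
errors = [d - v for d, v in zip(decoded, spectral_values)]

# The quantization noise of a uniform quantizer is bounded by half a step.
print(indices)
print(max(abs(e) for e in errors) <= step / 2)
```

In a psychoacoustic coder the step size would be chosen per frequency band so that this error stays below the masking threshold.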
In order to achieve good frequency selectivity, modern audio coders typically use block overlap. Such a case is illustrated in FIG. 4a. At first, for example, 2,048 time-discrete audio samples are taken and windowed by means 402. Means 402, which embodies the window, has a window length of 2N samples and provides a block of 2N windowed samples at the output side. In order to achieve window overlap, a second block of 2N windowed samples is formed by means 404, which is illustrated separately from means 402 in FIG. 4a only for reasons of clarity. The 2,048 samples fed to means 404, however, are not the time-discrete audio samples immediately following the first window, but contain the second half of the samples windowed by means 402 and additionally only 1,024 “new” samples. The overlap is symbolically illustrated by means 406 in FIG. 4a, resulting in a degree of overlap of 50%. Both the 2N windowed samples output by means 402 and the 2N windowed samples output by means 404 are then subjected to the MDCT algorithm by means 408 and 410, respectively. Means 408 provides N spectral values for the first window according to the known MDCT algorithm, whereas means 410 also provides N spectral values, but for the second window, wherein there is an overlap of 50% between the first window and the second window.
In the decoder, the N spectral values of the first window, as shown in FIG. 4b, are fed to means 412 performing an inverse modified discrete cosine transform. The same applies to the N spectral values of the second window. These are fed to means 414, which also performs an inverse modified discrete cosine transform. Means 412 and means 414 provide 2N samples for the first window and 2N samples for the second window, respectively.
In means 416, designated with TDAC (time domain aliasing cancellation) in FIG. 4b, the fact is taken into account that the two windows are overlapping. In particular, a sample y1 of the second half of the first window, i.e. with an index N+k, is summed with a sample y2 from the first half of the second window, i.e. with an index k, so that N decoded temporal samples result at the output side, i.e. in the decoder.
It is to be noted that by the function of means 416, which is also referred to as add function, the windowing performed in the coder schematically illustrated by FIG. 4a is taken into account somewhat automatically, so that in the decoder illustrated by FIG. 4b no explicit “inverse windowing” has to take place.
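The overlap and add behavior described for FIGS. 4a and 4b can be sketched as follows. This is a direct, unoptimized formulation under stated assumptions: the MDCT phase term, the 2/N scaling of the inverse transform, and the application of the same sine window once more before the add are common conventions chosen here for illustration and are not taken from the figures. All function names are hypothetical.

```python
import math


def sine_window(N):
    """Sine window of length 2N (half-sample offset is an assumption)."""
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]


def mdct(block):
    """2N windowed samples -> N spectral values (direct O(N^2) form)."""
    N = len(block) // 2
    return [sum(block[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]


def imdct(spectrum):
    """N spectral values -> 2N time-aliased samples (2/N scaling assumed)."""
    N = len(spectrum)
    return [2.0 / N * sum(spectrum[k]
                          * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                          for k in range(N))
            for n in range(2 * N)]


def decode_overlap(x, N):
    """Code two 50%-overlapping windows of x and recover the N shared samples."""
    w = sine_window(N)
    first = [w[n] * x[n] for n in range(2 * N)]        # role of means 402
    second = [w[n] * x[N + n] for n in range(2 * N)]   # role of means 404
    y1 = imdct(mdct(first))                            # means 408 and 412
    y2 = imdct(mdct(second))                           # means 410 and 414
    # Add stage: second half of window 1 plus first half of window 2.
    return [w[N + k] * y1[N + k] + w[k] * y2[k] for k in range(N)]


N = 8
x = [((7 * n) % 23) - 11 for n in range(3 * N)]   # arbitrary test samples
reconstructed = decode_overlap(x, N)
print(max(abs(r - s) for r, s in zip(reconstructed, x[N:2 * N])))
```

Each inverse transform output on its own contains time domain aliasing; only the sum over the overlapping halves returns the N original samples, up to floating-point rounding.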
When the window function implemented by means 402 or 404 is designated with w(k), wherein the index k represents the time index, the condition has to be met that the squared window weight w(k) and the squared window weight w(N+k) sum to 1, i.e. w(k)^2 + w(N+k)^2 = 1, wherein k runs from 0 to N−1. When a sine window is used, the window weights of which follow the first half-wave of the sine function, this condition is always met, since w(N+k) then corresponds to the cosine of the angle whose sine is w(k), and the square of the sine and the square of the cosine of any angle sum to 1.
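The window condition can be checked numerically with a short sketch. The half-sample offset in the window definition is an assumed, commonly used convention; the text above only requires that the weights follow the first half-wave of the sine function.

```python
import math

N = 1024  # half the window length, matching the 2,048-sample example

# Sine window of length 2N; the (k + 0.5) offset is an assumption.
w = [math.sin(math.pi * (k + 0.5) / (2 * N)) for k in range(2 * N)]

# Largest deviation of w(k)^2 + w(N+k)^2 from 1 over k = 0 .. N-1.
max_deviation = max(abs(w[k] ** 2 + w[N + k] ** 2 - 1.0) for k in range(N))
print(max_deviation)
```

The deviation is limited to floating-point rounding, since w(N+k) equals the cosine of the angle whose sine is w(k).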
A disadvantage of the windowing method with subsequent MDCT described with reference to FIG. 4a is that, when for example a sine window is used, the windowing multiplies each time-discrete sample by a floating-point number, since the sine of an angle between 0 and 180 degrees does not yield an integer, apart from the angle of 90 degrees. Even when integer time-discrete samples are windowed, floating-point numbers thus result after the windowing.
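A small sketch illustrates this effect. The sample values and the window length are arbitrary assumptions chosen for illustration.

```python
import math

N = 4
# Integer, PCM-like time-discrete samples (made-up values).
samples = [1000, -2000, 3000, 4000, -5000, 6000, 7000, -8000]

# Sine window of length 2N; the half-sample offset is an assumption.
window = [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]

windowed = [w * s for w, s in zip(window, samples)]
non_integer = [v for v in windowed if v != round(v)]

print(windowed)
print(len(non_integer), "of", len(windowed), "windowed samples are not integers")
```

Although every input sample is an integer, the windowed block that enters the transform is no longer integer-valued.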
Therefore, even when no psychoacoustic coder is used, i.e. when lossless coding is to be achieved, quantization is necessary at the output of means 408 or 410 to be able to perform reasonably manageable entropy coding.
When known transforms, as they have been described on the basis of FIG. 4a, are to be employed for lossless audio coding, either very fine quantization has to be employed to be able to neglect the resulting error due to rounding the floating-point numbers, or the error signal has to be additionally coded for example in the time domain.
Concepts of the former kind, i.e. in which the quantization is so finely adjusted that the resulting error due to the rounding of the floating-point numbers is negligible, are for example disclosed in the German patent DE 197 42 201 C1. Here, an audio signal is converted to its spectral representation and quantized to obtain quantized spectral values. The quantized spectral values are then inversely quantized, converted to the time domain, and compared with the original audio signal. If the error, i.e. the difference between the original audio signal and the quantized/inversely quantized audio signal, lies above an error threshold, the quantizer is adjusted more finely in a feedback loop, and the comparison is performed again. The iteration is terminated when the error falls below the error threshold. Any residual signal still present is coded with a time domain coder and written into a bit stream, which includes, apart from the time-domain-coded residual signal, also coded spectral values having been quantized according to the quantizer settings that were present at the time the iteration was terminated. It is to be noted that the quantizer does not have to be controlled by a psychoacoustic model, so that the coded spectral values are typically quantized more accurately than the psychoacoustic model would require.
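The feedback loop described above can be sketched as follows. This is not the method of DE 197 42 201 C1 itself but a minimal illustration under assumptions: an orthonormal DCT-IV, which is its own inverse, stands in for the filter bank; the quantizer is refined by halving a single step size; and the decoded signal is rounded to integers before the comparison. All names and values are hypothetical.

```python
import math


def dct4(x):
    """Orthonormal DCT-IV (its own inverse); stand-in for the filter bank."""
    N = len(x)
    s = math.sqrt(2.0 / N)
    return [s * sum(x[n] * math.cos(math.pi / N * (n + 0.5) * (k + 0.5))
                    for n in range(N))
            for k in range(N)]


def refine_quantizer(samples, threshold, step=64.0, max_iterations=64):
    """Halve the step size until the time-domain error is below threshold."""
    spectrum = dct4(samples)
    for _ in range(max_iterations):
        indices = [round(v / step) for v in spectrum]           # quantize
        decoded = [round(v) for v in dct4([i * step for i in indices])]
        residual = [s - d for s, d in zip(samples, decoded)]    # compare
        if max(abs(r) for r in residual) < threshold:
            return step, indices, residual
        step /= 2.0                                             # finer quantizer
    raise RuntimeError("error threshold not reached")


samples = [((37 * n) % 201) - 100 for n in range(16)]   # made-up integer block
step, indices, residual = refine_quantizer(samples, threshold=2)

# Spectral indices plus the time-domain residual allow lossless recovery.
decoded = [round(v) for v in dct4([i * step for i in indices])]
restored = [d + r for d, r in zip(decoded, residual)]
print(step, residual, restored == samples)
```

The sketch also makes the cost visible: every pass through the loop runs a complete inverse transform inside the coder.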
In the publication “A Design of Lossy and Lossless Scalable Audio Coding”, T. Moriya et al., Proc. ICASSP, 2000, a scalable coder is described, which includes e.g. an MPEG coder as first lossy data compression module, which receives a block-wise digital signal as its input signal and generates the compressed bit stream. In a local decoder, which is also present, the coding is reversed, and a coded/decoded signal is generated. This signal is compared with the original input signal by subtracting the coded/decoded signal from the original input signal. The error signal is then fed to a second module, where a lossless bit conversion is used. This conversion has two steps. The first step consists in a conversion from a two's complement format to a sign-magnitude format. The second step consists in a conversion from a vertical magnitude sequence to a horizontal bit sequence in a processing block. The lossless data conversion is executed to maximize the number of zeros or to maximize the number of successive zeros in a sequence, in order to achieve the best possible compression of the temporal error signal, which is present as a sequence of digital numbers. This principle is based on a bit slice arithmetic coding (BSAC) scheme illustrated in the publication “Multi-Layer Bit Sliced Bit Rate Scalable Audio Coder”, 103rd AES Convention, Preprint No. 4520, 1997.
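The two conversion steps can be sketched as follows. This is an illustration of the idea as summarized above, not the coder of the cited publication; the function names, the block of error samples and the word length of 4 bits are assumptions.

```python
def to_sign_magnitude(values):
    """Step 1: split signed error samples into sign bits and magnitudes."""
    signs = [1 if v < 0 else 0 for v in values]
    magnitudes = [abs(v) for v in values]
    return signs, magnitudes


def to_bit_planes(magnitudes, num_bits):
    """Step 2: regroup bits 'horizontally', one plane per bit position.

    Planes are ordered from the most significant to the least significant bit.
    """
    return [[(m >> b) & 1 for m in magnitudes]
            for b in range(num_bits - 1, -1, -1)]


def from_bit_planes(signs, planes):
    """Inverse conversion, showing that both steps are lossless."""
    num_bits = len(planes)
    magnitudes = [sum(plane[i] << (num_bits - 1 - p)
                      for p, plane in enumerate(planes))
                  for i in range(len(signs))]
    return [-m if s else m for s, m in zip(signs, magnitudes)]


error_block = [3, -1, 0, 2]          # made-up small error samples
signs, magnitudes = to_sign_magnitude(error_block)
planes = to_bit_planes(magnitudes, num_bits=4)

for plane in planes:
    print(plane)
print(from_bit_planes(signs, planes) == error_block)
```

For small error values the upper bit planes consist only of zeros, which is the property the conversion exploits for compression. In two's complement format, by contrast, a small negative value such as −1 has all of its upper bits set.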
A disadvantage of the above-described concepts is that the data for the lossless expansion layer, i.e. the ancillary data required to achieve lossless decoding of the audio signal, has to be obtained in the time domain. This means that complete decoding including a frequency/time conversion is required to obtain the coded/decoded signal in the time domain, so that the error signal can be calculated by means of a sample-wise difference formation between the original audio input signal and the coded/decoded audio signal, which is lossy due to the psychoacoustic coding. This concept is particularly disadvantageous in that the coder generating the audio data stream requires both complete time/frequency conversion means, such as a filter bank or e.g. an MDCT algorithm, for the forward transform and, at the same time, a complete inverse filter bank or a complete synthesis algorithm solely to generate the error signal. The coder thus, in addition to its inherent coder functionalities, also has to contain the complete decoder functionality. If the coder is implemented in software, both storage capacity and processor capacity are required for this, leading to a more costly coder implementation.