1. Field of the Invention
The present invention relates to encoding techniques and particularly to audio encoding techniques.
2. Description of the Related Art
Audio encoders, particularly those known under the keywords “mp3”, “AAC” or “mp3PRO”, have recently gained wide acceptance. They allow audio signals, which require a significant amount of data when present, for example, in PCM format on an audio CD, to be compressed to “tolerable” data rates suitable for transmitting the audio signals across channels with limited bandwidth. For example, transmitting data in the PCM format requires data rates of up to 1.4 Mbit/s, whereas “mp3”-encoded audio data already achieve high-quality stereo sound at data rates of 128 kbit/s.
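The data rates mentioned above can be checked with a short calculation. The following minimal sketch assumes the usual audio CD parameters (44.1 kHz sampling rate, 16 bits per sample, two channels), which are not stated in the text above; the function name is chosen for illustration only.

```python
# Assumed audio CD parameters (common values, not taken from the text above).
SAMPLE_RATE_HZ = 44_100
BITS_PER_SAMPLE = 16
CHANNELS = 2


def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Return the raw PCM data rate in bit/s."""
    return sample_rate_hz * bits_per_sample * channels


pcm_rate = pcm_bit_rate(SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS)  # 1_411_200 bit/s
mp3_rate = 128_000                                                  # 128 kbit/s
compression_ratio = pcm_rate / mp3_rate                             # 11.025

print(f"PCM: {pcm_rate / 1e6:.2f} Mbit/s, ratio to 128 kbit/s: {compression_ratio:.1f}")
```

Under these assumptions, the PCM rate is about 1.4 Mbit/s, so a 128 kbit/s stream corresponds to a reduction by a factor of roughly eleven.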
Further, spectral band replication (SBR) is a known method which significantly increases the efficiency of existing hearing-adapted perceptual audio encoders. The SBR technique is described in WO 98/57436 and implemented in the “mp3PRO” format. Here, good stereo quality is already achieved at data rates of 64 kbit/s.
The European Patent EP 0 846 375 B1 discloses a method and an apparatus for scalable encoding of audio signals. An audio signal is encoded via a first encoder to obtain the bit stream for the first encoder. This signal is then decoded again with a decoder adapted to the first encoder. The decoder output signal is supplied together with the delayed original audio signal to a differential stage to generate a differential signal. This differential signal is compared bandwise to the original audio signal in order to determine, for each spectral band, whether the energy of the differential signal is greater than the energy of the audio signal. If this is the case, the original audio signal will be supplied to a second encoder, while, when the energy of the differential signal is smaller than the energy of the original audio signal, the differential signal will be supplied to the second encoder. The second encoder is a transform encoder which operates based on a psychoacoustic model. Like the bit stream of the first encoder, the bit stream on the output side of the second encoder is also fed into a bit stream multiplexer, which provides a so-called scaled bit stream on the output side. In this connection, scalability means that a decoder is able, depending on its design, to extract from the bit stream either only the bit stream of the first encoder or both the bit stream of the first encoder and the bit stream of the second encoder, obtaining in the first case a lower-quality reproduction and in the second case a high-quality reproduction of the original audio signal.
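The bandwise decision described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of EP 0 846 375 B1: the function names, the representation of each band as a list of spectral values, and the handling of exactly equal energies (here the differential signal is chosen) are hypothetical.

```python
from typing import List, Sequence, Tuple


def band_energy(band: Sequence[float]) -> float:
    """Energy of one spectral band as the sum of squared values."""
    return sum(x * x for x in band)


def select_second_encoder_input(
    original_bands: Sequence[Sequence[float]],
    decoded_bands: Sequence[Sequence[float]],
) -> List[Tuple[str, List[float]]]:
    """For each band, pick what is passed on to the second encoder.

    If the differential signal has more energy than the original, the
    original band is passed on; otherwise the differential band is used.
    """
    selected = []
    for orig, dec in zip(original_bands, decoded_bands):
        diff = [o - d for o, d in zip(orig, dec)]
        if band_energy(diff) > band_energy(orig):
            selected.append(("original", list(orig)))
        else:
            # Assumption: equal energies also fall into this branch.
            selected.append(("difference", diff))
    return selected


# Hypothetical two-band example: the first encoder reproduces band 0 well
# and band 1 poorly.
original = [[1.0, 1.0], [0.1, 0.1]]
decoded = [[0.9, 1.1], [-0.5, 0.5]]
choices = select_second_encoder_input(original, decoded)
print([kind for kind, _ in choices])
```

In this example the differential signal is selected for the well-reproduced band and the original signal for the poorly reproduced band.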
A typical transform-based encoder is illustrated in FIG. 4a. The audio signal is supplied to an analysis filter bank 400, which at its input forms blocks, each with a certain number of samples, from the stream of sample values via blocking and windowing, respectively, and converts each block into a spectral representation. The spectral coefficients and subband signals, respectively, generated at the output of the analysis filter bank are quantized. The quantizer step width depends on several factors. A significant factor is the psychoacoustic masking threshold, which is calculated by a psychoacoustic model 402 from the original audio signal. The quantizer in the block “quantizing and encoding 404” will always try to quantize as coarsely as possible in order to obtain good compression. On the other hand, it will also try to quantize finely enough that the quantizing noise introduced by the quantizing lies below the psychoacoustic masking threshold provided by block 402, as is known in the art. The spectral values quantized in this way are then subjected to entropy encoding, typically Huffman encoding, which typically operates with predefined Huffman code books and Huffman code tables, respectively. The entropy-encoded quantized spectral values are then available at the output of block 404 and are written via block 406 into a bit stream 408 together with the side information required for decoding. This bit stream can be stored or, depending on the field of application, transmitted across a transmission channel to a decoder, which is illustrated in FIG. 4b. The decoder first comprises a block 410 for reading the bit stream in order to extract, on the one hand, the side information and, on the other hand, the entropy-encoded quantized spectral values from the bit stream.
Then, the entropy-encoded quantized spectral values are first supplied to an entropy decoding and then to an inverse quantizing to obtain inverse-quantized spectral values (block 412), which are then supplied to a synthesis filter bank 414 adapted to the analysis filter bank 400 in order to obtain a time-discrete decoded audio signal on the output side. This time-discrete audio signal at the output of the synthesis filter bank can then be supplied to a loudspeaker after appropriate interpolation and digital/analog conversion and, if necessary, amplification, and thereby be made audible.
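The trade-off between coarse and fine quantizing described for block 404 and its inverse in block 412 can be illustrated with a simple uniform quantizer. The sketch below rests on assumptions that are not part of the text above: it uses the common approximation that a uniform quantizer with step width Δ introduces noise of power Δ²/12, treats the masking threshold as a single noise power value, and uses hypothetical function names. Actual encoders such as mp3 or AAC use different, non-uniform quantizers.

```python
import math
from typing import List, Sequence


def coarsest_step(masking_threshold_power: float) -> float:
    """Largest step width whose approximate noise power (step**2 / 12)
    does not exceed the given masking threshold (assumed model)."""
    return math.sqrt(12.0 * masking_threshold_power)


def quantize(values: Sequence[float], step: float) -> List[int]:
    """Map spectral values to integer quantizer indices."""
    return [round(v / step) for v in values]


def dequantize(indices: Sequence[int], step: float) -> List[float]:
    """Inverse quantizing: map indices back to spectral values."""
    return [i * step for i in indices]


# Hypothetical spectral values and masking threshold for one band.
spectral_values = [1.0, -0.25, 0.4]
step = coarsest_step(0.03)
indices = quantize(spectral_values, step)
reconstructed = dequantize(indices, step)
errors = [v - r for v, r in zip(spectral_values, reconstructed)]
print(step, indices, errors)
```

A higher masking threshold permits a larger step width and therefore fewer distinct indices to be entropy-encoded, while each reconstruction error stays within half a step width.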
Block-based encoders/decoders, as used in the known scenario shown in FIGS. 4a and 4b, are based on the fact that typically a block of time-discrete samples of the audio signal, such as 1024 or 2048 samples in the case of an MDCT with overlap and add as known in the art, is converted into the spectral range. Even with filter banks of lower frequency resolution, such as the SBR filter bank with 64 channels, a block with a certain number of samples is always used and converted into a spectral representation, namely here the individual subband signals. Then, as has been discussed, the spectral representation is quantized accordingly, typically with the help of a psychoacoustic model, which calculates the psychoacoustic masking threshold in the way known in the art.
Such transforms inherently have a certain time/frequency resolution. This means that when a large number of samples is inserted into a block, a transform applied to the block inherently has a high frequency resolution. On the other hand, the time resolution is reduced accordingly. If shorter portions of the audio signal were converted into the spectral range in order to increase the time resolution, the frequency resolution would suffer correspondingly.
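The inverse relationship between time resolution and frequency resolution can be made concrete with a small calculation. The sketch below assumes a DFT-style transform in which a block of N samples at sampling rate fs spans N/fs seconds and yields a frequency spacing of fs/N; the sampling rate of 48 kHz and the block lengths are example values, not values taken from the text above.

```python
from typing import Tuple


def block_resolution(block_length: int, sample_rate_hz: float) -> Tuple[float, float]:
    """Return (frequency spacing in Hz, block duration in seconds)
    for a DFT-style transform of one block (assumed model)."""
    frequency_spacing_hz = sample_rate_hz / block_length
    block_duration_s = block_length / sample_rate_hz
    return frequency_spacing_hz, block_duration_s


SAMPLE_RATE_HZ = 48_000.0  # assumed example value

for n in (2048, 256):
    df, dt = block_resolution(n, SAMPLE_RATE_HZ)
    # The product df * dt is always 1: a gain in one resolution is
    # paid for by an equal loss in the other.
    print(f"N={n}: {df:.2f} Hz spacing, {dt * 1000:.2f} ms per block")
```

Under these assumptions, shortening the block by a factor of eight makes the time resolution eight times finer and the frequency resolution eight times coarser.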
It is therefore a problem that audio signals can only be considered stationary for very short time periods. There are also short-term strong energy increases, called transients, during which the audio signal is not stationary.
In order to address this problem of time/frequency resolution, block switching, which is controlled by a transient detector, is used for example in the AAC encoder (AAC=advanced audio coding). Here, the audio signal to be encoded is examined prior to windowing and blocking, respectively, in order to determine whether the audio signal has such a transient or not. If a transient is detected, short blocks are used for encoding. If, however, a signal section without a transient is detected, a long block length is used. Thus, in such common transform encoding methods, block switching is used for adapting the transform length to the signal. Particularly when low bit rates are to be achieved, very long transform lengths are preferably used, since the amount of side information per block is typically relatively independent of the block length. This means that the amount of side information is mostly the same regardless of whether the block represents a large number of time samples of the audio signal or whether the block is short, i.e. represents a small number of samples, so that the ratio of side information to useful information becomes more favorable as the block length increases. Thus, for reasons of encoding efficiency, one always aims at using block lengths that are as great as possible, and great transform lengths in a transform encoder, respectively.
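Block switching and the effect of the block length on the share of side information can be sketched as follows. This is a simplified illustration and not the transient detector of the AAC encoder: the energy-ratio criterion, the threshold, the segment count, the block lengths and all names are hypothetical, as are the bit counts in the example.

```python
from typing import Sequence

LONG_BLOCK = 2048   # assumed long block length
SHORT_BLOCK = 256   # assumed short block length


def has_transient(samples: Sequence[float], segments: int = 8,
                  ratio_threshold: float = 10.0) -> bool:
    """Hypothetical detector: report a transient when the energy of one
    segment exceeds the energy of the preceding segment by a factor."""
    seg_len = len(samples) // segments
    energies = [
        sum(x * x for x in samples[i * seg_len:(i + 1) * seg_len])
        for i in range(segments)
    ]
    floor = 1e-12  # avoids comparing against an exactly silent segment
    return any(cur > ratio_threshold * max(prev, floor)
               for prev, cur in zip(energies, energies[1:]))


def choose_block_length(samples: Sequence[float]) -> int:
    """Short blocks for transient sections, long blocks otherwise."""
    return SHORT_BLOCK if has_transient(samples) else LONG_BLOCK


def side_info_share(side_bits_per_block: float, payload_bits_per_sample: float,
                    block_length: int) -> float:
    """Fraction of a block's bits spent on side information, assuming a
    side information amount that does not depend on the block length."""
    payload_bits = payload_bits_per_sample * block_length
    return side_bits_per_block / (side_bits_per_block + payload_bits)


stationary = [0.1] * 2048                # constant-energy section
attack = [0.0] * 1024 + [1.0] * 1024     # sudden energy increase
print(choose_block_length(stationary), choose_block_length(attack))

# Hypothetical bit budget: 100 side-information bits per block,
# 1.5 payload bits per sample.
print(side_info_share(100, 1.5, LONG_BLOCK), side_info_share(100, 1.5, SHORT_BLOCK))
```

With these example figures, the side information accounts for roughly 3 percent of the bits in a long block and roughly 21 percent in a short block, which illustrates why long blocks are preferred at low bit rates.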
On the other hand, transient detection and switching to short windows when non-stationary ranges of the audio signal appear require a processing effort that has to be accepted. Even with this effort, however, the signal in its encoded form still exists either only with good frequency resolution or only with good time resolution.