1. Field of the Invention
The present invention relates to coders for encoding a signal including audio and/or video information, and in particular to the estimation of a need for information units for encoding this signal.
2. Description of the Related Art
The prior art coder will be presented below. An audio signal to be coded is supplied at an input 1000. This audio signal is initially fed to a scaling stage 1002, wherein so-called AAC gain control is conducted to establish the level of the audio signal. Side information from the scaling is supplied to a bit stream formatter 1004, as is represented by the arrow located between block 1002 and block 1004. The scaled audio signal is then supplied to an MDCT filter bank 1006. With the AAC coder, the filter bank implements a modified discrete cosine transformation with 50% overlapping windows, the window length being determined by a block 1008.
Generally speaking, block 1008 serves to window transient signals with relatively short windows and to window signals which tend to be stationary with relatively long windows. The relatively short windows achieve a higher time resolution (at the expense of frequency resolution) for transient signals, whereas the longer windows achieve a higher frequency resolution (at the expense of time resolution) for signals which tend to be stationary; longer windows are generally preferred, since they result in a higher coding gain. At the output of filter bank 1006, blocks of spectral values—the blocks being successive in time—are present, which may be MDCT coefficients, Fourier coefficients or subband signals, depending on the implementation of the filter bank, each subband signal having a specific limited bandwidth specified by the respective subband channel in filter bank 1006, and each subband signal having a specific number of subband samples.
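The window length decision of block 1008 may be sketched as follows. This is a minimal, hypothetical transient detector in Python; the subblock size and attack ratio are assumptions chosen purely for illustration, not values from the AAC standard, which specifies its own attack detection:

```python
def choose_window(samples, frame=1024, sub=128, attack_ratio=10.0):
    """Pick 'short' windows for transient frames, 'long' otherwise.

    Hypothetical detector: compare energies of consecutive subblocks;
    a sharp energy jump suggests a transient attack.
    """
    energies = [sum(x * x for x in samples[i:i + sub])
                for i in range(0, frame, sub)]
    for prev, cur in zip(energies, energies[1:]):
        if cur > attack_ratio * max(prev, 1e-12):
            return "short"
    return "long"
```

A stationary frame thus yields "long", while a frame containing a sudden attack yields "short", reflecting the trade-off between time and frequency resolution described above.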
What follows is a presentation, by way of example, of the case wherein the filter bank outputs temporally successive blocks of MDCT spectral coefficients which, generally speaking, represent successive short-term spectra of the audio signal to be coded at input 1000. A block of MDCT spectral values is then fed into a TNS processing block 1010 (TNS=temporal noise shaping), wherein temporal noise shaping is performed. The TNS technique is used to shape the temporal form of the quantization noise within each window of the transformation. This is achieved by applying a filtering process to parts of the spectral data of each channel. Coding is performed on a window basis. In particular, the following steps are performed to apply the TNS tool to a window of spectral data, i.e. to a block of spectral values.
Initially, a frequency range for the TNS tool is selected. A suitable selection comprises covering, with one filter, the frequency range from 1.5 kHz up to the highest possible scale factor band. It shall be pointed out that this frequency range depends on the sampling rate, as is specified in the AAC standard (ISO/IEC 14496-3: 2001 (E)).
Subsequently, an LPC calculation (LPC=linear predictive coding) is performed, to be precise using the spectral MDCT coefficients present in the selected target frequency range. For increased stability, coefficients which correspond to frequencies below 2.5 kHz are excluded from this process. Common LPC procedures as are known from speech processing may be used for LPC calculation, for example the known Levinson-Durbin algorithm. The calculation is performed for the maximally admissible order of the noise-shaping filter.
As a result of the LPC calculation, the expected prediction gain PG is obtained. In addition, the reflection coefficients, or Parcor coefficients, are obtained.
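The Levinson-Durbin recursion mentioned above, together with the prediction gain PG and the reflection (Parcor) coefficients it yields, may be sketched as follows. This is a textbook formulation in Python, not code from the standard; it operates on an autocorrelation sequence r computed over the selected spectral coefficients:

```python
def levinson_durbin(r, order):
    """Levinson-Durbin recursion on autocorrelation values r[0..order].

    Returns the prediction filter coefficients a (a[0] = 1), the
    reflection (Parcor) coefficients k, and the expected prediction
    gain PG = r[0] / E, E being the final prediction error energy.
    """
    a = [1.0] + [0.0] * order
    k = [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        # correlation of the current prediction error with r[i]
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        ki = -acc / err
        k[i - 1] = ki
        # update the prediction coefficients for order i
        a_new = a[:]
        for j in range(1, i):
            a_new[j] = a[j] + ki * a[i - j]
        a_new[i] = ki
        a = a_new
        err *= (1.0 - ki * ki)
    pg = r[0] / err
    return a, k, pg
```

For a first-order signal with autocorrelation r = [1.0, 0.9], for example, the recursion yields a single reflection coefficient of −0.9 and a correspondingly high prediction gain, indicating that TNS filtering is worthwhile.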
If the prediction gain does not exceed a specific threshold, the TNS tool is not applied. In this case, a piece of control information is written into the bit stream so that a decoder knows that no TNS processing has been performed.
However, if the prediction gain exceeds a threshold, TNS processing is applied.
In a next step, the reflection coefficients are quantized. The order of the noise-shaping filter used is determined by removing all reflection coefficients having an absolute value smaller than a threshold from the “tail” of the array of reflection coefficients. The number of remaining reflection coefficients is the order of the noise-shaping filter. A suitable threshold is 0.1.
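The order determination described above may be sketched as follows; the function name is hypothetical, while the tail-trimming rule and the threshold of 0.1 follow the description:

```python
def trim_reflection_coeffs(k, threshold=0.1):
    """Remove reflection coefficients with magnitude below the threshold
    from the tail of the array; the number of coefficients that remain
    is the order of the noise-shaping filter."""
    order = len(k)
    while order > 0 and abs(k[order - 1]) < threshold:
        order -= 1
    return k[:order]
```

Note that only the tail is trimmed: a small coefficient followed by a large one is retained, since the filter order is determined by the last significant coefficient.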
The remaining reflection coefficients are typically converted into linear prediction coefficients, this technique also being known as “step-up” procedure.
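The “step-up” conversion may be sketched as the following textbook recursion (a hypothetical Python helper, consistent with the Levinson-Durbin formulation):

```python
def step_up(k):
    """Convert reflection (Parcor) coefficients into linear prediction
    coefficients a[0..m], with a[0] = 1 ('step-up' recursion)."""
    a = [1.0]
    for i, ki in enumerate(k, start=1):
        # extend the order-(i-1) coefficient set to order i
        a_new = a + [0.0]
        for j in range(1, i):
            a_new[j] = a[j] + ki * a[i - j]
        a_new[i] = ki
        a = a_new
    return a
```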
The LPC coefficients calculated are then used as coder noise-shaping filter coefficients, i.e. as prediction filter coefficients. This FIR filter is used for filtering in the specified target frequency range. An autoregressive filter is used in decoding, whereas a so-called moving average filter is used in coding. Finally, the side information for the TNS tool is supplied to the bit stream formatter, as is represented by the arrow shown between the TNS processing block 1010 and the bit stream formatter 1004 in FIG. 3.
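The encoder-side filtering of the spectral coefficients with this moving average (FIR) filter may be sketched as follows; the function name and the start/stop delimitation of the target frequency range are illustrative assumptions:

```python
def tns_encode_filter(spec, a, start, stop):
    """Apply the noise-shaping FIR filter y[k] = sum_j a[j] * x[k-j]
    (with a[0] = 1) to the spectral lines in the target range
    [start, stop); lines outside the range are passed through."""
    out = list(spec)
    for k in range(start, stop):
        acc = 0.0
        for j in range(len(a)):
            if k - j >= start:  # the filter does not reach outside the range
                acc += a[j] * spec[k - j]
        out[k] = acc
    return out
```

Filtering an exponentially decaying set of spectral lines with its matching first-order predictor, for instance, flattens the lines to near zero, which is precisely the prediction gain exploited by the TNS tool.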
Then, several optional tools not shown in FIG. 3 are passed through, such as a long-term prediction tool, an intensity/coupling tool, a prediction tool and a noise substitution tool, until eventually a mid/side coder 1012 is reached. The mid/side coder 1012 is active when the audio signal to be coded is a multi-channel signal, i.e. a stereo signal having a left-hand channel and a right-hand channel. Up to now, i.e. upstream from block 1012 in FIG. 3, the left-hand and right-hand stereo channels have been processed, i.e. scaled, transformed by the filter bank, subjected to TNS processing or not, etc., separately from one another.
In the mid/side coder, verification is initially performed as to whether mid/side coding makes sense at all, i.e. whether it will yield a coding gain. Mid/side coding will yield a coding gain if the left-hand and right-hand channels tend to be similar, since in this case the mid channel, i.e. the sum of the left-hand and right-hand channels, is almost equal to the left-hand channel or the right-hand channel, apart from scaling by a factor of ½, whereas the side channel, being equal to the difference between the left-hand and right-hand channels, has only very small values. When the left-hand and right-hand channels are approximately the same, the difference is thus approximately zero, or includes only very small values which—this is the hope—will be quantized to zero in a subsequent quantizer 1014, and which may thus be transmitted in a very efficient manner, since an entropy coder 1016 is connected downstream from quantizer 1014.
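The mid/side transformation and a simple decision proxy may be sketched as follows; the energy-ratio criterion and its tuning constant are assumptions for illustration, as an actual coder would decide per band using the psycho-acoustic model:

```python
def mid_side(left, right):
    """M = (L + R) / 2, S = (L - R) / 2."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def prefer_mid_side(left, right, ratio=0.1):
    """Hypothetical decision proxy: use M/S coding when the side-channel
    energy is well below the mid-channel energy, i.e. the channels are
    similar; the ratio is an assumed tuning constant."""
    mid, side = mid_side(left, right)
    e_m = sum(x * x for x in mid)
    e_s = sum(x * x for x in side)
    return e_s < ratio * e_m
```

For nearly identical channels the side signal collapses toward zero and M/S coding is chosen; for anti-correlated channels the side channel carries all the energy and left/right coding is retained.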
Quantizer 1014 is supplied with an admissible interference per scale factor band by a psycho-acoustic model 1020. The quantizer operates in an iterative manner, i.e. an outer iteration loop is initially called up, which will then call up an inner iteration loop. Generally speaking, starting from quantizer step-size starting values, a quantization of a block of values is initially performed at the input of quantizer 1014. In particular, the inner loop quantizes the MDCT coefficients, a specific number of bits being consumed in the process. The outer loop calculates the distortion and modifies the energy of the coefficients by means of the scale factors, so as to call up the inner loop again. This process is iterated until a specific termination condition is met. For each iteration in the outer iteration loop, the signal is reconstructed so as to calculate the interference introduced by the quantization, and to compare it with the permitted interference supplied by the psycho-acoustic model 1020. In addition, the scale factors of those frequency bands which after this comparison still are considered to be interfered with are enlarged by one or more stages from iteration to iteration, to be precise for each iteration of the outer iteration loop.
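The interplay of the two loops may be sketched in strongly simplified form. In this sketch, a uniform quantizer stands in for the AAC power-law quantizer, a crude bit count stands in for the Huffman tables, and a single global step size stands in for the per-band scale factors; all of these are assumptions for illustration only:

```python
import math

def quantize(spec, step):
    """Uniform quantization of spectral values (simplification)."""
    return [round(x / step) for x in spec]

def count_bits(q):
    """Crude proxy for the entropy coder: sign plus magnitude bits
    per nonzero quantized line (hypothetical)."""
    bits = 0
    for v in q:
        if v != 0:
            bits += 2 + int(math.log2(abs(v)))
    return bits

def iterative_quantize(spec, allowed_noise, max_bits, max_iter=64):
    step = 1.0
    for _ in range(max_iter):
        # inner loop: enlarge the step size until the bit budget is met
        q = quantize(spec, step)
        while count_bits(q) > max_bits:
            step *= 2 ** 0.25
            q = quantize(spec, step)
        # outer loop: reconstruct the signal and compare the introduced
        # interference with the admissible interference
        rec = [v * step for v in q]
        err = sum((x - y) ** 2 for x, y in zip(spec, rec))
        if err <= allowed_noise:
            break
        step /= 2 ** 0.125  # quantize more finely in the next pass
    return q, step
```

The analysis-by-synthesis character is visible in the outer loop: the signal is reconstructed purely in order to measure the quantization interference and steer the next pass.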
Once a situation is reached wherein the quantization interference introduced is below the permitted interference determined by the psycho-acoustic model, and wherein at the same time the bit requirements are met, namely that a maximum bit rate not be exceeded, the iteration, i.e. the analysis-by-synthesis method, is terminated. The scale factors obtained are coded as is illustrated in block 1014 and are supplied, in coded form, to bit stream formatter 1004, as is marked by the arrow drawn between block 1014 and block 1004. The quantized values are then supplied to entropy coder 1016, which typically performs entropy coding for various scale factor bands using several Huffman-code tables, so as to translate the quantized values into a binary format. As is known, entropy coding in the form of Huffman coding involves falling back on code tables which are created on the basis of expected signal statistics, and wherein frequently occurring values are given shorter code words than less frequently occurring values. The entropy-coded values are then supplied, as the actual main information, to bit stream formatter 1004, which then outputs the coded audio signal at the output side in accordance with a specific bit stream syntax.
Data reduction of audio signals is by now an established technique and the subject of a series of international standards (e.g. ISO/MPEG-1, MPEG-2 AAC, MPEG-4).
The above-mentioned methods have in common that the input signal is turned into a compact, data-reduced representation by means of a so-called encoder, taking advantage of perception-related effects (psychoacoustics, psychooptics). To this end, a spectral analysis of the signal is usually performed, and the corresponding signal components are quantized, taking a perception model into account, and then encoded as a so-called bit stream in as compact a manner as possible.
In order to estimate, prior to the actual quantization, how many bits a certain signal portion to be encoded will require, the so-called perceptual entropy (PE) may be employed. The PE also provides a measure for how difficult it is for the encoder to encode a certain signal or parts thereof.
The deviation of the PE from the number of actually required bits is crucial for the quality of the estimation. Furthermore, the perceptual entropy and/or any estimate of a need for information units for encoding a signal may be employed to estimate whether the signal is transient or stationary, since transient signals require more bits for encoding than rather stationary signals. The estimation of a transient property of a signal is used, for example, to perform the window length decision indicated in block 1008 in FIG. 3.
In FIG. 6, the perceptual entropy is illustrated as calculated according to ISO/IEC IS 13818-7 (MPEG-2 advanced audio coding (AAC)). The equation illustrated in FIG. 6 is used for the calculation of this perceptual entropy, that is to say a band-wise perceptual entropy. In this equation, the parameter pe represents the perceptual entropy, width(b) represents the number of spectral coefficients in the respective band b, and e(b) is the energy of the signal in this band. Finally, nb(b) is the corresponding masking threshold or, more generally, the admissible interference that may be introduced into the signal, for example by quantization, such that a human listener nevertheless hears no, or only a negligible, interference.
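The precise equation is shown in FIG. 6; a common formulation of the band-wise perceptual entropy, used here as an assumption for illustration, is pe = Σ_b width(b) · log2(e(b)/nb(b)), summed over the bands whose energy exceeds the admissible interference:

```python
import math

def perceptual_entropy(width, energy, thr):
    """Band-wise PE; assumed form pe = sum_b width(b) * log2(e(b)/nb(b)),
    counting only bands whose energy e(b) exceeds the admissible
    interference nb(b)."""
    pe = 0.0
    for w, e, nb in zip(width, energy, thr):
        if nb > 0 and e > nb:
            pe += w * math.log2(e / nb)
    return pe
```

Intuitively, each band contributes the number of spectral lines times the number of bits needed to resolve the signal energy down to the masking threshold.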
The bands may originate from the band division of the psychoacoustic model (block 1020 in FIG. 3), or they may be the so-called scale factor bands (scfb) used in the quantization. The psychoacoustic masking threshold is the energy value the quantization error should not exceed.
The illustration shown in FIG. 6 thus shows how well a perceptual entropy determined in this way functions as an estimate of the number of bits required for the coding. To this end, the respective perceptual entropy was plotted against the bits actually used, for every individual block, for the example of an AAC coder at different bit rates. The test piece used contains a typical mixture of music, speech, and individual instruments.
Ideally, the points would gather along a straight line through the origin. The spread of the point cloud, with its deviations from this ideal line, makes the inaccuracy of the estimation clear.
Thus, what is disadvantageous in the concept shown in FIG. 6 is this deviation. If, e.g., too high a value for the perceptual entropy arises, the quantizer is signaled that more bits are needed than are actually required. As a result, the quantizer quantizes too finely, i.e. it does not exhaust the measure of admissible interference, which results in reduced coding gain. On the other hand, if the value for the perceptual entropy is determined too small, the quantizer is signaled that fewer bits are needed for encoding the signal than are actually required. This in turn results in the quantizer quantizing too coarsely, which would immediately lead to an audible interference in the signal, should no countermeasures be taken. One such countermeasure is that the quantizer runs through one or more further iteration loops, which increases the computation time of the coder.
For improving the calculation of the perceptual entropy, a constant term, such as 1.5, could be introduced into the logarithmic expression, as is shown in FIG. 7. A better result is then obtained, i.e. a smaller upward or downward deviation: taking a constant term in the logarithmic expression into account indeed reduces the cases in which the perceptual entropy signals too optimistic a bit need. On the other hand, it can be seen clearly from FIG. 7 that too high a number of bits is still signaled significantly, which leads to the quantizer always quantizing too finely, i.e. the bit need is assumed to be greater than it actually is, which in turn results in reduced coding gain. The constant in the logarithmic expression is a coarse estimate of the bits required for the side information.
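Assuming the constant term enters the logarithm as pe = Σ_b width(b) · log2(c + e(b)/nb(b)) with c = 1.5 (the exact equation is the one shown in FIG. 7), the modified band-wise calculation may be sketched as:

```python
import math

def perceptual_entropy_const(width, energy, thr, c=1.5):
    """Band-wise PE with an assumed constant term in the logarithm:
    pe = sum_b width(b) * log2(c + e(b) / nb(b)). Bands whose lines
    are quantized to zero thus still contribute roughly log2(c) bits
    per line, accounting for the cost of transmitting zero values."""
    pe = 0.0
    for w, e, nb in zip(width, energy, thr):
        if nb > 0:
            pe += w * math.log2(c + e / nb)
    return pe
```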
Thus, inserting a constant term into the logarithmic expression indeed provides an improvement over the band-wise perceptual entropy as illustrated in FIG. 6, since the bands with a very small distance between energy and masking threshold are more likely to be taken into account; after all, a certain number of bits is also required for the transmission of spectral coefficients quantized to zero.
A further, but very computation-time-intensive, calculation of the perceptual entropy is illustrated in FIG. 8, wherein the perceptual entropy is calculated in a line-wise manner. Here, instead of the band energy, the individual spectral coefficients X(k) are employed, wherein kOffset(b) designates the first index of band b. When comparing FIG. 8 to FIG. 7, a reduction in the upward “excursions” can be seen clearly in the range from 2,000 to 3,000 bits. The PE estimation therefore is more accurate, i.e. it does not estimate too pessimistically but rather lies at the optimum, so that the coding gain may increase in comparison with the calculation methods shown in FIGS. 6 and 7, and/or the number of iterations in the quantizer is reduced.
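A plausible form of the line-wise calculation, used here purely as an assumption since the exact equation is the one shown in FIG. 8, compares each spectral line X(k) against the per-line share nb(b)/width(b) of the band threshold:

```python
import math

def perceptual_entropy_linewise(X, k_offset, thr, c=1.5):
    """Line-wise PE (assumed form): each spectral line X(k) is compared
    against the per-line share nb(b)/width(b) of the band threshold;
    kOffset(b) is the first index of band b, kOffset(b+1) the first
    index of the next band."""
    pe = 0.0
    for b in range(len(k_offset) - 1):
        lo, hi = k_offset[b], k_offset[b + 1]
        if thr[b] <= 0:
            continue
        nb_line = thr[b] / (hi - lo)
        for k in range(lo, hi):
            pe += math.log2(c + X[k] * X[k] / nb_line)
    return pe
```

The inner loop over every spectral line, rather than one logarithm per band, is exactly where the higher computation outlay arises.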
The computation time required to evaluate the equation shown in FIG. 8 is, however, disadvantageous in the line-wise calculation of the perceptual entropy.
Such computation-time disadvantages do not necessarily play any role if the coder runs on a powerful PC or a powerful workstation. But things look completely different if the coder is accommodated in a portable device, such as a cellular UMTS telephone, which on the one hand has to be small and inexpensive, and on the other hand must have low power consumption and additionally must work quickly, in order to enable the coding of an audio or video signal transmitted via the UMTS connection.