1. Field of the Invention
The present invention relates to audio coding in general and, in particular, to audio coding allowing audio signals to be coded with a short delay time.
2. Description of the Related Art
The audio compression method best known at present is MPEG-1 Layer III. With this compression method, the sample or audio values of an audio signal are coded into a coded signal in a lossy manner. Put differently, irrelevance and redundancy of the original audio signal are reduced or, ideally, removed during compression. In order to achieve this, simultaneous and temporal masking effects are identified by means of a psycho-acoustic model, i.e. a temporally varying masking threshold depending on the audio signal is calculated which indicates the volume above which tones of a certain frequency become perceivable to human hearing. This information is in turn used for coding the signal by quantizing the spectral values of the audio signal more precisely, less precisely or not at all, depending on the masking threshold, and integrating same into the coded signal.
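The masking-threshold-dependent quantization described above may be sketched as follows. This is a minimal illustration, not the actual MP3 algorithm: the per-line quantizer step size is simply set proportional to the masking threshold (a real encoder uses scale factor bands and an iterative rate/distortion loop), and the spectral and threshold values are invented for illustration.

```python
import numpy as np

def quantize_spectrum(spectrum, mask):
    """Quantize each spectral value with a step size proportional to the
    masking threshold of its frequency: where much noise is masked, the
    step is coarse and small components round to zero."""
    return np.round(spectrum / mask)

def dequantize_spectrum(indices, mask):
    """Decoder side: scale the quantizer indices back to spectral values."""
    return indices * mask

spectrum = np.array([10.0, 0.3, -4.0, 0.05])   # hypothetical spectral values
mask = np.array([1.0, 1.0, 0.5, 0.2])          # hypothetical masking threshold
indices = quantize_spectrum(spectrum, mask)    # inaudible components become 0
reconstructed = dequantize_spectrum(indices, mask)
# the quantization error of each spectral line stays below half its step size
assert np.all(np.abs(reconstructed - spectrum) <= mask / 2)
```

Components lying below the masking threshold quantize to zero and cost virtually no bits, while audible components are represented with an error that remains under the threshold.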
Audio compression methods, such as, for example, the MP3 format, reach the limits of their applicability when audio data is to be transferred via a bit rate-limited transmission channel in compressed form, on the one hand, but with as small a delay time as possible, on the other hand. In some applications the delay time plays no role, such as, for example, when archiving audio information. Small-delay audio coders, sometimes referred to as "ultra low delay coders", are, however, necessary wherever time-critical audio signals are to be transmitted, such as, for example, in tele-conferencing or in wireless loudspeakers or microphones. For these fields of application, the article by Schuller, G. et al., "Perceptual Audio Coding Using Adaptive Pre- and Post-Filters and Lossless Compression", IEEE Transactions on Speech and Audio Processing, vol. 10, no. 6, September 2002, pp. 379-390, suggests an audio coding where the irrelevance reduction and the redundancy reduction are not performed based on a single transform, but on two separate transforms.
The principle will be discussed subsequently referring to FIGS. 12 and 13. Coding starts with an audio signal 902 which has already been sampled and is thus already present as a sequence 904 of audio or sample values 906, the temporal order of the audio values 906 being indicated by an arrow 908. A masking threshold is calculated by means of a psycho-acoustic model for successive blocks of audio values 906, the blocks being characterized by an ascending numeration "block#". FIG. 13, for example, shows a diagram in which, over the frequency f, graph a plots the spectrum of a signal block of 128 audio values 906 and graph b plots the masking threshold, as calculated by a psycho-acoustic model, in logarithmic units. The masking threshold indicates, as has already been mentioned, up to which intensity frequencies remain inaudible for the human ear, namely all tones below the masking threshold b. Based on the masking thresholds calculated for each block, an irrelevance reduction is achieved by controlling a parameterizable filter, followed by a quantizer. For the parameterizable filter, a parameterization is calculated such that the frequency response thereof corresponds to the inverse of the magnitude of the masking threshold. This parameterization is indicated in FIG. 12 by x#(i).
After filtering the audio values 906, quantization with a constant step size takes place, such as, for example, a rounding operation to the next integer. The quantizing noise caused thereby is white noise. On the decoder side, the filtered signal is "retransformed" by a parameterizable filter, the transfer function of which is set to the magnitude of the masking threshold itself. Not only is the filtered signal decoded again thereby, but the quantizing noise on the decoder side is also adjusted to the shape of the masking threshold. In order for the quantizing noise to correspond to the masking threshold as precisely as possible, an amplification value a# applied to the filtered signal before quantizing is calculated on the coder side for each parameter set or parameterization. In order for the retransform to be performed on the decoder side, the amplification value a and the parameterization x are transferred to the decoder as side information 910 apart from the actual main data, namely the quantized filtered audio values 912. For the redundancy reduction 914, this data, i.e. the side information 910 and the main data 912, is subjected to a loss-free compression, namely an entropy coding, whereby the coded signal is obtained.
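The coder/decoder chain just described may be sketched as follows. For brevity, the parameterizable time-domain filters of the article are replaced here by a hypothetical per-block frequency-domain weighting with the threshold magnitudes; the masking-threshold values and the amplification value are likewise invented for illustration.

```python
import numpy as np

def encode_block(block, mask, gain):
    """Pre-filter: weight the spectrum with the inverse of the
    masking-threshold magnitude, apply the amplification value a#,
    then quantize with a constant step size (rounding)."""
    spectrum = np.fft.rfft(block)
    prefiltered = np.fft.irfft(spectrum / mask, n=len(block))
    return np.round(gain * prefiltered).astype(int)

def decode_block(quantized, mask, gain):
    """Post-filter: undo the amplification and weight the spectrum with
    the masking-threshold magnitude itself; this decodes the signal and
    simultaneously shapes the white quantizing noise like the threshold."""
    spectrum = np.fft.rfft(quantized / gain)
    return np.fft.irfft(spectrum * mask, n=len(quantized))

rng = np.random.default_rng(0)
block = rng.standard_normal(128)        # one block of 128 audio values
mask = np.linspace(1.0, 4.0, 65)        # hypothetical threshold magnitudes
gain = 8.0                              # hypothetical amplification value a#
decoded = decode_block(encode_block(block, mask, gain), mask, gain)
error = np.max(np.abs(decoded - block))  # small residue: shaped quantizing noise
```

The only information lost is the rounding error, which the post-filter spreads according to the masking-threshold shape, so the reconstruction deviates from the input by a small, ideally inaudible, residue.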
The above-mentioned article suggests a block size of 128 sample values 906. This allows a relatively short delay of 8 ms at a sampling rate of 32 kHz. With regard to the detailed implementation, the article additionally states that, for increasing the efficiency of the side information coding, the side information, namely the coefficients x# and a#, will only be transferred if there are sufficient changes compared to a parameter set transferred before, i.e. if the changes exceed a certain threshold value. In addition, it is described that the implementation is preferably performed such that a current parameter set is not applied directly to all the sample values belonging to the respective block, but that a linear interpolation of the filter coefficients x# is used to avoid audible artifacts. In order to perform the linear interpolation of the filter coefficients, a lattice structure is suggested for the filter so as to prevent instabilities from occurring. For the case that a coded signal with a controlled bit rate is desired, the article also suggests selectively multiplying the filtered signal, already scaled with the time-dependent amplification factor a, by an attenuating factor unequal to 1, so that audible interferences do occur, but the bit rate can be reduced at passages of the audio signal which are complicated to code.
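The sparse transmission of side information and the sample-wise linear interpolation of the filter coefficients may be sketched as follows; the change threshold and the coefficient values are hypothetical, and the lattice filter structure itself is omitted.

```python
import numpy as np

def needs_transmission(prev, new, threshold=0.1):
    """Transfer a new parameter set only if it differs sufficiently from
    the previously transferred one (the threshold value is hypothetical)."""
    return np.max(np.abs(new - prev)) > threshold

def interpolate_coeffs(prev, new, block_len):
    """Ramp linearly from the old filter coefficients x# to the new ones
    across the samples of a block, so that every sample is filtered with a
    slightly different coefficient set and audible switching artifacts at
    parameter-set boundaries are avoided."""
    t = np.linspace(0.0, 1.0, block_len, endpoint=False)[:, None]
    return (1.0 - t) * prev + t * new    # shape: (block_len, n_coeffs)

prev = np.array([0.50, -0.20, 0.10])     # last transferred coefficient set x#
new = np.array([0.80, -0.25, 0.05])      # coefficient set of the current block
if needs_transmission(prev, new):
    per_sample = interpolate_coeffs(prev, new, block_len=128)
    # the first sample still uses the old set; the ramp ends just before `new`
```

Interpolating in the coefficient domain is only safe if intermediate coefficient sets cannot render the filter unstable, which is why the article resorts to a lattice structure.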
Although the audio coding scheme described in the article mentioned above already reduces the delay time to a degree sufficient for many applications, a problem of the above scheme is that, due to the requirement of having to transfer the masking threshold or the transfer function of the coder-side filter, subsequently referred to as pre-filter, the transfer channel is loaded to a relatively high degree, even though the filter coefficients will only be transferred when a predetermined threshold is exceeded.
Another disadvantage of the above coding scheme is that, since the masking threshold or the inverse thereof has to be made available on the decoder side by the parameter set x# to be transferred, a compromise has to be made between a bit rate as low as possible or a compression ratio as high as possible, on the one hand, and an approximation or parameterization of the masking threshold or the inverse thereof as precise as possible, on the other hand. Thus, it is inevitable that the quantizing noise shaped to the masking threshold by the above audio coding scheme exceeds the masking threshold in some frequency ranges and thus results in audible interferences for the listener. FIG. 13, for example, shows the parameterized frequency response of the decoder-side parameterizable filter as graph c. As can be seen, there are regions where the transfer function of the decoder-side filter, subsequently referred to as post-filter, exceeds the masking threshold b. The problem is aggravated by the fact that the parameterization is only transferred intermittently, namely whenever there is a sufficient change between parameterizations, and is interpolated in between. Interpolating only the filter coefficients x#, as suggested in the article, results in audible interferences when the amplification value a# is kept constant from node to node, i.e. from one new parameterization to the next. Even if the interpolation suggested in the article is also applied to the side information value a#, i.e. the amplification value transferred, audible artifacts may remain in the audio signal arriving on the decoder side.
Another problem of the audio coding scheme according to FIGS. 12 and 13 is that the filtered signal may, due to the frequency-selective filtering, take an unpredictable form in which, particularly due to a random superposition of many individual harmonic waves, one or several individual values of the filtered signal add up to very high amplitudes which, due to their rare occurrence, in turn result in a poorer compression ratio in the subsequent redundancy reduction.