Modern digital lifestyle owes much to the principle of perceptual digital audio compression, as used in MPEG-4 AAC (MPEG=Moving Picture Experts Group, AAC=Advanced Audio Coding) or MP3 (MPEG-1 Audio Layer 3). Typical state of the art audio compression systems apply a time-to-frequency transform, such as the modified discrete cosine transform (MDCT), sub-divide the signal into frequency bands that are each formed of a plurality of spectral coefficients, quantize these grouped coefficients with appropriate quantization algorithms, and then code the quantized coefficients with an entropy coding method such as Huffman coding.
The modified discrete cosine transform is a Fourier-related transform with the additional property of being lapped, i.e. it is designed to be performed on consecutive blocks of a larger dataset, where subsequent blocks are overlapped so that the last half of one block coincides with the first half of the next block. This overlapping, in addition to the energy-compaction qualities of the discrete cosine transform, makes the modified discrete cosine transform especially attractive for signal compression applications, since it helps to avoid artifacts stemming from block boundaries. Thus, a modified discrete cosine transform is, for example, employed in MP3 and AAC.
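The overlap property described above can be illustrated with a minimal sketch in Python. It assumes a sine window and a direct, unoptimized evaluation of the transform sums; the function names and the block size are chosen for illustration only and are not taken from any particular codec. Each block of 2N samples yields N coefficients, consecutive blocks overlap by N samples, and overlap-adding the inverse transforms cancels the aliasing of the individual blocks.

```python
import math

def sine_window(N):
    # Sine window of length 2N; it satisfies w[n]^2 + w[n+N]^2 = 1.
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]

def mdct(block):
    # Windowed MDCT: 2N time samples in, N spectral coefficients out.
    N = len(block) // 2
    w = sine_window(N)
    n0 = 0.5 + N / 2.0
    return [sum(w[n] * block[n] * math.cos(math.pi / N * (n + n0) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(coeffs):
    # Windowed inverse MDCT: N coefficients in, 2N aliased time samples out.
    N = len(coeffs)
    w = sine_window(N)
    n0 = 0.5 + N / 2.0
    return [w[n] * (2.0 / N) *
            sum(coeffs[k] * math.cos(math.pi / N * (n + n0) * (k + 0.5))
                for k in range(N))
            for n in range(2 * N)]

def analysis_synthesis(signal, N):
    # Transform overlapping blocks (hop size N) and overlap-add the inverses.
    # len(signal) is assumed to be a multiple of N.
    padded = [0.0] * N + list(signal) + [0.0] * N
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * N + 1, N):
        rec = imdct(mdct(padded[start:start + 2 * N]))
        for i, v in enumerate(rec):
            out[start + i] += v
    return out[N:N + len(signal)]
```

Without any quantization between `mdct` and `imdct`, `analysis_synthesis` returns the input signal up to floating-point rounding, which is why block-boundary artifacts are avoided by construction.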
Unfortunately, at very low bit rates, i.e. at high compression demands, coding systems have no option but to shut down frequency bands, i.e. replace them with silence. This method is used in order to meet the coding demands imposed on the codec. It introduces holes in the spectrum that are especially annoying and are the biggest contributor to audio coding artifacts.
FIG. 8 shows a typical state of the art audio encoder for an input signal that is PCM (Pulse Code Modulation) encoded and input to a filter bank 810 and a perceptual model 815. The input signal is transformed from the temporal or time domain to the frequency domain by the filter bank 810, which is usually based on well known signal transform functions, such as the modified discrete cosine transform. The outputs of the filter bank are frequency coefficients.
At the same time, the signal is evaluated by the perceptual model 815, which mathematically models the human auditory system and outputs a measure such as the just noticeable distortion (JND), expressed as a signal-to-mask ratio (SMR), i.e. the ratio of the input signal energy to the just noticeable distortion or noise energy.
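As a small worked illustration of the ratio just described, the following Python sketch expresses the signal-to-mask ratio of one band in decibels. The function name and the choice of decibel units are assumptions made for illustration; how the masking threshold itself is obtained is the task of the perceptual model and is not shown.

```python
import math

def smr_db(signal_energy, masking_threshold_energy):
    # Ratio of signal energy to just noticeable noise energy, in dB.
    # Both energies are assumed to be positive and in the same units.
    return 10.0 * math.log10(signal_energy / masking_threshold_energy)
```

A band whose signal energy is 100 times its masking threshold has an SMR of about 20 dB; a band whose SMR is negative lies entirely below the threshold.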
The perceptual model block 815 and the remaining blocks of the state of the art encoder depicted in FIG. 8 treat the output of the filter bank block 810 in proportion to the critical bandwidths of the human auditory system, for example by grouping the frequency coefficients into so-called scaling factor bands. A good summary of the perceptual model can be found in T. Painter and A. Spanias, “Perceptual Coding of Digital Audio”, Proceedings of the IEEE, pp. 451-513, April 2000.
The target compression demand is met by quantization of the frequency coefficients. Before quantization, the coefficients are scaled by so-called scaling factors, which determine the eventual precision of the quantization process. The bit/noise allocation block 820 is responsible for estimating or calculating the scaling factors such that the reconstruction of the quantized values yields quantization noise just below the masking threshold estimated by the perceptual model. Under certain circumstances, the perceptual model 815 indicates that certain frequency bands are noise-like and may be modeled by generating noise with a certain energy on the decoder side. For these frequency bands, there is no need to determine scaling factors or frequency coefficients; parameters for a noise generator at the decoder side are inserted instead. Since the parameters for the noise generator require less data than scaling factors and frequency coefficients, data rate can be saved by replacing frequency bands with generated noise. The impact of the replacement on the quality of the decoded audio signal is kept within bounds determined by the perceptual model. For example, a frequency band that is to be replaced must not exceed a certain tonality threshold, nor may it contain any transient signal. The thresholds that determine noise substitution depend on the perceptual model. Perceptual noise substitution is described, for example, in ISO/IEC 14496 as a feature of AAC.
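The role of the scaling factors described above can be illustrated with a minimal Python sketch. It assumes a plain uniform quantizer whose step size grows as 2^(scale_factor/4); this is a simplification chosen for illustration and not the quantizer of any particular standard. The function names are hypothetical.

```python
import math

def quantize_band(coeffs, scale_factor):
    # Larger scale factor -> larger step size -> coarser quantization.
    step = 2.0 ** (scale_factor / 4.0)
    return [int(math.copysign(math.floor(abs(c) / step + 0.5), c))
            for c in coeffs]

def dequantize_band(indices, scale_factor):
    # Decoder-side reconstruction from the quantized indices.
    step = 2.0 ** (scale_factor / 4.0)
    return [q * step for q in indices]

def quantization_noise_energy(coeffs, scale_factor):
    # Energy of the error between original and reconstructed coefficients.
    rec = dequantize_band(quantize_band(coeffs, scale_factor), scale_factor)
    return sum((c - r) ** 2 for c, r in zip(coeffs, rec))
```

In a bit/noise allocation of the kind described, the scaling factor of each band would be chosen so that `quantization_noise_energy` stays just below the masking threshold energy supplied by the perceptual model.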
An advanced coding method used in some perceptual codecs is the so-called perceptual noise substitution (PNS), of which a good summary can be found in Herre, Jürgen, Schulz, Donald, “Extending the MPEG-4 AAC Codec by Perceptual Noise Substitution”, AES document 4720.
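The substitution decision and the decoder-side noise generation can be sketched as follows in Python. The tonality measure in the range 0 to 1, the threshold value of 0.3, the dictionary layout and all function names are hypothetical and serve only to illustrate the principle; they are not taken from ISO/IEC 14496 or from the cited paper.

```python
import math
import random

def encode_band_pns(coeffs, tonality, has_transient, tonality_threshold=0.3):
    # Noise-like, non-transient bands are replaced by a single energy value.
    if tonality <= tonality_threshold and not has_transient:
        return {"mode": "noise", "energy": sum(c * c for c in coeffs)}
    # All other bands keep their coefficients (quantization not shown).
    return {"mode": "coded", "coeffs": list(coeffs)}

def decode_band_pns(params, width, rng):
    # Coded bands are passed through unchanged.
    if params["mode"] == "coded":
        return list(params["coeffs"])
    # Substituted bands are filled with random values scaled so that the
    # band energy matches the transmitted energy.
    noise = [rng.uniform(-1.0, 1.0) for _ in range(width)]
    energy = sum(v * v for v in noise)
    gain = math.sqrt(params["energy"] / energy) if energy > 0.0 else 0.0
    return [gain * v for v in noise]
```

The decoded band differs from the original in its individual coefficients but carries the same energy, which is the property the data rate saving relies on.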
After the bit/noise allocation block 820 in FIG. 8, quantization is done in the quantization block 825, yielding quantized frequency coefficients, which are passed to the irrelevancy reduction block 830. The irrelevancy reduction block 830 employs signal irrelevancy reduction methods, which are well known from signal theory. For example, Huffman coding, vector quantization or arithmetic coding are well known methods for signal irrelevancy reduction. An overview of these methods can, for instance, be found in K. Brandenburg, “MP3 and AAC Explained”, in Proceedings of the AES 17th International Conference on High-Quality Audio Coding, 1999.
In order to achieve the target coding requirements, for example, a given bit rate for the compressed signal, state of the art codecs are able to reduce the coding requirements by increasing the allowed amount of noise specified by the psycho-acoustic model or perceptual model. Referring to FIG. 8, the coding requirement is verified in block 835 and if the coding requirement is not met, the bit demand is further reduced in the reduction block 840, upon which the encoding algorithm returns to the bit/noise allocation block 820. If the coding requirement is achieved, a bit stream multiplexer block 845 multiplexes the coded quantized frequency coefficients and the coded scaling factors into a coded bit stream.
If the coding requirement is not met and the bit demand is further reduced, additional noise is introduced into the signal. As the allowed noise is increased, the scaling factors are increased as well and the resolution of the quantized signal is decreased, which in turn decreases the bit demand. The quantization resolution can be decreased up to the point where the noise becomes greater than the signal itself, in which case the output of the quantization block for that scaling factor band may be zero. This effectively inserts a hole in the spectrum at the place where the signal of that scaling factor band should be present. This operation is repeated iteratively as long as the transmission/storage demand of the coded quantized coefficients exceeds the constraints imposed on the encoder. The operation always terminates successfully, even if it sets all quantized outputs to zero, cf. the flowchart in FIG. 8.
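The iteration described above can be sketched in Python as follows. The sketch assumes a uniform quantizer with step size 2^(scale_factor/4), a hypothetical bit-cost estimate that charges nothing for zero-valued coefficients, and a uniform increase of all scaling factors per iteration; a real encoder would use its actual entropy coder and the perceptual model to distribute the additional noise. All names are illustrative.

```python
import math

def quantize(coeffs, scale_factor):
    # Uniform quantizer; the step size grows as 2**(scale_factor / 4).
    step = 2.0 ** (scale_factor / 4.0)
    return [int(math.copysign(math.floor(abs(c) / step + 0.5), c))
            for c in coeffs]

def estimate_bits(indices):
    # Hypothetical cost model: zeros are free, larger magnitudes cost more.
    return sum(0 if q == 0 else 2 * abs(q).bit_length() + 1 for q in indices)

def rate_loop(bands, bit_budget):
    # Increase the allowed noise until the bit demand fits the budget.
    if bit_budget < 0:
        raise ValueError("bit_budget must be non-negative")
    scale_factors = [0] * len(bands)
    while True:
        quantized = [quantize(b, sf) for b, sf in zip(bands, scale_factors)]
        bits = sum(estimate_bits(q) for q in quantized)
        if bits <= bit_budget:
            # A hole is a band that carried signal but is quantized to zero.
            holes = [i for i, q in enumerate(quantized)
                     if not any(q) and any(bands[i])]
            return scale_factors, quantized, holes
        scale_factors = [sf + 1 for sf in scale_factors]
```

The loop is guaranteed to end because a sufficiently large step size quantizes every coefficient to zero, at which point the estimated bit demand is zero. With a tight budget the weaker bands reach zero first and are reported as holes.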
While the above-described state of the art method effectively maintains the coding requirements and functions quite well, provided that the constraints imposed on the codec are achievable without eliminating too many scaling factor bands in the constraint reduction phase, it can fail badly if the coding demands are set too high for the encoder.
This usually happens if the required bit rate is well below what the perceptual model demands. Non-optimized codecs then usually introduce a large number of holes by shutting down too many scaling factor bands in order to meet the coding constraints. Spectral holes or shut-downs are usually easily detected by listeners and severely degrade the sound quality. Signals containing spectral holes are usually described as exhibiting ringing, a swishy sound, birdies, etc.
Optimized state of the art codecs, as they can, for example, be found in 3GPP (3GPP=Third Generation Partnership Project) TS (TS=Technical Specification) 26.403, employ more advantageous strategies of coding constraint reduction, usually called hole avoidance. This strategy works by imposing a maximum constraint reduction limit on each scaling factor band. This ensures that no holes are introduced as long as the coding constraints can be reduced for all scaling factor bands without violating this limit while still meeting the constraints imposed on the encoder. However, even with this advanced strategy, it is quite possible that the coding constraints will not be met, and in this case the encoder has no other option but to start introducing spectral holes by eliminating scaling factor bands.
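The principle of hole avoidance can be sketched in Python as follows. This is an illustrative sketch under stated assumptions and not the procedure of 3GPP TS 26.403: it assumes the same uniform quantizer and hypothetical bit-cost estimate as a simple rate loop, takes as the per-band limit the largest scaling factor that still leaves at least one non-zero coefficient, and, once all limits are exhausted, drops the band with the lowest energy first. All names are hypothetical.

```python
import math

def quantize(coeffs, scale_factor):
    # Uniform quantizer; the step size grows as 2**(scale_factor / 4).
    step = 2.0 ** (scale_factor / 4.0)
    return [int(math.copysign(math.floor(abs(c) / step + 0.5), c))
            for c in coeffs]

def estimate_bits(indices):
    # Hypothetical cost model: zeros are free, larger magnitudes cost more.
    return sum(0 if q == 0 else 2 * abs(q).bit_length() + 1 for q in indices)

def max_scale_factor_without_hole(coeffs):
    # Largest scale factor for which the band is not quantized to all zeros.
    sf = 0
    while any(quantize(coeffs, sf + 1)):
        sf += 1
    return sf

def rate_loop_with_hole_avoidance(bands, bit_budget):
    if bit_budget < 0:
        raise ValueError("bit_budget must be non-negative")
    limits = [max_scale_factor_without_hole(b) for b in bands]
    sfs = [0] * len(bands)
    dropped = set()  # bands eliminated after all limits were exhausted
    while True:
        quantized = [[0] * len(b) if i in dropped else quantize(b, sf)
                     for i, (b, sf) in enumerate(zip(bands, sfs))]
        if sum(estimate_bits(q) for q in quantized) <= bit_budget:
            holes = [i for i, q in enumerate(quantized)
                     if not any(q) and any(bands[i])]
            return sfs, quantized, holes
        if any(sf < lim for sf, lim in zip(sfs, limits)):
            # Stage 1: add noise everywhere, but never past a band's limit.
            sfs = [min(sf + 1, lim) for sf, lim in zip(sfs, limits)]
        else:
            # Stage 2: limits exhausted, a hole is unavoidable.
            remaining = [i for i in range(len(bands)) if i not in dropped]
            dropped.add(min(remaining,
                            key=lambda i: sum(c * c for c in bands[i])))
```

As long as the budget can be met within the limits, no band is quantized to all zeros. Only when every band has reached its limit and the budget is still exceeded does the sketch begin to eliminate bands, which mirrors the fallback described above.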
FIG. 9 shows spectrum plots of two coded signals in the range of 100 Hz to 15 kHz. The codecs displayed operate at 32 kbps, which corresponds to a 44:1 compression ratio, and at 320 kbps, which corresponds to a 4.4:1 compression ratio. As can easily be seen from FIG. 9, the 32 kbps codec was forced to introduce spectral holes in order to meet the coding demand, which shows as severe degradation in the upper frequency range.
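The quoted compression ratios can be reproduced with a short calculation. The sketch below assumes that the uncompressed reference is CD-quality PCM, i.e. 44.1 kHz sampling rate, 16 bits per sample and two channels; the text does not state the reference format, so these parameters are an assumption, and the function name is illustrative.

```python
def compression_ratio(bitrate_kbps, sample_rate_hz=44100,
                      bits_per_sample=16, channels=2):
    # Uncompressed PCM rate in kbps divided by the coded bit rate.
    pcm_kbps = sample_rate_hz * bits_per_sample * channels / 1000.0
    return pcm_kbps / bitrate_kbps
```

Under this assumption the PCM rate is 1411.2 kbps, so 32 kbps corresponds to a ratio of about 44:1 and 320 kbps to about 4.4:1, in line with the figures given for FIG. 9.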