The present invention relates to audio coding methods and in particular to audio coding methods according to the Standard ISO/MPEG, such as MPEG-1, MPEG-2 and MPEG-2 AAC, for the data-reduced representation of high quality audio signals.
The standardization body ISO/IEC JTC1/SC29/WG11, which is also known as the Moving Picture Experts Group (MPEG), was founded in 1988 in order to specify digital video and audio coding schemes for low data rates. In November 1992 the first specification phase was completed with the Standard MPEG-1. The audio coding system according to MPEG-1, which is specified in ISO/IEC 11172-3, works in a one-channel or two-channel stereo mode at sampling frequencies of 32 kHz, 44.1 kHz and 48 kHz. The Standard MPEG-1 Layer II delivers radio quality, as it is specified by the International Telecommunication Union, at a data rate of 128 kb/s per channel.
In its second development phase the aims of MPEG were to define a multichannel extension for MPEG-1 audio which should be backwards compatible with the existing MPEG-1 systems, and also to define an audio coding standard at lower sampling frequencies (16 kHz, 22.05 kHz, 24 kHz) than in MPEG-1. The backwards compatible standard (MPEG-2 BC) and the standard with lower sampling frequencies (MPEG-2 LSF) were completed in November 1994. MPEG-2 BC delivers a good audio quality at data rates of 640-896 kb/s for 5 channels with full bandwidth. Since 1994 the MPEG-2 Audio Standardization Committee has been striving to define a multichannel standard with higher quality than is attainable if backwards compatibility with MPEG-1 is required. This non-backwards-compatible audio standard according to MPEG-2 is denoted by MPEG-2 NBC. The aim of this development is to achieve radio quality according to the ITU-R requirements at data rates of 384 kb/s or less for 5-channel audio signals for which each channel has the full bandwidth. The Audio Coding Standard MPEG-2 NBC was completed in April 1997. The scheme MPEG-2 NBC will become the nucleus of the already planned Audio Standard MPEG-4, which will have higher data rates (over 40 kb/s per channel). The NBC or non-backwards compatible standard combines the coding efficiency of a high-resolution filter bank, of prediction techniques and of the redundancy reducing Huffman coding to achieve an audio coding with radio quality at very low data rates. The Standard MPEG-2 NBC is also denoted by MPEG-2 NBC AAC (AAC=Advanced Audio Coding). A detailed description of the technical content of MPEG-2 AAC is to be found in M. Bosi, K. Brandenburg, S. Quackenbush, L. Fiedler, K. Akagiri, H. Fuchs, M. Dietz, J. Herre, G. Davidson, Yoshiaki Oikawa: “ISO/IEC MPEG-2 Advanced Audio Coding”, 101st AES Convention, Los Angeles 1996, Preprint 4382.
Efficient audio coding methods remove both redundancies and irrelevancies from audio signals. Correlations between audio sampling values and statistics of the sampling value representation are exploited so as to remove redundancies. Frequency domain and time domain masking properties of the human auditory system are exploited so as to remove imperceptible signal content (irrelevancies). The frequency content of the audio signal is subdivided into subbands by means of a filter bank. The data rate reduction is achieved by quantizing the spectrum of the time-domain signal according to psychoacoustic models and may include a lossless coding method.
Generally speaking, a time-continuous audio signal is sampled so as to obtain a time-discrete audio signal. The time-discrete audio signal is windowed by means of a window function so as to obtain successive blocks or frames with a certain number, e.g. 1024, of windowed time-discrete sampled values. Each block of windowed time-discrete sampled audio signal values is transformed in turn into the frequency domain, which may be achieved using a modified discrete cosine transform (MDCT) for example. Since the spectral values obtained in this way are not yet quantized, it is necessary to quantize them. Here the main aim is to quantize the spectral data in such a way that the quantization noise is masked or concealed by the quantized signals themselves. This is achieved with the aid of a psychoacoustic model described in the MPEG AAC Standard which, taking account of the special properties of the human ear, calculates masking thresholds depending on the audio signal involved. The spectral values are now quantized in such a way that the quantization noise which is introduced is concealed and therefore inaudible. The quantization does not therefore result in any audible noise.
In the NBC Standard a so-called non-uniform quantizer is used. Additionally, a method for shaping the quantization noise is used. The NBC method, like previous standards, employs the individual amplification of groups of spectral coefficients, which are known as scale factor bands. To work as efficiently as possible it is desirable to be able to shape the quantization noise into units which are based as closely as possible on the frequency groups of the human auditory system. In this way it is possible to group together spectral values which very closely reflect the bandwidth of the frequency groups. Individual scale factor bands can be amplified by means of scale factors in stages of 1.5 dB. The noise shaping is achieved since amplified coefficients have larger amplitudes. They will therefore in general have a higher signal/noise ratio after quantization. On the other hand, larger amplitudes require more bits for the coding, i.e. the bit distribution between the scale factor bands is implicitly changed. The amplification through the scale factors must of course be corrected in the decoder. For this reason the amplification information, which is stored in the scale factors in units of 1.5 dB steps, must be transmitted to the decoder as side information.
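The effect of amplifying a scale factor band before quantization, and of correcting the amplification in the decoder, can be illustrated with a minimal sketch. It assumes that a scale factor simply counts 1.5 dB steps and, purely for illustration, uses plain rounding as a stand-in for the non-uniform quantizer; the function names are hypothetical and are not taken from the Standard.

```python
STEP_DB = 1.5  # step size of the scale factors as stated in the text


def scale_gain(steps: int) -> float:
    """Linear amplitude gain corresponding to `steps` steps of 1.5 dB."""
    return 10.0 ** (steps * STEP_DB / 20.0)


def amplify_band(coeffs, steps):
    """Coder side: amplify all coefficients of one scale factor band."""
    gain = scale_gain(steps)
    return [c * gain for c in coeffs]


def correct_band(coeffs, steps):
    """Decoder side: undo the amplification using the transmitted scale factor."""
    gain = scale_gain(steps)
    return [c / gain for c in coeffs]


def max_error(coeffs, steps):
    """Largest reconstruction error after amplify -> round -> correct.

    Rounding to the nearest integer is an illustrative stand-in for the
    quantizer, not the non-uniform quantizer of the Standard.
    """
    quantized = [round(c) for c in amplify_band(coeffs, steps)]
    decoded = correct_band(quantized, steps)
    return max(abs(a - b) for a, b in zip(coeffs, decoded))


if __name__ == "__main__":
    band = [0.3, -1.7, 2.4]
    print("error without amplification:", max_error(band, 0))
    print("error with 8 steps (12 dB): ", max_error(band, 8))
```

In this sketch the amplified band is reconstructed with a smaller error, at the price of larger integer values that would need more bits in the subsequent coding.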
After quantization of the spectral values, possibly amplified through scale factors, in the scale factor bands, the spectral values themselves should be coded. The input signal into a noiseless coding module is thus the set of e.g. 1024 quantized spectral coefficients. The sets of 1024 quantized spectral coefficients are partitioned by the noiseless coding module into “sections” in such a way that a single Huffman codebook is used to code each section. For reasons of coding efficiency, section boundaries can only exist at scale factor band boundaries such that for each section of the spectrum both the length of the section in scale factor bands and the Huffman codebook number used for the section must be transmitted as side information.
The forming of the sections is dynamic and varies typically from block to block in such a way that the number of bits needed to represent the full set of quantized spectral coefficients is minimized. The Huffman coding is used to represent n-tuples of quantized coefficients, the Huffman code being derived from one of 12 codebooks. The maximum absolute value of the quantized coefficients which can be represented by each Huffman codebook and the number of coefficients in each n-tuple for each codebook are specified a priori.
The point of forming the sections thus consists in grouping together regions with the same signal statistics so as to obtain, with a single Huffman codebook for a section, the highest possible coding gain, the coding gain generally being defined as the quotient of the bits before coding and the bits after coding. By means of a codebook number, which is specified in the bit stream syntax used for the NBC method, one of the 12 Huffman codebooks is referred to, namely the one which makes possible the highest coding gain for a specific section. The expression “codebook number” in this application is thus meant to designate the place in the bit stream syntax which is reserved for the codebook number. To code 12 different codebook numbers in binary, 4 bits are required. For each section, i.e. for each group of spectral values, these 4 bits must be transmitted as side information to enable the decoder to select the appropriate codebook for decoding.
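The bit count for the codebook number follows from simple arithmetic, which the following minimal sketch reproduces. The helper names are hypothetical, and the fixed-width bit string is only an illustrative way of showing the field, not the actual bit stream syntax.

```python
NUM_CODEBOOKS = 12  # number of Huffman codebooks stated in the text


def bits_needed(n_values: int) -> int:
    """Smallest number of bits that can distinguish `n_values` different values."""
    return max(1, (n_values - 1).bit_length())


def encode_codebook_number(number: int, width: int = bits_needed(NUM_CODEBOOKS)) -> str:
    """Write a codebook number as a fixed-width binary field (illustrative only)."""
    if not 0 <= number < 2 ** width:
        raise ValueError("codebook number does not fit into the field")
    return format(number, "0{}b".format(width))


def decode_codebook_number(bits: str) -> int:
    """Read a codebook number back from its binary field."""
    return int(bits, 2)


if __name__ == "__main__":
    print(bits_needed(NUM_CODEBOOKS))   # field width in bits
    print(encode_codebook_number(11))   # highest number that refers to a codebook
```

Three bits would distinguish only 8 values, so 12 codebook numbers need a 4-bit field.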
Another technique which has aroused interest of late is that of “noise substitution”, the aspects of which are described in detail in Donald Schulz: “Improving Audio Codecs by Noise Substitution”, Journal of the Audio Eng. Soc., Vol. 44, No. 7/8, pp. 593-598, July/August 1996. As already mentioned, traditional audio coding algorithms normally use masking effects of the human ear to reduce decisively the data rate or the number of bits to be transmitted. Masking thus means that one or more frequency components as spectral values render inaudible other components with lower levels. This effect can be exploited in two ways. Firstly, audio signal components which are masked by other components do not have to be coded. Secondly, the introduction of noise through the quantization just described is permissible if this noise is concealed by components of the original signal.
With noisy signals the human auditory system is not capable of detecting the exact variation of such a signal with time. As a consequence, in common algorithms even the waveform of the white noise, which is practically irrelevant for the human ear, was coded. Unless special measures are taken, coding noisy signals taking account of the human ear thus entails high bit rates for information which is inaudible. If, however, noisy components of signals are detected and coded with information on their noise level, their frequency range or duration, such superfluous coding can be reduced, which can result in very considerable bit economies. This fact is underpinned by the science of psychoacoustics, which teaches that the perception of noise signals depends primarily on their spectral composition and not on the actual waveform. This therefore makes it possible to use the noise substitution technique in the data reduction of audio signals.
The coder is thus faced with the task of finding or recognizing noise-like or noisy spectral values in the whole spectrum of the audio signal. One definition of noisy spectral values is as follows: If a signal component can be characterized by its level, its frequency range and its duration in such a way that it can be reconstructed by a noise substitution method without audible differences for the human auditory system, this signal component is classified as noise. The detection of this characteristic can be performed either in the frequency domain or in the time domain, as is described in the publication last cited. The simplest method consists e.g. in detecting tonal, i.e. non-noisy, components by using a time-frequency transform and by following stationary peaks in successive time-domain spectra. These peaks are described as tonal, everything else as noisy. This represents a relatively coarse noise detection, however. Another possibility of distinguishing between noisy and tonal spectral components is to use a predictor for spectral values in successive blocks. Here a prediction is performed from one spectrum to the following spectrum, i.e. the spectrum which is assigned to the next time-domain block or frame. If a predicted spectral value does not differ, or differs only slightly, from a spectral value of the next time-domain block or frame which is actually ascertained by transform, it is assumed that this spectral value represents a tonal spectral component. From this a tonality measure p can be derived, whose value forms the basis of a decision for distinguishing between tonal and noisy spectral values. This detection method is only suitable for strictly stationary signals, however. It fails to detect situations involving sine signals which change their frequencies slightly as a function of time. Such signals often appear in audio signals, e.g. as vibratos, and it is obvious to a person skilled in the art that these cannot be replaced by a noisy component.
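The prediction-based distinction between tonal and noisy spectral values can be sketched as follows. The text does not define the tonality measure p, so the formula below, the linear extrapolation from the two preceding blocks and the threshold of 0.5 are all assumptions chosen only for illustration.

```python
def tonality(prev2: float, prev1: float, actual: float) -> float:
    """Hypothetical tonality measure p in the range 0..1 for one spectral value.

    The value of the current block is predicted by linear extrapolation from
    the two preceding blocks (an assumed predictor). p is 1.0 when the
    prediction is exact and approaches 0.0 when it fails completely.
    """
    predicted = 2.0 * prev1 - prev2
    denom = abs(actual) + abs(predicted)
    if denom == 0.0:
        return 1.0  # all-zero history and value: treated as perfectly predicted
    return 1.0 - abs(actual - predicted) / denom


def is_noisy(prev2: float, prev1: float, actual: float, threshold: float = 0.5) -> bool:
    """Classify a spectral value as noisy when p falls below an assumed threshold."""
    return tonality(prev2, prev1, actual) < threshold


if __name__ == "__main__":
    print(tonality(5.0, 5.0, 5.0))    # stationary peak
    print(tonality(1.0, -2.0, 3.0))   # erratic value
```

As the text notes, a predictor of this kind is suited only to stationary signals; a tone whose frequency drifts slightly would also be predicted poorly and could be misclassified.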
A further possibility for detecting noisy signals is noise detection by prediction in the time domain. An adapted filter is suitable for use here as the predictor, which can be used time after time to perform a linear prediction, as is sufficiently well known in the technical field. Past audio signals are fed in and the output signal is compared with the actual audio sampling value. If the prediction error is small, tonality can be assumed. To determine the character of different frequency regions, i.e. to detect whether a group of spectral values in the spectral region is a noisy group, time-frequency transforms of the original and of the predicted signal must be carried out. A tonality measure can then be calculated for each frequency group by comparing the original and the predicted values with each other. A major problem here is the limited dynamic range of the predictor. A noisy frequency group with a high level dominates the predictor because of the large error which results. Other frequency regions with tonal components could then be interpreted as noisy. This problem can be mitigated by using an iterative algorithm wherein the error signal, which normally has a lower level than the original signal, is fed into another predictor, after which the two predicted signals are added together. Further methods are explained in Schulz's publication.
The group of spectral values now classified as noisy is not, as would normally be the case, quantized and transmitted to the receiver in entropy-coded or redundancy-coded form (e.g. by means of a Huffman codebook). Instead, only an identification indicating the noise substitution and a measure of the energy of the noisy group of spectral values are transmitted as side information. In the receiver random values (noise) with the transmitted energy are then inserted for the substituted coefficients. The noisy spectral values are thus replaced by random spectral values with the corresponding energy measure.
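The receiver-side step described above can be sketched as follows. The sketch assumes that the energy measure is the sum of the squared spectral values of the group and that the random values are drawn from a uniform distribution; the text fixes neither choice, and the function names are hypothetical.

```python
import math
import random


def band_energy(values):
    """Assumed energy measure: sum of the squared spectral values of a group."""
    return sum(v * v for v in values)


def substitute_noise(n_values: int, energy: float, rng=None):
    """Generate `n_values` random spectral values whose energy equals `energy`."""
    rng = rng or random.Random()
    raw = [rng.uniform(-1.0, 1.0) for _ in range(n_values)]
    raw_energy = band_energy(raw)
    if raw_energy == 0.0 or energy == 0.0:
        return [0.0] * n_values  # degenerate case: nothing to scale
    scale = math.sqrt(energy / raw_energy)
    return [v * scale for v in raw]


if __name__ == "__main__":
    original = [0.8, -1.1, 0.4, 1.3, -0.6, 0.2, -0.9, 1.0]
    transmitted = band_energy(original)        # the only value sent for the group
    noise = substitute_noise(len(original), transmitted, random.Random(0))
    print(transmitted, band_energy(noise))     # energies agree up to rounding
```

The decoded values differ from the original ones in waveform but carry the same energy, which is the property the noise substitution relies on.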
Transmitting a single item of energy information for a noisy group, instead of a group of codes, i.e. a plurality of quantized and coded spectral values, makes considerable data economies possible. The attainable data rate economies obviously depend on the signal. Should the signal have very low noise content, i.e. very few noisy groups, or have transient properties, the possible data rate economy will be smaller than when a very noisy signal with very many noisy groups is coded.
The Standard MPEG-2 Advanced Audio Coding (AAC) described at the outset does not support the possibility of noise substitution. The considerable data rate economies are thus not possible with the current standard.
It is the object of the present invention to extend the scope of the existing Standard MPEG-2 AAC to include the possibilities of noise substitution in such a way that neither the fundamental coding structure nor the structure of the existing bit stream syntax is affected.
In accordance with a first aspect of the present invention, this object is achieved by a method for signalling a noise substitution when coding an audio signal, comprising the steps of: transforming a time-domain audio signal into the frequency domain to obtain spectral values; grouping the spectral values together to form groups of spectral values; detecting whether a group of spectral values is a noisy group; if a group is not noisy, allocating a codebook from a plurality of codebooks for the redundancy coding of the non-noisy group, the codebook allocated to the group being referred to by means of a codebook number; and if a group is noisy, allocating an additional codebook number, which does not refer to a codebook, to this group to signal that this group is noisy and is therefore not redundancy coded.
In accordance with a second aspect of the present invention, this object is achieved by a method for coding an audio signal, comprising the steps of: signalling a noise substitution according to the method of the above outlined first aspect of the present invention; calculating a measure of the energy of a noisy group; entering the measure of the energy in the side information assigned to the group; entering the additional codebook number in the side information assigned to the group; quantizing the non-noisy groups and coding the quantized non-noisy groups using the codebook referred to by the codebook number, whereas no quantization or coding takes place for noisy groups; and forming a bit stream which comprises quantized and coded non-noisy groups and, for noisy groups, a measure of the energy of the spectral values of the noisy groups and the additional codebook number for signalling the noisy groups.
In accordance with a third aspect of the present invention, this object is achieved by a method for decoding a coded audio signal, comprising the steps of: receiving a bit stream; redundancy decoding non-noisy groups on the basis of a codebook indicated by a codebook number and requantizing redundancy-decoded, quantized spectral values; identifying a noisy group of spectral values on the basis of an additional codebook number which is assigned to such a group; establishing a measure of the energy of the spectral values in the noisy group on the basis of the side information assigned to the group; generating noise spectral values for the noisy group, the measure of the energy of the noise spectral values in the noisy group being the same as the measure of the energy of the spectral values of the noisy group in the original signal; and transforming the requantized spectral values and the noise spectral values into the time domain to obtain a decoded audio signal.
The present invention is based on the finding that in the case where a noise substitution is performed for a noisy band no quantization and redundancy coding or Huffman coding of spectral values need be performed. Instead, as has already been described, noise spectral values for a noisy group are generated in the decoder, the measure of the energy of said spectral values corresponding to the measure of the energy of the noise-substituted spectral values. In other words, no codebooks are used for noisy groups since no redundancy coding takes place. Consequently the codebook number, i.e. the corresponding place in the bit stream syntax of the coded audio signal, is also superfluous. This place in the bit stream syntax, i.e. the codebook number, can now according to the present invention be used to indicate that a group is noisy and subject to a noise substitution. As has also been mentioned, only 12 codebooks are envisaged; however the place in the bit stream syntax provides for 4 bits, with which a number range of 0-15 can be represented in total in binary, so that so-called additional codebook numbers exist which do not point to any codebook. Only the codebook numbers 0-11 point to a codebook. In a preferred embodiment of the present invention the codebook number 13 is used to signal to the decoder that the group which has the codebook number 13, i.e. the additional codebook number, in its side information is a noisy group and has been subjected to a noise substitution. For persons skilled in the art it is, however, obvious that the additional or free codebook number 12, 14 or 15 can be employed.
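The signalling principle can be sketched in a few lines. The constant 13 follows the preferred embodiment described above; the function names and the string labels returned by the decoder-side helper are hypothetical and serve only to illustrate the idea.

```python
NUM_CODEBOOKS = 12            # codebook numbers 0-11 refer to a codebook
NOISE_CODEBOOK_NUMBER = 13    # additional codebook number of the preferred embodiment


def signal_group(noisy: bool, best_codebook: int = 0) -> int:
    """Coder side: choose the codebook number entered in the side information."""
    if noisy:
        return NOISE_CODEBOOK_NUMBER  # refers to no codebook; signals noise substitution
    if not 0 <= best_codebook < NUM_CODEBOOKS:
        raise ValueError("not a valid Huffman codebook number")
    return best_codebook


def interpret_codebook_number(number: int) -> str:
    """Decoder side: decide how a group is to be reconstructed."""
    if number == NOISE_CODEBOOK_NUMBER:
        return "noise"       # generate noise spectral values with the transmitted energy
    if 0 <= number < NUM_CODEBOOKS:
        return "huffman"     # redundancy decode with the indicated codebook
    raise ValueError("codebook number is not assigned in this sketch")


if __name__ == "__main__":
    for noisy, codebook in [(False, 3), (True, 0), (False, 11)]:
        number = signal_group(noisy, codebook)
        print(number, interpret_codebook_number(number))
```

Because the 4-bit field already exists in the bit stream syntax, this signalling adds no new syntax element; the sketch leaves the remaining free numbers 12, 14 and 15 unassigned.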