Audio, i.e. acoustic energy, is analog by its nature. It is convenient, however, to represent audio in digital form for storage or transmission purposes. Pure digital audio data obtained by sampling and digitizing an analog audio signal requires large storage capacity and channel bandwidth, particularly for high-quality audio, which for instance may be represented by 16 bits per sample at a sampling rate of 44.1 kHz (normal audio CD quality). Hence, digital audio is normally compressed according to various known source coding techniques.
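The storage and bandwidth demand of uncompressed audio can be illustrated with a short calculation. The following is a minimal sketch only; the function name `pcm_bit_rate` is hypothetical, and two channels (stereo) are assumed, as on an audio CD, which is sampled at 44,100 Hz.

```python
def pcm_bit_rate(bits_per_sample: int, sample_rate_hz: int, channels: int) -> int:
    """Bit rate of uncompressed PCM audio in bits per second."""
    return bits_per_sample * sample_rate_hz * channels


# Assumption: CD-quality stereo, i.e. 16 bits per sample, 44,100 Hz, 2 channels.
cd_rate = pcm_bit_rate(16, 44_100, 2)

# Storage needed for one minute of such audio, in megabytes (10^6 bytes).
megabytes_per_minute = cd_rate * 60 / 8 / 1_000_000

print(cd_rate)               # bits per second
print(megabytes_per_minute)  # megabytes per minute
```

Under these assumptions the uncompressed stream runs at roughly 1.4 Mbit/s, i.e. about two orders of magnitude above the 12 kbps-24 kbps range that low-rate scalable coding targets.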
Perceptual audio coding techniques, such as MPEG Layer-3 (MP3), MPEG-2 and MPEG-4, all make use of the signal masking properties of the human ear in order to reduce the amount of data. By doing so, the quantization noise is distributed to frequency bands in such a way that it is masked by the total signal, i.e. it remains inaudible. Considerable storage size reduction is possible with little or no perceptible loss of audio quality.
Perceptual audio coding techniques are often scalable and produce a layered bit stream having a base layer and at least one enhancement layer. This allows bit-rate scalability, i.e. decoding at different audio quality levels at the decoder side, or reducing the bit rate in the network by traffic shaping or conditioning. One approach is to provide base layer encoding in mono only, and to provide an enhancement layer encoding which adds stereo quality to the audio. In this way, it is possible at the decoder side either to decode the base layer information only (for instance in case the receiver device at the decoder side has only one speaker) or to decode the base layer information as well as the enhancement layer information so as to generate stereo sound.
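The mono-base, stereo-enhancement layering can be sketched as follows. This is an illustration under stated assumptions, not the scheme of any particular standard: a simple mid/side split is assumed, in which the base layer carries the average of the two channels and the enhancement layer carries half their difference. All function names are hypothetical, and quantization and entropy coding are omitted.

```python
from typing import List, Tuple


def encode_layers(left: List[float], right: List[float]) -> Tuple[List[float], List[float]]:
    """Split a stereo signal into a mono base layer and a stereo enhancement layer.

    Assumption: mid/side split (base = average, enhancement = half difference).
    """
    base = [(l + r) / 2 for l, r in zip(left, right)]
    enhancement = [(l - r) / 2 for l, r in zip(left, right)]
    return base, enhancement


def decode_mono(base: List[float]) -> List[float]:
    """Decode the base layer only, e.g. for a receiver with a single speaker."""
    return list(base)


def decode_stereo(base: List[float], enhancement: List[float]) -> Tuple[List[float], List[float]]:
    """Decode base plus enhancement layer to recover both channels."""
    left = [m + s for m, s in zip(base, enhancement)]
    right = [m - s for m, s in zip(base, enhancement)]
    return left, right


left_in, right_in = [1.0, 0.5], [0.0, 0.5]
base_layer, enh_layer = encode_layers(left_in, right_in)
mono_out = decode_mono(base_layer)
left_out, right_out = decode_stereo(base_layer, enh_layer)
```

A network element may drop the enhancement layer to reduce the bit rate; the decoder can then still produce the mono output from the base layer alone.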
Within the context of scalable audio coding, “base layer” and “core layer” are used as synonyms.
ISO/IEC 14496-3:2001(E), Subpart 4, describes a portion of the MPEG-4 Audio standard and suggests a combination of either an MPEG-4 compliant core codec, or an external core codec of CELP type (Code Excited Linear Prediction), with an AAC (Advanced Audio Coding) enhancement layer codec so as to provide efficient bit-rate scalability.
The AMR-WB (Adaptive Multi-Rate Wideband) speech codec is one example of a CELP-type codec, which will be used in 3rd generation mobile terminals and is described in 3rd Generation Partnership Project (3GPP) TS 26.190 V5.0.0 (2001-03).
In a scalable audio coding arrangement like the one referred to in the aforesaid MPEG-4 Audio standard, a frequency selective switching unit (FSSU) in the enhancement layer encoder estimates the number of bits needed to encode either the original audio signal or a residual signal, which is derived by subtracting the reconstructed output signal of the preceding layer (the core layer) from the original signal. The FSSU always selects the alternative which will need fewer bits for encoding. This decision is made for each individual frequency sub-band (i.e. for each fixed group of spectral lines representing the signal) within an audio frame. To allow reconstruction on the decoder side, the encoder has to transmit FSS control information indicating which of the two alternatives was selected for each sub-band in each audio frame. According to this control information, the output signal from the enhancement layer decoder will then be added to the output of the core layer decoder only in those sub-bands where the residual signal has been encoded.
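The per-sub-band decision and the corresponding decoder-side reconstruction can be sketched as follows. This is a minimal illustration, not the procedure defined in the MPEG-4 Audio standard: the bit-cost estimate `estimate_bits` is a hypothetical logarithmic proxy, the function names and the `band_edges` representation are assumptions, the sub-bands are assumed to cover the whole spectrum, and the enhancement layer is treated as lossless for clarity.

```python
import math
from typing import List, Sequence, Tuple


def estimate_bits(band: Sequence[float], step: float = 1.0) -> float:
    """Hypothetical bit-cost proxy: grows with the magnitude of each spectral line."""
    return sum(math.log2(1.0 + abs(x) / step) for x in band)


def fss_select(
    original: Sequence[float],
    core_reconstructed: Sequence[float],
    band_edges: Sequence[Tuple[int, int]],
) -> Tuple[List[bool], List[float]]:
    """For each sub-band, pick the original or the residual, whichever is cheaper.

    Returns the FSS control flags (True = residual encoded) and the selected spectrum.
    """
    flags: List[bool] = []
    selected: List[float] = []
    for start, stop in band_edges:
        orig_band = list(original[start:stop])
        # Residual = original minus the core layer's reconstructed output.
        resid_band = [o - c for o, c in zip(orig_band, core_reconstructed[start:stop])]
        use_residual = estimate_bits(resid_band) < estimate_bits(orig_band)
        flags.append(use_residual)
        selected.extend(resid_band if use_residual else orig_band)
    return flags, selected


def fss_reconstruct(
    enh_decoded: Sequence[float],
    core_decoded: Sequence[float],
    band_edges: Sequence[Tuple[int, int]],
    flags: Sequence[bool],
) -> List[float]:
    """Add the core output only in sub-bands where the residual was encoded."""
    out = list(enh_decoded)
    for (start, stop), use_residual in zip(band_edges, flags):
        if use_residual:
            for i in range(start, stop):
                out[i] = enh_decoded[i] + core_decoded[i]
    return out


original = [4.0, 3.0, 0.5, 0.25]
core = [3.5, 3.0, -4.0, 6.0]   # core layer is accurate in band 0, poor in band 1
bands = [(0, 2), (2, 4)]

flags, selected = fss_select(original, core, bands)
restored = fss_reconstruct(selected, core, bands, flags)
```

In this example the residual is chosen for the first sub-band, where the core layer approximates the signal well, and the original is chosen for the second, where subtracting the core output would only increase the signal to be encoded.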
However, the present inventors have identified the following problem with scalable audio coding arrangements like the one described above. Particularly for low and modest bit rates, e.g. in the range of 12 kbps-24 kbps, there will sometimes not be enough bits available to encode the enhancement signal in such a way that the quantization errors remain imperceptible. At the decoder side, such errors will sound like cracks, pops, etc., and will therefore be very disturbing. In fact, such errors can even lead to a degradation in perceived quality compared to the output signal of the core layer alone.
In the prior art, to prevent this effect, one would either have to restrict the encoded frequency range, at the risk of losing audible information, or increase the bit rate for the enhancement layer codec, which may not be a desirable or even possible option in view of available network bandwidth.