Bit rate scalability is emerging as a major requirement in compression systems aimed at wireless and networking applications. A scalable bit stream allows the decoder to produce a coarse reconstruction if only a portion of the entire coded bit stream is received, and to improve the quality when more of the total bit stream is made available. Scalability is especially important in applications such as digital broadcasting and multicast, which require simultaneous transmission over multiple channels of differing capacity. Further, a scalable bit stream provides robustness to packet loss for transmission over packet networks (e.g., over the Internet). A recent standard for scalable audio coding is MPEG-4, which performs multi-layer coding using Advanced Audio Coding (AAC) modules.
Advanced Audio Coding in the Base-layer
FIG. 1 shows a block diagram of a conventional base-layer AAC encoder module 10. The “transform and pre-processing” block 12 converts the time domain data 14 into the spectral domain 16. A switched modified discrete cosine transform is used to obtain a frame of 1024 spectral coefficients. The time domain data 14 is also used by the psychoacoustic model 18 to generate the masking threshold 20 for the spectral coefficients 16. The spectral coefficients are conventionally grouped into 49 bands to mimic the critical band model of the human auditory system. All transform coefficients within a given band are quantized (block 22) using the same generic non-uniform Scalar Quantizer (SQ). Equivalently, the transform coefficients are compressed by a corresponding non-linear reversible compression function $c(x)$ 24 (which for AAC is $|x|^{0.75}$), and then quantized using a Uniform SQ (USQ) 26 after a dead-zone rounding of 0.0946 (see FIG. 2). We thus have
$$ i_x = \operatorname{sign}[x] \cdot \operatorname{nint}\{\Delta\, c(x) - 0.0946\}, \qquad \hat{x} = \operatorname{sign}[i_x] \cdot c^{-1}\!\left(\frac{|i_x| + 0.0946}{\Delta}\right), \tag{1} $$
where $x$ and $\hat{x}$ are the original and quantized coefficients, $\Delta$ is the quantizer scale factor of the band, and nint and sign represent the nearest-integer and signum functions, respectively.
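By way of illustration only, the quantization and reconstruction of Equation (1) may be sketched in Python as follows. The function names, the constant name, and the tie-breaking rule chosen for nint (halves rounded up) are assumptions made for this sketch; they are not taken from the AAC specification or from the figures.

```python
import math

ROUNDING_OFFSET = 0.0946  # dead-zone rounding constant of Equation (1)


def compress(x):
    """Compression function c(x) = |x|^0.75."""
    return abs(x) ** 0.75


def expand(y):
    """Inverse compression c^{-1}(y) = y^(4/3), for y >= 0."""
    return y ** (4.0 / 3.0)


def nint(v):
    """Nearest integer; halves are rounded up (an assumed tie rule)."""
    return math.floor(v + 0.5)


def sign(v):
    """Signum function: -1, 0, or +1."""
    return (v > 0) - (v < 0)


def quantize(x, delta):
    """Quantization index i_x for coefficient x and scale factor delta."""
    return sign(x) * nint(delta * compress(x) - ROUNDING_OFFSET)


def dequantize(ix, delta):
    """Reconstructed coefficient x_hat from index ix and scale factor delta."""
    return sign(ix) * expand((abs(ix) + ROUNDING_OFFSET) / delta)
```

In this sketch a larger scale factor gives a finer quantizer, because delta multiplies the compressed coefficient before rounding. Coefficients of small magnitude fall in the dead zone and are assigned index 0, which reconstructs to exactly zero.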
Exemplary implementations of the scale factor 28 and quantization blocks 30 of FIG. 1 are shown in further detail in FIG. 2. The quantizer scale factor $\Delta_i$ 32 of each band is adjusted to match the masking profile, and thus, to minimize the average NMR of the frame for the given bit rate. The quantized coefficients 34 in each band are integers which are entropy coded using a Huffman codebook (not shown), and transmitted to the decoder. The quantizer scale factor $\Delta_i$ 32 for each band is transmitted as side information. The decoder 36 uses the same Huffman codebook to decode the encoded data, descaling it ($\Delta_i^{-1}$) and expanding it ($c^{-1}$) to reconstruct a replica $\hat{x}$ of the original data $x$.
For audio signals, it is generally true that when the value of a particular coefficient is high, a higher amount of distortion can be allowed in its quantization while maintaining perceptual quality. Therefore, a non-uniform quantizer, which may be implemented as a compressor 24 and USQ 26 in the companded domain, is used in AAC to quantize the coefficients. Since the allowed distortion (i.e., the masking threshold) associated with each band is not necessarily constant, the quantizer scale factor will vary from band to band, and AAC transmits these step sizes as side information. A widely used metric for measuring the distortion is the noise-to-mask ratio (NMR), which is a weighted MSE (WMSE) measure. Typically, the psychoacoustic model will define the WMSE metric to measure the perceived distortion, and the quantizer scale factors are selected to minimize that WMSE distortion metric.
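By way of illustration only, a weighted MSE of the NMR type may be sketched in Python as follows. The sketch assumes that the weight of each band is the reciprocal of its masking threshold, that the noise in a band is the mean squared quantization error, and that the frame-level figure is the plain average over bands. The function names and the data layout are hypothetical; an actual psychoacoustic model may normalize differently.

```python
def band_nmr(x, x_hat, mask_threshold):
    """Noise-to-mask ratio of one band.

    x, x_hat: original and reconstructed coefficients of the band.
    mask_threshold: masking threshold of the band (assumed > 0).
    """
    noise = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return noise / mask_threshold


def average_nmr(bands):
    """Average NMR of a frame.

    bands: list of (x, x_hat, mask_threshold) tuples, one per band.
    """
    return sum(band_nmr(x, x_hat, m) for x, x_hat, m in bands) / len(bands)
```

Under these assumptions, a band with a high masking threshold tolerates more quantization noise for the same NMR, which is why the scale factor that minimizes the metric differs from band to band.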
Re-quantization in the Enhancement-layer
FIG. 3 shows a conventional direct re-quantization approach for a bit rate scalable coder. Such an approach, for example, is applied in each band of a two-layer scalable AAC. Here, $\Delta_b$ 40 and $\Delta_e$ 42 represent the quantizer scale factors for the base-layer and the enhancement-layer, respectively. The reconstruction error $z$ is computed by subtracting (adder 44) the reconstructed base-layer data $\hat{x}_b$ from the original data $x$, and the enhancement-layer directly re-quantizes that reconstruction error $z$. The replica of $x$ (i.e., $\hat{x}$) is generated by adding the reconstructed approximations from the base-layer and the enhancement-layer, i.e., $\hat{x}_b$ and $\hat{z}$, respectively. The quantized indices and the quantizer scale factor are transmitted separately for the base-layer as well as for the enhancement-layer. The scale factors are chosen so as to minimize the distortion in the frame, for the target bit rate at that layer.
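By way of illustration only, the direct re-quantization of a single coefficient in two layers may be sketched in Python as follows. The sketch assumes that both layers use the companded quantizer of Equation (1), each with its own scale factor. The function names and the choice to round halves up are assumptions of the sketch, and the search for the scale factors is omitted.

```python
import math

OFFSET = 0.0946  # dead-zone rounding constant of Equation (1)


def quantize(x, delta):
    """Index of x under the companded quantizer with scale factor delta."""
    s = (x > 0) - (x < 0)
    return s * math.floor(delta * abs(x) ** 0.75 - OFFSET + 0.5)


def dequantize(ix, delta):
    """Reconstruction of index ix under scale factor delta."""
    s = (ix > 0) - (ix < 0)
    return s * ((abs(ix) + OFFSET) / delta) ** (4.0 / 3.0)


def encode_two_layers(x, delta_b, delta_e):
    """Return the base-layer and enhancement-layer indices for x."""
    i_b = quantize(x, delta_b)
    x_hat_b = dequantize(i_b, delta_b)   # what a base-only decoder obtains
    z = x - x_hat_b                      # base-layer reconstruction error
    i_e = quantize(z, delta_e)           # direct re-quantization of z
    return i_b, i_e


def decode_two_layers(i_b, i_e, delta_b, delta_e):
    """Return (base-only reconstruction, two-layer reconstruction)."""
    x_hat_b = dequantize(i_b, delta_b)
    z_hat = dequantize(i_e, delta_e)
    return x_hat_b, x_hat_b + z_hat
```

A decoder that receives only the base-layer index uses the first returned value, while a decoder that also receives the enhancement-layer index uses the second. Both scale factors must be known to the decoder, which is why each layer carries its own side information.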
In a typical conventional approach to scalable coding, each enhancement-layer merely performs a straightforward re-quantization of the reconstruction error of the preceding layer, typically using a re-scaled version of the previously used quantizer. Such a conventional approach yields good scalability when the distortion measure in the base-layer is an unweighted mean squared error (MSE) metric. However, a majority of practically employed objective metrics do not use MSE as the quality criterion, and a simple direct re-quantization approach will not in general result in optimizing the distortion metric for the enhancement-layer. For example, in conventional scalable AAC, the enhancement-layer encoder searches for a new set of quantizer scale factors, and transmits their values as side information. However, the information representing the scale factors may be substantial. At low rates of around 16 kbps, the information about quantizer scale factors of all the bands constitutes as much as 30%-40% of the bit stream in AAC.