A particularly attractive feature of an audio codec is scalability. In general, a scalable audio codec compresses the incoming audio into a master bitstream, which may or may not include a non-scalable base layer. Later, a parser may quickly extract a subset of the master bitstream and form an application bitstream at a lower bitrate, with a smaller number of channels, at a reduced audio sampling rate, or with any combination of the above.
Scalable audio compression greatly eases the design constraints of many systems that use audio compression. In many applications, it is difficult to foresee the exact compression ratio required at the time the audio is compressed. The ability to quickly change the compression ratio may lead to a better user experience in audio storage and transmission. For example, if the compression ratio of the stored audio is adjustable, the compressed audio can be further compacted to meet the exact requirements of the customer. One can build a stretchable audio recording device that at first stores the compressed audio at the highest possible quality (lowest possible compression ratio). Later, when the audio compressed at the highest quality no longer fits in the memory of the device, the compressed bitstreams of the existing audio files can be truncated to leave memory for newly recorded audio content. A device with scalable audio compression technology can perform this stretching step again and again, continuously increasing the compression ratio of the existing media, freeing up storage space, and squeezing in new content.
The ability to quickly adjust the compression ratio is also very useful in media communication and streaming scenarios, where the server and the client may adjust the size of the compressed audio to match the instantaneous bandwidth and condition of the network, and thus reliably deliver the best possible quality of the compressed media over the network.
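The stretching step of the recording device described above can be illustrated with a minimal Python sketch. It assumes that each stored bitstream is fully embedded, so that any prefix remains decodable at lower quality, and it shrinks all recordings in proportion to their current size. The names `Recording` and `make_room` and the proportional policy are hypothetical choices for illustration, not part of any particular codec.

```python
from dataclasses import dataclass


@dataclass
class Recording:
    name: str
    bitstream: bytes  # assumed embedded: any prefix is still decodable


def make_room(recordings, capacity, incoming_len):
    """Truncate stored bitstreams so that `incoming_len` more bytes fit.

    Each recording keeps a prefix proportional to its current length.
    This proportional policy is an illustrative assumption.
    """
    if incoming_len > capacity:
        raise ValueError("incoming audio is larger than the device memory")
    used = sum(len(r.bitstream) for r in recordings)
    budget = capacity - incoming_len  # bytes left for existing content
    if used <= budget:
        return list(recordings)  # nothing needs to be truncated
    # Integer floor division guarantees the total never exceeds the budget.
    return [
        Recording(r.name, r.bitstream[: len(r.bitstream) * budget // used])
        for r in recordings
    ]


if __name__ == "__main__":
    stored = [Recording("a", bytes(600)), Recording("b", bytes(400))]
    shrunk = make_room(stored, capacity=1000, incoming_len=500)
    print([len(r.bitstream) for r in shrunk])
```

Repeated calls to `make_room` correspond to repeated stretching: each call raises the compression ratio of the existing recordings a little further.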
Moreover, multiple description coding may also be applied to a scalable coded audio bitstream. The idea is to apply more protection (using forward error correction of several sorts) to the more important part of the bitstream (the base layer), and less protection to the less important part of the bitstream (the enhancement layer). Thus, even with a large number of lost packets, the head portion of the compressed bitstream is likely to be preserved. As a result, the quality of the delivered audio degrades gracefully as the packet loss ratio increases.
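The graceful degradation described above can be sketched in a few lines of Python. The sketch assumes that every layer is spread over the same n packets and protected by an ideal erasure code, so that layer i is recoverable whenever at least k_i of the n packets arrive, and that a layer is useful only if all earlier layers were also recovered. The function name and the example values of k are hypothetical.

```python
def decodable_layers(k_per_layer, received):
    """Count the leading layers recoverable from `received` packets.

    k_per_layer[i] is the number of packets needed to recover layer i
    under an assumed ideal (n, k_i) erasure code. Smaller k means more
    redundancy, so important layers get the smallest k.
    """
    count = 0
    for k in k_per_layer:
        if received < k:
            break  # later layers depend on this one, so stop here
        count += 1
    return count


if __name__ == "__main__":
    n = 10                       # packets sent per block (assumed)
    k_per_layer = [2, 5, 8, 10]  # base layer first, most protected
    for lost in range(n + 1):
        layers = decodable_layers(k_per_layer, n - lost)
        print(f"lost {lost:2d} of {n} packets -> {layers} layer(s) decodable")
```

With these example values, the base layer survives the loss of up to 8 of the 10 packets, while the last enhancement layer is lost as soon as a single packet is missing.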
An existing set of scalable audio tools provides various levels of scalability. The following paragraphs review a selected set of scalable audio configurations. The scalable audio tools are divided into three major groups: the pure bit-scalable audio coders, the parametric scalable audio coders, and the enhancement layer scalable audio coders.
A. Pure Bit-Scalable Audio Coders.
Two examples of pure bit-scalable audio coding are Bit-Sliced Arithmetic Coding (BSAC) and the Progressive-to-Lossless Embedded Audio Codec (PLEAC). In BSAC, replacing the entropy coding core of the Advanced Audio Coding (AAC) codec with a bitplane arithmetic codec achieves fine grain scalability (with steps down to 1 kbps per channel). PLEAC is a highly flexible embedded audio coder that is capable of scaling from a low bitrate all the way to lossless.
Both BSAC and PLEAC are pure bit-scalable audio coders. They do not support the use of a non-scalable base layer coder. Within the coder, they use gradual refinement approaches, e.g., bitplane coding (in BSAC) and sub-bitplane coding in psychoacoustic order (in PLEAC), to gradually refine the audio transform coefficients. Though the perceptual audio compression performance of these pure scalable audio coders can be satisfactory across a large bitrate range, at certain bitrate points, specifically at low bitrates, their performance may be inferior to that of a highly optimized non-scalable audio coder designed to operate at that bitrate. Such a performance difference between the scalable and the non-scalable audio coder at low bitrates may hamper the adoption of the scalable audio coder and prevent it from being used in many applications.
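The gradual refinement by bitplane coding mentioned above can be illustrated with a minimal Python sketch. It works on integer transform coefficients in sign-magnitude form and emits the magnitude bits plane by plane, most significant plane first, so that a decoder holding only the first few planes obtains a coarse version of every coefficient. The sketch omits the arithmetic coding and psychoacoustic ordering that BSAC and PLEAC add on top, and it sends all signs up front, which is a simplification. The function names are hypothetical.

```python
def encode_bitplanes(coeffs, num_planes):
    """Split integer coefficients into signs and MSB-first bitplanes."""
    signs = [c < 0 for c in coeffs]
    mags = [abs(c) for c in coeffs]
    if any(m >= (1 << num_planes) for m in mags):
        raise ValueError("coefficient does not fit in num_planes bits")
    planes = [
        [(m >> p) & 1 for m in mags]
        for p in range(num_planes - 1, -1, -1)  # most significant first
    ]
    return signs, planes


def decode_bitplanes(signs, planes, num_planes):
    """Rebuild coefficients from the planes received so far.

    `planes` may hold fewer than `num_planes` entries; bits that were
    not received are treated as zero.
    """
    mags = [0] * len(signs)
    for i, plane in enumerate(planes):
        shift = num_planes - 1 - i
        for j, bit in enumerate(plane):
            mags[j] |= bit << shift
    return [-m if s else m for m, s in zip(mags, signs)]


if __name__ == "__main__":
    coeffs = [13, -6, 3, 0, -1]
    signs, planes = encode_bitplanes(coeffs, num_planes=4)
    for kept in range(1, 5):
        print(kept, decode_bitplanes(signs, planes[:kept], 4))
```

Truncating the list of planes plays the role of truncating the bitstream: each additional plane halves the worst-case magnitude error of every coefficient.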
In many applications, very low audio quality is not acceptable, and scalability at low bitrates may not be needed. In such cases, a non-scalable base layer codec may be more efficient. A scalable codec operating on top of the base layer can then be used, as discussed under enhancement layer scalable audio coding below. The existence of a base layer also allows providers, deliverers, creators, and others who handle content to ensure a minimum quality.
The inefficiency of scalable codecs at low bitrates may be due to several causes, including (a) the perceptual distortion model and (b) the quantizer (which could be construed as combining signal representation, quantization, and coding). For the quantizer, it is known that at very low bitrates, vector quantization (VQ) provides superior rate-distortion (R-D) performance. However, at high bitrates, a scalar quantizer (SQ) codec is preferred for its low implementation complexity. It is difficult to build an integrated scalable codec with VQ at lower bitrates and SQ at higher bitrates. For the perceptual distortion model, the traditional approach of calculating the masking threshold from the input audio signal breaks down for low-bitrate, low-quality coding. The alternate approach used in PLEAC lets the masking threshold be updated during the encoding process. This approach also breaks down for low-bitrate, low-quality coding, because the audio signal decoded at a low bitrate does not carry sufficient information to derive an accurate masking threshold.
B. Parametric Scalable Audio Coders.
Parametric scalable audio coding schemes include AAC+ parametric coding, scalable natural speech coding, and parametric audio coding tools. These are discussed in the following paragraphs.
AAC+ parametric coding, as in MPEG-4 audio, provides tools for enhancing the compression performance of the AAC-based codec by parametric coding approaches. Spectral Band Replication (SBR) synthesizes the high-frequency range of the audio signal from the transmitted band-limited audio signal and a small amount of side information. Parametric Stereo (PS) allows the synthesis of a stereo output from a transmitted monophonic signal and a small amount of side information. Both the SBR and PS tools allow the audio to scale beyond what is coded in the base layer. However, the quality improvements achievable with the SBR and PS tools are limited, and the tools are not presently effective when very high audio quality is required.
Scalable natural speech coding schemes include Harmonic Vector Excitation Coding (HVXC) and Code Excited Linear Prediction (CELP); parametric audio coding tools include Harmonic and Individual Lines and Noise (HILN) coding. Within a single coding scheme of HVXC, CELP, or HILN, MPEG-4 can also provide a certain degree of scalability. HVXC and CELP provide scalability in 2 kbps steps for narrowband (8 kHz sampling) speech. CELP also allows bandwidth scalability from narrowband speech to wideband (16 kHz sampling) speech using a 10 kbps enhancement layer. HILN provides scalable configurations with a base layer and one or more additional extension layers.
In general, a parametric scalable audio coding approach may be used to enhance the performance of the base layer coder. However, all of the above scalability tools achieve only Large Step (coarse grain) scalability. Moreover, there is no tool that allows the coded bitstream to scale from low-bitrate parametric audio coding to more generic waveform audio coding. As a result, parametric scalable audio coders do not scale all the way to perceptually lossless or true lossless quality.
C. Enhancement Layer Scalable Audio Coders.
Two types of enhancement layer scalable audio codecs are scalable AAC and schemes that scale towards high quality or lossless coding.
In scalable AAC, several stages of the AAC codec can be cascaded to achieve so-called Large Step Scalability (e.g., 8 kbps steps). This approach achieves good compression performance at the base layer. However, the performance degrades as the number of stages increases. The approach has two main shortcomings. First, each encoding layer of scalable AAC re-quantizes the reconstruction error of the preceding layer using a nonuniform quantizer and a quantization step size that is a power of 2^(1/4). Second, the source coder of AAC is optimized to encode the quantized coefficients of the base layer, and is far from optimal in encoding the residue error in the enhancement layer. For both reasons, the performance of scalable AAC is well below that of non-scalable AAC at any rate beyond the base layer rate.
One scheme that scales towards high quality or lossless coding, the Scalable Lossless Coding (SLS) scheme, is designed to provide fine grain enhancement up to lossless reconstruction. In short, the key is to replace the floating-point Modified Discrete Cosine Transform (MDCT) with a low-noise MDCT, and then to use an entropy coder that can code the coefficients all the way to lossless. Like scalable AAC, SLS yields scalability only in the mean squared error (MSE) sense and not in the perceptual sense.
Both enhancement layer scalable audio coders above employ a good non-scalable audio coder as the base layer. The residue between the decoded base layer audio and the original audio is then encoded (in large step refinement or fine grain refinement) by an enhancement layer coder. What is significant and missing among the existing scalable audio coding approaches is the use of the psychoacoustic information embedded in the base layer and/or the error signal to guide the scalable coding of the enhancement layer, thereby achieving not MSE scalability but perceptual scalability. Moreover, as enhancement information is added, additional psychoacoustic information may become available, but it is not used to guide the formation of further enhancement information.
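The residue coding structure described above, in which each enhancement stage codes what the preceding stages left behind, can be sketched in Python as follows. The step size of each stage is 2^(sf/4) for an integer scale factor index sf, following the power-of-2^(1/4) step sizes mentioned earlier. As a simplification, the sketch uses a uniform rounding quantizer in place of a nonuniform one and applies no entropy coding or psychoacoustic weighting. All function names and the example scale factors are hypothetical.

```python
def step_size(sf):
    """Quantization step size for scale factor index sf: 2^(sf/4)."""
    return 2.0 ** (sf / 4.0)


def encode_layers(signal, scale_factors):
    """Quantize the signal, then re-quantize the residue stage by stage."""
    residual = list(signal)
    layers = []
    for sf in scale_factors:
        step = step_size(sf)
        indices = [round(r / step) for r in residual]  # uniform quantizer (assumed)
        layers.append((sf, indices))
        # The next stage sees only the error left by this stage.
        residual = [r - q * step for r, q in zip(residual, indices)]
    return layers


def decode_layers(layers, length):
    """Sum the contributions of however many layers were received."""
    out = [0.0] * length
    for sf, indices in layers:
        step = step_size(sf)
        out = [o + q * step for o, q in zip(out, indices)]
    return out


if __name__ == "__main__":
    signal = [37.3, -12.8, 5.1, 0.4]
    layers = encode_layers(signal, scale_factors=[16, 8, 0])  # steps 16, 4, 1
    for n in range(1, len(layers) + 1):
        recon = decode_layers(layers[:n], len(signal))
        err = max(abs(s - r) for s, r in zip(signal, recon))
        print(f"{n} layer(s): max abs error = {err:.3f}")
```

Each added layer bounds the sample error by half of that layer's step size, which is refinement in the MSE sense only; nothing in this structure takes perceptual masking into account.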