1. Field
The following description generally relates to encoders and decoders and, in particular, to an efficient way of coding a modified discrete cosine transform (MDCT) spectrum as part of a scalable speech and audio codec.
2. Background
One goal of audio coding is to compress an audio signal into a desired limited information quantity while keeping as much of the original sound quality as possible. In an encoding process, an audio signal in a time domain is transformed into a frequency domain.
Perceptual audio coding techniques, such as MPEG Layer-3 (MP3), MPEG-2 and MPEG-4, make use of the signal masking properties of the human ear in order to reduce the amount of data. The quantization noise is distributed across frequency bands in such a way that it is masked by the dominant signal components, i.e., it remains inaudible. Considerable storage size reduction is possible with little or no perceptible loss of audio quality.
Perceptual audio coding techniques are often scalable and produce a layered bit stream having a base or core layer and at least one enhancement layer. This allows bit-rate scalability, i.e. decoding at different audio quality levels at the decoder side or reducing the bit rate in the network by traffic shaping or conditioning.
Code excited linear prediction (CELP) is a class of algorithms, including algebraic CELP (ACELP), relaxed CELP (RCELP), low-delay CELP (LD-CELP) and vector sum excited linear prediction (VSELP), that is widely used for speech coding. One principle behind CELP is called Analysis-by-Synthesis (AbS) and means that the encoding (analysis) is performed by perceptually optimizing the decoded (synthesis) signal in a closed loop. In theory, the best CELP stream would be produced by trying all possible bit combinations and selecting the one that produces the best-sounding decoded signal. This is not possible in practice for two reasons: it would be very complicated to implement, and the “best-sounding” selection criterion implies a human listener. In order to achieve real-time encoding using limited computing resources, the CELP search is broken down into smaller, more manageable, sequential searches using a perceptual weighting function. Typically, the encoding includes (a) computing and/or quantizing (usually as line spectral pairs) linear predictive coding coefficients for an input audio signal, (b) using codebooks to search for a best match to generate a coded signal, (c) producing an error signal which is the difference between the coded signal and the real input signal, and (d) further encoding such error signal (usually in an MDCT spectrum) in one or more layers to improve the quality of a reconstructed or synthesized signal.
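Steps (b) and (c) above can be illustrated with a minimal sketch of a closed-loop codebook search. It is a toy model under stated assumptions, not the search of any particular codec: the function names `synthesize` and `abs_search`, the tiny codebook, and the first-order filter coefficient are all hypothetical, and a plain squared error stands in for the perceptual weighting function that a real CELP encoder would apply.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis filter: s[n] = e[n] - sum_k lpc[k] * s[n-1-k]."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lpc):
            if n - 1 - k >= 0:
                acc -= a * out[n - 1 - k]
        out.append(acc)
    return out


def abs_search(target, codebook, lpc):
    """Try every codevector, decode it, and keep the closest match.

    Returns (index, gain, error), where error is the difference between
    the target and the scaled synthesized signal (step (c) in the text).
    Assumption: unweighted squared error replaces perceptual weighting.
    Returns None if every codevector synthesizes to silence.
    """
    best = None
    best_energy = None
    for index, codevector in enumerate(codebook):
        synth = synthesize(codevector, lpc)
        energy = sum(s * s for s in synth)
        if energy == 0.0:
            continue  # a silent candidate cannot be scaled to fit
        # Least-squares optimal gain for this candidate.
        gain = sum(t * s for t, s in zip(target, synth)) / energy
        error = [t - gain * s for t, s in zip(target, synth)]
        error_energy = sum(e * e for e in error)
        if best_energy is None or error_energy < best_energy:
            best = (index, gain, error)
            best_energy = error_energy
    return best


# Toy usage: the target is codevector 1 passed through the filter, scaled by 2.
lpc = [-0.5]  # hypothetical first-order predictor coefficient
codebook = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, -1.0, 1.0, -1.0],
]
target = [2.0 * s for s in synthesize(codebook[1], lpc)]
index, gain, error = abs_search(target, codebook, lpc)
```

In this contrived case the search recovers the entry that generated the target, so the error signal is zero. With a real input the error is nonzero, and it is this residual that step (d) encodes in one or more additional layers.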
Many different techniques are available to implement speech and audio codecs based on CELP algorithms. In some of these techniques, an error signal is generated which is subsequently transformed (usually using a DCT, MDCT, or similar transform) and encoded to further improve the quality of the encoded signal. However, due to the processing and bandwidth limitations of many mobile devices and networks, efficient implementation of such MDCT spectrum coding is desirable to reduce the size of information being stored or transmitted.
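As a rough illustration of transforming an error signal into an MDCT spectrum, the sketch below applies a direct (non-fast) MDCT to overlapping frames and reconstructs the signal by overlap-add. It is an assumption-laden example rather than the method of any specific codec: the sine window, the frame size, the helper names, and the stand-in error signal are all chosen here for illustration, and no quantization or coding of the coefficients is performed.

```python
import math


def sine_window(N):
    """Sine window of length 2N; satisfies w[n]^2 + w[n+N]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]


def mdct(block):
    """Direct O(N^2) MDCT: 2N windowed samples in, N coefficients out."""
    N = len(block) // 2
    n0 = 0.5 + N / 2
    return [
        sum(block[n] * math.cos(math.pi / N * (n + n0) * (k + 0.5))
            for n in range(2 * N))
        for k in range(N)
    ]


def imdct(coeffs):
    """Inverse MDCT: N coefficients in, 2N time-aliased samples out.

    The 2/N scaling pairs with a window applied at both analysis and synthesis.
    """
    N = len(coeffs)
    n0 = 0.5 + N / 2
    return [
        (2.0 / N) * sum(coeffs[k] * math.cos(math.pi / N * (n + n0) * (k + 0.5))
                        for k in range(N))
        for n in range(2 * N)
    ]


def encode_frames(signal, N):
    """Split into 50%-overlapping frames (hop N) and transform each one.

    Assumes len(signal) is a multiple of N; pads N zeros on each side.
    """
    w = sine_window(N)
    padded = [0.0] * N + list(signal) + [0.0] * N
    frames = []
    for start in range(0, len(padded) - 2 * N + 1, N):
        block = [padded[start + n] * w[n] for n in range(2 * N)]
        frames.append(mdct(block))
    return frames


def decode_frames(frames, N):
    """Inverse-transform, window again, and overlap-add to cancel aliasing."""
    w = sine_window(N)
    out = [0.0] * (N * (len(frames) + 1))
    for i, coeffs in enumerate(frames):
        y = imdct(coeffs)
        for n in range(2 * N):
            out[i * N + n] += y[n] * w[n]
    return out[N:-N]  # drop the zero padding


# Toy usage: a short stand-in for an error signal, frame size N = 4.
error_signal = [math.sin(0.3 * n) for n in range(16)]
frames = encode_frames(error_signal, 4)
rebuilt = decode_frames(frames, 4)
```

Each frame of 2N input samples yields only N coefficients, so the overlapped transform does not increase the amount of data. Without quantization, `rebuilt` matches `error_signal` up to floating-point rounding; in a codec, the coefficients in `frames` are what would be quantized and packed into the bit stream.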