In embedded coding, also known as layered coding, the sound signal is encoded in a first layer to produce a first bit stream, and then the error between the original sound signal and the encoded signal (synthesis sound signal) from the first layer is further encoded to produce a second bit stream. This can be repeated for more layers by encoding the error between the original sound signal and the synthesis sound signal from all preceding layers. The bit streams of all layers are concatenated for transmission. The advantage of layered coding is that parts of the bit stream (corresponding to upper layers) can be dropped in the network (e.g. in case of congestion) while still being able to decode the encoded sound signal at the receiver depending on the number of received layers. Layered coding is also useful in multicast applications where the encoder produces the bit stream of all layers and the network decides to send different bit rates to different end points depending on the available bit rate within each link.
Embedded or layered coding can be also useful to improve the quality of widely used existing codecs while still maintaining interoperability with these codecs. Adding layers to the standard codec lower (or core) layer can improve the quality and even increase the encoded audio signal bandwidth. An example is the recently standardized ITU-T Recommendation G.729.1 in which the lower (or core) layer is interoperable with the widely used narrowband ITU-T Recommendation G.729 operating at 8 kbit/s. The upper layers of ITU-T Recommendation G.729.1 produce bit rates up to 32 kbit/s (with wideband signal starting from 14 kbit/s). Current standardization work aims at adding mode layers to produce super wideband (14 kHz bandwidth) and stereo extensions. Another example is Recommendation G.718 recently approved by ITU-T [1] for encoding wideband signals at 8, 12, 16, 24, and 32 kbit/s. This codec was previously known as EV-VBR codec and was undertaken by Q9/16 in ITU-T. In the following description, reference to EV-VBR shall mean reference to ITU-T Recommendation G.718. The EV-VBR codec is also envisaged to be extended to encode super wideband and stereo signals at higher bit rates. As a non-limitative example, the EV-VBR codec will be used in the non-restrictive, illustrative embodiments of the present invention since the technique disclosed in the present disclosure is now part of ITU-T Recommendation G.718.
The requirements for embedded codecs usually comprise good quality in case of both speech and audio signals. Since speech can be encoded at relatively low bit rate using a model-based approach, the lower layer (or first two lower layers) is encoded using a speech specific technique and the error signal for the upper layers is encoded using a more generic audio coding technique. This approach delivers a good speech quality at low bit rates and a good audio quality as the bit rate increases. In the EV-VBR codec (and also in ITU-T Recommendation G.729.1), the two lower layers are based on the ACELP (algebraic code-excited linear prediction) technique which is suitable for encoding speech signals. In the upper layers, transform-based coding suitable for audio signals is used to encode the error signal (the difference between the input sound signal and the output (synthesized sound signal) from the two lower layers). In the upper layers, the well known MDCT transform is used, where the error signal is transformed into the frequency domain using windows with 50% overlap. The MDCT coefficients can be quantized using several techniques, for example scalar quantization with Hoffman coding, vector quantization, or any other technique. In the EV-VBR codec, algebraic vector quantization (AVQ) is used to quantize the MDCT coefficients among other techniques.
The spectrum quantizer has to quantize a range of frequencies with a maximum amount of bits. Usually the amount of bits is not high enough to quantize perfectly all frequency bins. The frequency bins with highest energy are quantized first (where the weighted spectral error is higher), then the remaining frequency bins are quantized, if possible. When the amount of available bits is not sufficient, the lowest energy frequency bins are only roughly quantized and the quantization of these lowest energy frequency bins may vary from one frame to the other. This rough quantization leads to an audile quantization noise especially between 2 kHz and 4 kHz. Accordingly, there is a need for a technique for reducing the quantization noise caused by a lack of bits to quantize all energy frequency bins in the spectrum or by too large a quantization step.