In audio and speech coding, there are mainly two types of coding approaches: Transform Coding and Linear Prediction Coding.
Transform coding transforms the signal from the time domain to the spectral domain, for example using the Discrete Fourier Transform (DFT) or the Modified Discrete Cosine Transform (MDCT). The spectral coefficients are quantized and encoded. During quantization or encoding, a psychoacoustic model is normally applied to determine the perceptual importance of the spectral coefficients, which are then quantized or encoded according to that importance. Some popular transform codecs are MPEG MP3, MPEG AAC (see NPL 1) and Dolby AC3. Transform coding is effective for music or general audio signals. A simple framework of a transform codec is shown in FIG. 1.
In the encoder illustrated in FIG. 1, the time domain signal S(n) is transformed into frequency domain signal S(f) using time to frequency transformation method (101), such as Discrete Fourier Transform (DFT) or Modified Discrete Cosine Transform (MDCT).
Psychoacoustic model analysis is performed on the frequency domain signal S(f) to derive the masking curve (103). Quantization is performed on the frequency domain signal S(f) according to the masking curve derived from the psychoacoustic model analysis, so that the quantization noise is inaudible (102).
The quantization parameters are multiplexed (104) and transmitted to the decoder side.
In the decoder illustrated in FIG. 1, at the start, all the bitstream information is de-multiplexed (105). The quantization parameters are dequantized to reconstruct the decoded frequency domain signal {tilde over (S)}(f) (106).
The decoded frequency domain signal {tilde over (S)}(f) is transformed back to time domain, to reconstruct the decoded time domain signal {tilde over (S)}(n) using frequency to time transformation method (107), such as Inverse Discrete Fourier Transform (IDFT: Inverse Discrete Fourier Transform) or Inverse Modified Discrete Cosine Transform (IMDCT: Inverse Modified Discrete Cosine Transform).
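The encoder and decoder chain of FIG. 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any codec named above: a naive DFT stands in for block 101, and a uniform scalar quantizer with a single fixed step (the hypothetical parameter `step`) stands in for the masking-curve-driven quantization of blocks 102 and 103. All function names are illustrative.

```python
import cmath

def dft(x):
    """Naive DFT: time domain S(n) -> frequency domain S(f)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def quantize(spec, step):
    # Assumption: one fixed step for every coefficient. A real codec would
    # derive the step per band from the psychoacoustic masking curve.
    return [(round(c.real / step), round(c.imag / step)) for c in spec]

def dequantize(indices, step):
    return [complex(re * step, im * step) for re, im in indices]

def transform_codec_roundtrip(signal, step):
    indices = quantize(dft(signal), step)    # encoder side (101, 102)
    return idft(dequantize(indices, step))   # decoder side (106, 107)

signal = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
decoded = transform_codec_roundtrip(signal, step=0.01)
max_error = max(abs(a - b) for a, b in zip(signal, decoded))
```

A smaller `step` lowers `max_error` at the cost of larger quantization indices, which is the trade-off the psychoacoustic model is meant to steer.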
On the other hand, linear prediction coding exploits the predictable nature of speech signals in the time domain and obtains the residual/excitation signal by applying linear prediction to the input speech signal. For speech signals, especially voiced regions, which have a resonant effect and a high degree of similarity over time shifts that are multiples of their pitch periods, this modelling produces a very efficient representation of the sound. After the linear prediction, the residual/excitation signal is mainly encoded by one of two methods: TCX and CELP.
In TCX (see NPL 2), the residual/excitation signal is transformed and encoded efficiently in the frequency domain. Some popular TCX codecs are 3GPP AMR-WB+ and MPEG USAC. A simple framework of a TCX codec is shown in FIG. 2.
In the encoder illustrated in FIG. 2, LPC analysis is done on the input signal to exploit the predictable nature of signals in time domain (201). The LPC coefficients from the LPC analysis are quantized (202), and the quantization indices are multiplexed (207) and transmitted to the decoder side. Using the LPC coefficients dequantized by the dequantization section (203), the residual (excitation) signal Sr(n) is obtained by applying LPC inverse filtering on the input signal S(n) (204).
The residual signal Sr(n) is transformed to frequency domain signal Sr(f) using time to frequency transformation method (205), such as Discrete Fourier Transform (DFT) or Modified Discrete Cosine Transform (MDCT).
Quantization is performed on Sr(f) (206) and quantization parameters are multiplexed (207) and transmitted to the decoder side.
In the decoder illustrated in FIG. 2, at the start, all the bitstream information is de-multiplexed (208).
The quantization parameters are dequantized to reconstruct the decoded frequency domain residual signal {tilde over (S)}r(f) (210).
The decoded frequency domain residual signal {tilde over (S)}r(f) is transformed back to time domain, to reconstruct the decoded time domain residual signal {tilde over (S)}r(n) using frequency to time transformation method (211), such as Inverse Discrete Fourier Transform (IDFT) or Inverse Modified Discrete Cosine Transform (IMDCT).
Using the LPC parameters dequantized by the dequantization section (209), the decoded time domain residual signal {tilde over (S)}r(n) is processed by the LPC synthesis filter (212) to obtain the decoded time domain signal {tilde over (S)}(n).
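The relationship between LPC inverse filtering (204) and LPC synthesis filtering (212) can be illustrated as follows. This is a sketch under assumptions: the prediction is written as a weighted sum of past samples with coefficients `lpc` (a hypothetical name and sign convention), samples before the start of the frame are treated as zero, and the residual is passed through unquantized so that the round trip is exact.

```python
def lpc_inverse_filter(signal, lpc):
    """Residual r[n] = s[n] - sum_k a_k * s[n-k], with s[n] = 0 for n < 0."""
    residual = []
    for n, s_n in enumerate(signal):
        prediction = sum(a * signal[n - k]
                         for k, a in enumerate(lpc, start=1) if n - k >= 0)
        residual.append(s_n - prediction)
    return residual

def lpc_synthesis_filter(residual, lpc):
    """Inverse operation: s[n] = r[n] + sum_k a_k * s[n-k]."""
    output = []
    for n, r_n in enumerate(residual):
        prediction = sum(a * output[n - k]
                         for k, a in enumerate(lpc, start=1) if n - k >= 0)
        output.append(r_n + prediction)
    return output

# Illustrative values only: a short decaying oscillation and a 2nd-order predictor.
signal = [1.0, 0.8, 0.3, -0.2, -0.6, -0.7]
lpc = [1.2, -0.5]
residual = lpc_inverse_filter(signal, lpc)
reconstructed = lpc_synthesis_filter(residual, lpc)
```

In a TCX codec the residual would be transformed and quantized between these two steps, so the reconstruction would only approximate the input.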
In CELP coding, the residual/excitation signal is quantized using a predetermined codebook. To further enhance the sound quality, it is common to transform the difference signal between the original signal and the LPC synthesized signal to the frequency domain and encode it further. Some popular CELP codecs are ITU-T G.729.1 (see NPL 3) and ITU-T G.718 (see NPL 4). A simple framework of hierarchical coding (layered coding, embedded coding) of CELP and transform coding is shown in FIG. 3.
In the encoder illustrated in FIG. 3, CELP encoding is done on the input signal to exploit the predictable nature of signals in time domain (301). With the CELP parameters, the synthesized signal is reconstructed by the CELP local decoder (302). The prediction error signal Se(n) (the difference signal between the input signal and the synthesized signal) is obtained by subtracting the synthesized signal from the input signal.
The prediction error signal Se(n) is transformed into frequency domain signal Se(f) using time to frequency transformation method (303), such as Discrete Fourier Transform (DFT) or Modified Discrete Cosine Transform (MDCT).
Quantization is performed on Se(f) (304) and quantization parameters are multiplexed (305) and transmitted to the decoder side.
In the decoder illustrated in FIG. 3, at the start, all the bitstream information is de-multiplexed (306).
The quantization parameters are dequantized to reconstruct the decoded frequency domain residual signal {tilde over (S)}e(f) (308).
The decoded frequency domain residual signal {tilde over (S)}e(f) is transformed back to time domain, to reconstruct the decoded time domain residual signal {tilde over (S)}e(n) using frequency to time transformation method (309), such as Inverse Discrete Fourier Transform (IDFT) or Inverse Modified Discrete Cosine Transform (IMDCT).
With the CELP parameters, the CELP decoder reconstructs the synthesized signal Ssyn(n) (307). The decoded time domain signal {tilde over (S)}(n) is then reconstructed by adding the CELP synthesized signal Ssyn(n) and the decoded prediction error signal {tilde over (S)}e(n).
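The layered structure of FIG. 3 reduces to a subtraction at the encoder and an addition at the decoder. The sketch below assumes the CELP core output is already available as a list `core_synth` (a hypothetical stand-in for blocks 301, 302 and 307) and replaces the transform and quantization of the error signal with coarse rounding to a fixed step. Both simplifications are assumptions made for brevity.

```python
def prediction_error(signal, core_synth):
    """Se(n): input signal minus the core layer's synthesized signal."""
    return [s - c for s, c in zip(signal, core_synth)]

def coarse_quantize(values, step):
    # Stand-in for transform + quantization (303, 304) and their inverses (308, 309).
    return [round(v / step) * step for v in values]

def layered_decode(core_synth, decoded_error):
    """Decoded output: core synthesized signal plus decoded error signal."""
    return [c + e for c, e in zip(core_synth, decoded_error)]

# Illustrative values only.
signal = [0.9, 0.4, -0.3, -0.8]
core_synth = [1.0, 0.5, -0.5, -0.5]
decoded_error = coarse_quantize(prediction_error(signal, core_synth), step=0.25)
decoded = layered_decode(core_synth, decoded_error)

core_distortion = sum(abs(s - c) for s, c in zip(signal, core_synth))
enhanced_distortion = sum(abs(s - d) for s, d in zip(signal, decoded))
```

With these inputs the enhancement layer lowers the total absolute error relative to the core layer alone, which is the purpose of coding the difference signal.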
The transform coding and the transform coding part in linear prediction coding are normally performed by utilizing some quantization methods.
One such vector quantization method is called split multi-rate lattice VQ or algebraic VQ (AVQ) (see NPL 5). In AMR-WB+ (see NPL 6), split multi-rate lattice VQ is used to quantize the LPC residual in TCX domain (as shown in FIG. 4). In the newly standardized speech codec ITU-T G.718, split multi-rate lattice VQ is also used to quantize the LPC residue in MDCT domain as residue coding layer 3.
Split multi-rate lattice VQ is a vector quantization method based on lattice quantizers. Specifically, for the split multi-rate lattice VQ used in AMR-WB+ (see NPL 6), the spectrum is quantized in blocks of 8 spectral coefficients using vector codebooks composed of subsets of the Gosset lattice, referred to as the RE8 lattice (see NPL 5).
All points of a given lattice can be generated from the so-called square generator matrix G of the lattice, as c=s·G, where s is a row vector with integer values and c is the generated lattice point.
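The relation c=s·G can be tried out directly. The text above does not print G, so the matrix below is an assumption: it is one commonly used generator matrix for the RE8 lattice, taking RE8 as the union of 2D8 and 2D8 shifted by (1, 1, 1, 1, 1, 1, 1, 1). The membership check `is_re8_point` is likewise an illustrative helper based on that definition.

```python
# Assumed generator matrix for RE8 (lower triangular, one row per basis vector).
RE8_GENERATOR = [
    [4, 0, 0, 0, 0, 0, 0, 0],
    [2, 2, 0, 0, 0, 0, 0, 0],
    [2, 0, 2, 0, 0, 0, 0, 0],
    [2, 0, 0, 2, 0, 0, 0, 0],
    [2, 0, 0, 0, 2, 0, 0, 0],
    [2, 0, 0, 0, 0, 2, 0, 0],
    [2, 0, 0, 0, 0, 0, 2, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]

def lattice_point(s, generator=RE8_GENERATOR):
    """Compute c = s . G for an integer row vector s."""
    dim = len(generator[0])
    return [sum(s[i] * generator[i][j] for i in range(len(s))) for j in range(dim)]

def is_re8_point(c):
    """True if c is all-even or all-odd and its components sum to a multiple of 4."""
    same_parity = len({v % 2 for v in c}) == 1
    return same_parity and sum(c) % 4 == 0

c_even = lattice_point([0, 1, 0, 0, 0, 0, 0, 0])   # a point of 2D8
c_odd = lattice_point([1, 0, 0, 0, 0, 0, 0, 1])    # a point of the shifted coset
```

Any integer vector s yields a point that passes `is_re8_point`, which is what makes the lattice usable as an unbounded, regular set of candidate code vectors.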
To form a vector codebook at a given rate, only lattice points inside a sphere (in 8 dimensions) of a given radius are taken. Multi-rate codebooks can thus be formed by taking subsets of lattice points inside spheres of different radii.
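The idea of cutting a codebook out of the lattice with a sphere can be illustrated by brute force. The sketch below assumes the RE8 definition as the union of 2D8 and 2D8 shifted by the all-ones vector, and enumerates every integer vector whose squared norm does not exceed `radius_sq` (a hypothetical parameter name). It is only practical for small radii and is not how the codebooks of any standard are constructed.

```python
from itertools import product
from math import isqrt

def re8_points_within(radius_sq):
    """All RE8 points c with squared Euclidean norm <= radius_sq (brute force)."""
    max_coord = isqrt(radius_sq)  # no coordinate can exceed the radius
    points = []
    for c in product(range(-max_coord, max_coord + 1), repeat=8):
        if sum(v * v for v in c) > radius_sq:
            continue
        # RE8 membership: all components even or all odd, and sum divisible by 4.
        if len({v % 2 for v in c}) == 1 and sum(c) % 4 == 0:
            points.append(c)
    return points

# Squared radius 8 keeps the origin plus the 240 shortest non-zero lattice points.
codebook = re8_points_within(8)
```

Increasing the radius admits more lattice points, so the code vector index needs more bits. That is the sense in which spheres of different radii give codebooks of different rates.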
A simple framework which utilizes the split multi-rate vector quantization in TCX codec is illustrated in FIG. 4.
In the encoder illustrated in FIG. 4, LPC analysis is done on the input signal to exploit the predictable nature of signals in time domain (401). The LPC coefficients from the LPC analysis are quantized (402), and the quantization indices are multiplexed (407) and transmitted to the decoder side. Using the LPC coefficients dequantized by the dequantization section (403), the residual (excitation) signal Sr(n) is obtained by applying LPC inverse filtering on the input signal S(n) (404).
The residual signal Sr(n) is transformed to frequency domain signal Sr(f) using time to frequency transformation method (405), such as Discrete Fourier Transform (DFT) or Modified Discrete Cosine Transform (MDCT).
Split multi-rate lattice vector quantization method is applied on Sr(f) (406) and quantization parameters are multiplexed (407) and transmitted to the decoder side.
In the decoder illustrated in FIG. 4, at the start, all the bitstream information is de-multiplexed (408).
The quantization parameters are dequantized by split multi-rate lattice vector dequantization method to reconstruct the decoded frequency domain residual signal {tilde over (S)}r(f) (410).
The decoded frequency domain residual signal {tilde over (S)}r(f) is transformed back to time domain, to reconstruct the decoded time domain residual signal {tilde over (S)}r(n) using frequency to time transformation method (411), such as Inverse Discrete Fourier Transform (IDFT) or Inverse Modified Discrete Cosine Transform (IMDCT).
Using the LPC parameters dequantized by the dequantization section (409), the decoded time domain residual signal {tilde over (S)}r(n) is processed by the LPC synthesis filter (412) to obtain the decoded time domain signal {tilde over (S)}(n).
FIG. 5 illustrates the process of split multi-rate lattice VQ. In this process, the input spectrum S(f) is split into a number of 8-dimensional blocks (or vectors) (501), and each block (or vector) is quantized by the multi-rate lattice vector quantization method (502). In the quantization step, a global gain is first calculated according to the bits available and the energy level of the whole spectrum. Then, for each block (or vector), the ratio between the original spectrum and the global gain is quantized using one of several codebooks. The quantization parameters of split multi-rate lattice VQ are the quantization index of the global gain, a codebook indication for each block (or vector) and a code vector index for each block (or vector).
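The splitting and per-block quantization steps can be sketched as below. Several things are assumptions here: the global gain is passed in as a given value rather than derived from the bit budget, the spectrum length is taken to be a multiple of 8, and each scaled block is simply mapped to its nearest RE8 point using the well-known rounding procedure for D-type lattices. Codebook selection, Voronoi extension and index computation are left out.

```python
import math

def nearest_2d8(x):
    """Nearest point of 2*D8 (even integers, sum divisible by 4) to x."""
    half = [v / 2.0 for v in x]
    rounded = [math.floor(h + 0.5) for h in half]
    if sum(rounded) % 2 != 0:
        # Wrong parity: re-round the coordinate with the largest rounding
        # error in the opposite direction.
        i = max(range(len(x)), key=lambda j: abs(half[j] - rounded[j]))
        rounded[i] += 1 if half[i] >= rounded[i] else -1
    return [2 * r for r in rounded]

def nearest_re8(x):
    """Nearest RE8 point: best of the 2D8 candidate and the shifted-coset candidate."""
    y0 = nearest_2d8(x)
    y1 = [v + 1 for v in nearest_2d8([xi - 1.0 for xi in x])]
    d0 = sum((a - b) ** 2 for a, b in zip(x, y0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, y1))
    return y0 if d0 <= d1 else y1

def split_lattice_vq(spectrum, global_gain):
    """Split into 8-dimensional blocks (501) and quantize S(f)/G per block (502)."""
    blocks = [spectrum[i:i + 8] for i in range(0, len(spectrum), 8)]
    return [nearest_re8([v / global_gain for v in block]) for block in blocks]

# Illustrative input: two blocks, hypothetical gain of 0.5.
spectrum = [1.3, -0.9, 0.2, 0.1, 0.0, -0.1, 0.05, 0.0] + [0.01] * 8
quantized_blocks = split_lattice_vq(spectrum, global_gain=0.5)
```

A larger global gain pushes more blocks towards the null vector, which costs fewer bits; a smaller gain yields lattice points further from the origin, which need larger codebooks.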
FIG. 6 summarizes the list of codebooks of split multi-rate lattice VQ adopted in AMR-WB+ (see NPL 6). In the table, the codebooks Q0, Q2, Q3 and Q4 are the base codebooks. When a given lattice point is not included in these base codebooks, the Voronoi extension (see NPL 7) is applied, using only the Q3 or Q4 part of the base codebook. For example, in the table, Q5 is the Voronoi extension of Q3, and Q6 is the Voronoi extension of Q4.
Each codebook consists of a number of code vectors. The code vector index in the codebook is represented by a number of bits, which is derived from Equation 1 below:

$$N_{bits} = \log_2(N_{cv}) \qquad \text{(Equation 1)}$$

Here, $N_{bits}$ means the number of bits consumed by the code vector index and $N_{cv}$ means the number of code vectors in the codebook.
In the codebook Q0, there is only one vector, the null vector, which means that the quantized value of the vector is 0. Therefore, no bits are required for the code vector index.
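Equation 1 and the Q0 case can be checked with a small helper. The codebook sizes passed in below are illustrative inputs, not values read from FIG. 6, and rounding up for sizes that are not powers of two is an assumption added here so that the function always returns a whole number of bits.

```python
def code_vector_index_bits(n_cv):
    """Bits needed to index n_cv code vectors, i.e. log2(n_cv) rounded up."""
    if n_cv < 1:
        raise ValueError("a codebook must contain at least one code vector")
    # For n_cv = 2**k this equals k exactly; (n_cv - 1).bit_length() avoids
    # floating-point log2 and gives 0 for a single-vector codebook.
    return (n_cv - 1).bit_length()

null_codebook_bits = code_vector_index_bits(1)     # single null vector, as in Q0
example_bits = code_vector_index_bits(256)         # illustrative size
```

The single-vector case returns 0, matching the statement that Q0 needs no bits for the code vector index.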
There are three sets of quantization parameters for split multi-rate lattice VQ: the index of the global gain, the indications of the codebooks and the indices of the code vectors. The bitstream is normally formed in one of two ways. The first method is illustrated in FIG. 7, and the second method is illustrated in FIG. 8.
In FIG. 7, the input signal S(f) is first split into a number of vectors. Then a global gain is derived according to the bits available and the energy level of the spectrum. The global gain is quantized by a scalar quantizer and S(f)/G is quantized by the multi-rate lattice vector quantizer. When the bitstream is formed, the index of the global gain forms the first portion, all the codebook indications are grouped together to form the second portion and all the indices of the code vectors are grouped together to form the last portion.
In FIG. 8, the input signal S(f) is first split into a number of vectors. Then a global gain is derived according to the bits available and the energy level of the spectrum. The global gain is quantized by a scalar quantizer and S(f)/G is quantized by the multi-rate lattice vector quantizer. When the bitstream is formed, the index of the global gain forms the first portion, and the second portion consists of, for each vector in turn, the codebook indication followed by the code vector index.
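The two bitstream layouts differ only in the order in which the same fields are written. The sketch below represents each field as a string of '0' and '1' characters. The field contents and widths are hypothetical and chosen only to make the ordering visible.

```python
def pack_grouped(gain_index, codebook_indications, code_vector_indices):
    """FIG. 7 style: gain index, then all codebook indications, then all code vector indices."""
    return gain_index + "".join(codebook_indications) + "".join(code_vector_indices)

def pack_interleaved(gain_index, codebook_indications, code_vector_indices):
    """FIG. 8 style: gain index, then (indication, index) pairs vector by vector."""
    per_vector = (cb + cv for cb, cv in zip(codebook_indications, code_vector_indices))
    return gain_index + "".join(per_vector)

# Hypothetical fields for two vectors; the second vector has an empty index,
# as would be the case for a codebook holding only the null vector.
gain_index = "11"
codebook_indications = ["10", "0"]
code_vector_indices = ["0101", ""]

grouped = pack_grouped(gain_index, codebook_indications, code_vector_indices)
interleaved = pack_interleaved(gain_index, codebook_indications, code_vector_indices)
```

Both layouts carry the same number of bits. In the grouped layout the decoder reads every codebook indication before any code vector index, while in the interleaved layout it can finish one vector before moving to the next.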