For audio and speech coding, transform coding and linear predictive coding are two major coding methods. The transform coding and linear predictive coding will be described below.
(1) Transform Coding
Transform coding transforms a time domain signal into a spectral domain signal by using Discrete Fourier Transform (DFT), Modified Discrete Cosine Transform (MDCT) or the like, and quantizes and encodes individual spectral coefficients.
In quantization or coding processing, a psychoacoustic model is generally applied to determine the perceptual importance of individual spectral coefficients, and the spectral coefficients are quantized or encoded according to their perceptual importance. Transform coding is effective for music or general audio signals. Examples of transform codecs include MP3 (MPEG Audio Layer 3), AAC (Advanced Audio Coding) (see NPL 1), and Dolby AC3 (Audio Code number 3).
A simple configuration of a transform codec is illustrated in FIGS. 1A and 1B. In encoder 10 illustrated in FIG. 1A, time-frequency transform section 11 transforms time domain signal S(n) to frequency domain signal S(f) using a time-frequency transform method such as DFT or MDCT and outputs frequency domain signal S(f) to psychoacoustic model analysis section 12 and quantization section 13.
Psychoacoustic model analysis section 12 performs a psychoacoustic model analysis on frequency domain signal S(f) to obtain a masking curve.
Further, quantization section 13 quantizes frequency domain signal S(f) according to the masking curve in order to make the quantization noise inaudible.
The individual quantized parameters are multiplexed by multiplexing section 14 and sent as bit-stream information to the decoder side.
In decoder 20 illustrated in FIG. 1B, all the bit-stream information sent from the encoder side is demultiplexed by demultiplexing section 21. The demultiplexed quantized parameters are de-quantized by de-quantization section 22 and decoded into frequency domain signal S˜(f). Although tildes (wavy symbols) “˜” are added over symbols “S” in the accompanying drawings, tildes are added to the right side of symbols “S” in this description because of the limitations of notation. A tilde as used herein indicates a signal obtained as a result of decoding.
Decoded frequency domain signal S˜(f) is transformed to time domain signal S˜(n) by frequency-time transform section 23 using a frequency-time transform method such as Inverse Discrete Fourier Transform (IDFT) or Inverse Modified Discrete Cosine Transform (IMDCT).
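The encoder and decoder flow of FIGS. 1A and 1B can be sketched roughly as follows. This is a minimal illustration only: it uses a naive DFT, and a single uniform step size stands in for the masking curve that a psychoacoustic model would supply. The function names and the step parameter are assumptions made for this example and do not come from any codec.

```python
import cmath

def dft(x):
    """Naive DFT: time domain S(n) to frequency domain S(f)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Naive inverse DFT; the input is assumed to come from a real signal."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def encode(signal, step):
    """Transform, then quantize each coefficient to integer indexes.

    A real codec would derive a per-coefficient step from the masking
    curve; one fixed step is a simplification made here.
    """
    return [(round(c.real / step), round(c.imag / step)) for c in dft(signal)]

def decode(indexes, step):
    """De-quantize the indexes, then transform back to the time domain."""
    spec = [complex(re * step, im * step) for re, im in indexes]
    return idft(spec)

if __name__ == "__main__":
    s = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
    s_dec = decode(encode(s, 0.01), 0.01)
    print(max(abs(a - b) for a, b in zip(s, s_dec)))
```

A smaller step gives a smaller reconstruction error at the cost of larger indexes, which is the trade-off that the masking curve is meant to steer.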
(2) Linear Predictive Coding
Linear predictive coding utilizes the predictable nature of speech signals in the time domain to obtain a residual signal (or an excitation signal) by applying linear prediction to an input speech signal. Especially for a speech signal in a speech range, this linear prediction model can represent speech very efficiently. After the linear prediction, the residual signal is encoded mainly by one of two different methods: TCX (Transform Coded eXcitation) and CELP (Code Excited Linear Prediction). TCX and CELP will be described below.
(2-1) TCX
In TCX (see NPL 2), a residual signal is encoded efficiently in the frequency domain. Examples of TCX codecs include 3GPP AMR-WB+ (Extended Adaptive Multi-Rate Wideband) and MPEG USAC (Unified Speech and Audio Coding).
A simple configuration of a TCX codec is illustrated in FIGS. 2A and 2B. In encoder 30 illustrated in FIG. 2A, LPC analysis is performed on input signal S(n) by LPC analysis section 31 to utilize the predictable nature of signals in time domain.
The individual LPC parameters are quantized by quantization section 32, and quantization indexes are outputted to de-quantization section 33 and multiplexing section 37.
The quantization indexes are de-quantized by de-quantization section 33 to reconstruct the LPC parameters.
In addition, LPC inverse filtering using the reconstructed LPC parameters is applied to input signal S(n) by LPC inverse filter section 34, thereby obtaining time domain residual signal Sr(n).
Time domain residual signal Sr(n) is transformed to frequency domain residual signal Sr(f) by time-frequency transform section 35 using a time-frequency transform method such as DFT or MDCT.
Frequency domain residual signal Sr(f) is quantized by quantization section 36, and the individual quantized parameters are outputted to multiplexing section 37.
The quantization indexes outputted from quantization section 32 and the respective quantization parameters outputted from quantization section 36 are multiplexed by multiplexing section 37 and sent to the decoder side as bit-stream information.
In decoder 40 illustrated in FIG. 2B, all the bit-stream information sent from the encoder side is demultiplexed by demultiplexing section 41 into the quantization indexes and the quantization parameters. The demultiplexed quantization indexes are outputted to de-quantization section 44, and the demultiplexed quantization parameters are outputted to de-quantization section 42.
The demultiplexed quantization parameters are de-quantized by de-quantization section 42 and decoded into frequency domain residual signal S˜r(f), and decoded frequency domain residual signal S˜r(f) is transformed to time domain residual signal S˜r(n) by frequency-time transform section 43 using a frequency-time transform method such as IDFT or IMDCT.
On the other hand, the demultiplexed quantization indexes are de-quantized by de-quantization section 44 to obtain the LPC parameters.
Time domain residual signal S˜r(n) is processed using the LPC parameters by LPC synthesis filter section 45 to obtain time domain signal S˜(n).
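The relationship between the LPC inverse filter on the encoder side and the LPC synthesis filter on the decoder side can be sketched as follows. This is an illustrative assumption rather than the method of any particular codec: the LPC parameters are estimated with the autocorrelation method and the Levinson-Durbin recursion, and the quantization of the LPC parameters and the frequency domain coding of the residual are omitted, so the synthesis filter recovers the input up to rounding error.

```python
def autocorrelation(x, order):
    """Autocorrelation values r[0..order] of one frame."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve for A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

def lpc_inverse_filter(x, a):
    """Residual Sr(n) = sum over j of a[j] * S(n - j), zero history."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]

def lpc_synthesis_filter(residual, a):
    """Inverse of the filter above: rebuilds the signal from the residual."""
    y = []
    for n, r_n in enumerate(residual):
        y.append(r_n - sum(a[j] * y[n - j]
                           for j in range(1, len(a)) if n - j >= 0))
    return y

if __name__ == "__main__":
    frame = [((n % 12) - 5.5) / 6 for n in range(48)]
    coeffs = levinson_durbin(autocorrelation(frame, 4), 4)
    res = lpc_inverse_filter(frame, coeffs)
    rebuilt = lpc_synthesis_filter(res, coeffs)
    print(max(abs(p - q) for p, q in zip(frame, rebuilt)))
```

For a predictable input the residual carries less energy than the input, which is why the residual rather than the signal itself is passed on to the quantizer.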
(2-2) CELP
In CELP, a residual signal is quantized using a prescribed codebook. To further enhance the speech quality, a difference signal between the original signal and the signal after LPC synthesis is often transformed into the frequency domain and encoded. Examples of CELP codecs include ITU-T G.729.1 (see NPL 3) and ITU-T G.718 (see NPL 4).
A simple configuration of layer coding (or embedded coding) of CELP and transform coding is illustrated in FIGS. 3A and 3B. In encoder 50 illustrated in FIG. 3A, to utilize the predictable nature of signals in time domain, CELP encoding is performed on input signal S(n) by CELP encoding section 51, and CELP parameters are outputted to CELP local decoding section 52 and multiplexing section 55.
The CELP parameters are decoded by CELP local decoding section 52 to obtain synthesized signal Ssyn(n). Prediction error signal Se(n) is obtained by subtracting synthesized signal Ssyn(n) from input signal S(n).
Time domain prediction error signal Se(n) is transformed to frequency domain prediction error signal Se(f) by time-frequency transform section 53 using a time-frequency transform method such as DFT or MDCT.
Frequency domain prediction error signal Se(f) is quantized by quantization section 54, and respective quantization parameters are outputted to multiplexing section 55.
The CELP parameters outputted from CELP encoding section 51 and the respective quantization parameters outputted from quantization section 54 are multiplexed by multiplexing section 55 and sent as bit-stream information to the decoder side.
In decoder 60 illustrated in FIG. 3B, all the bit-stream information sent from the encoder side is demultiplexed by demultiplexing section 61 into the CELP parameters and the individual quantization parameters. The demultiplexed CELP parameters are outputted to CELP decoding section 64, and the demultiplexed quantization parameters are outputted to de-quantization section 62.
The demultiplexed quantization parameters are de-quantized by de-quantization section 62 and decoded into frequency domain prediction error signal S˜e(f), and decoded frequency domain prediction error signal S˜e(f) is transformed to time domain prediction error signal S˜e(n) by frequency-time transform section 63 using a frequency-time transform method such as IDFT or IMDCT.
On the other hand, the demultiplexed CELP parameters are decoded by CELP decoding section 64 to obtain synthesized signal Ssyn(n).
Time domain signal S˜(n) is obtained by adding prediction error signal S˜e(n) and synthesized signal Ssyn(n).
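The layered structure of FIGS. 3A and 3B can be sketched as follows. To keep the example short, a coarse scalar quantizer is used as a hypothetical stand-in for the CELP core layer, and the prediction error is quantized directly in the time domain instead of being transformed first. Only the data flow is meant to be representative: the encoder codes the error left by the core layer, and the decoder adds the decoded error to the core synthesis.

```python
def quantize(x, step):
    """Uniform scalar quantizer returning integer indexes."""
    return [round(v / step) for v in x]

def dequantize(indexes, step):
    return [i * step for i in indexes]

def encode(signal, core_step, enh_step):
    # Core layer (stand-in for CELP encoding and local decoding).
    core_idx = quantize(signal, core_step)
    synth = dequantize(core_idx, core_step)
    # Prediction error Se(n) = S(n) - Ssyn(n), coded by the second layer.
    error = [s - y for s, y in zip(signal, synth)]
    enh_idx = quantize(error, enh_step)
    return core_idx, enh_idx

def decode(core_idx, enh_idx, core_step, enh_step):
    synth = dequantize(core_idx, core_step)      # Ssyn(n)
    error = dequantize(enh_idx, enh_step)        # decoded error signal
    return [e + y for e, y in zip(error, synth)]  # decoded output signal

if __name__ == "__main__":
    s = [0.30, -0.72, 0.11, 0.95]
    core, enh = encode(s, 0.5, 0.05)
    print(decode(core, enh, 0.5, 0.05))
```

Because the two layers are separable, a decoder that receives only the core indexes can still produce the coarser synthesized signal, which is the property that makes the scheme embedded.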
(3) Split Multi-Rate Lattice Vector Quantization
Encoding in transform coding and linear prediction coding generally utilizes some kind of quantization method. One such quantization method is split multi-rate lattice vector quantization (hereinafter referred to as “split multi-rate lattice VQ” as appropriate) (or algebraic vector quantization) (see NPL 5).
In AMR-WB+ (see NPL 6), split multi-rate lattice VQ is used to quantize an LPC residual in the TCX domain. Also, in the newly standardized speech codec ITU-T G.718, split multi-rate lattice VQ is used to quantize an LPC residual in the MDCT domain as the third residue coding layer.
Split multi-rate lattice VQ is a vector quantization method based on lattice quantizers. Specifically, in the case of the split multi-rate lattice VQ used in AMR-WB+, the spectrum is quantized in blocks of 8 spectral coefficients using vector codebooks composed of subsets of the Gosset lattice, referred to as the RE8 lattice (see NPL 5).
All points of a given lattice can be generated from a so-called square generator matrix G of the lattice, as c=s·G (where s is a row vector with integer components and c is the generated lattice point).
To create a vector codebook at a certain rate, only lattice points inside an area (in 8 dimensions) of a given radius are taken. Therefore, multi-rate codebooks are created by taking subsets of lattice points inside areas of different radii.
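The generation of lattice points and the selection of a codebook by radius can be sketched as follows. The matrix G below is one possible generator matrix for the RE8 lattice, taken here as the union of 2D8 and 2D8 shifted by (1, ..., 1); it is an assumption made for this example and is not quoted from the cited references. The membership test and the radius-based enumeration are likewise illustrative and do not reproduce the actual codebooks of any codec.

```python
from itertools import product
from math import isqrt

# One possible generator matrix of RE8 (assumed for illustration).
G = [
    [4, 0, 0, 0, 0, 0, 0, 0],
    [2, 2, 0, 0, 0, 0, 0, 0],
    [2, 0, 2, 0, 0, 0, 0, 0],
    [2, 0, 0, 2, 0, 0, 0, 0],
    [2, 0, 0, 0, 2, 0, 0, 0],
    [2, 0, 0, 0, 0, 2, 0, 0],
    [2, 0, 0, 0, 0, 0, 2, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]

def lattice_point(s):
    """c = s . G for an integer row vector s of length 8."""
    return tuple(sum(s[i] * G[i][j] for i in range(8)) for j in range(8))

def in_re8(c):
    """All components share one parity and the component sum is a multiple of 4."""
    return len({v % 2 for v in c}) == 1 and sum(c) % 4 == 0

def codebook(sq_radius):
    """All RE8 points whose squared Euclidean norm is at most sq_radius."""
    m = isqrt(sq_radius)
    points = []
    for parity in (0, 1):
        values = [v for v in range(-m, m + 1) if v % 2 == parity]
        for c in product(values, repeat=8):
            if sum(c) % 4 == 0 and sum(v * v for v in c) <= sq_radius:
                points.append(c)
    return points

if __name__ == "__main__":
    print(lattice_point((1, -2, 3, 0, 1, 0, -1, 2)))
    print(len(codebook(0)), len(codebook(8)))  # 1 and 241
```

Enlarging the radius from 0 to a squared norm of 8 grows the set from the origin alone to the origin plus the 240 shortest lattice vectors, which illustrates how nested subsets yield codebooks of different rates.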
A simple configuration using split multi-rate lattice VQ in a TCX codec is illustrated in FIGS. 4A and 4B. In encoder 70 illustrated in FIG. 4A, LPC analysis is performed on input signal S(n) by LPC analysis section 71 to utilize the predictable nature of signals in time domain.
The individual LPC parameters generated from the LPC analysis are quantized by quantization section 72, and quantization indexes are outputted to de-quantization section 73 and multiplexing section 77.
The quantization indexes are de-quantized by de-quantization section 73 to reconstruct the LPC parameters.
In addition, LPC inverse filtering using the reconstructed LPC parameters is applied to input signal S(n) by LPC inverse filter section 74, thereby obtaining residual signal Sr(n).
Time domain residual signal Sr(n) is transformed to frequency domain residual signal Sr(f) by time-frequency transform section 75 using a time-frequency transform method such as DFT or MDCT.
Split multi-rate lattice VQ is applied to frequency domain residual signal Sr(f) by split multi-rate lattice VQ section 76, and respective quantized parameters are outputted to multiplexing section 77.
The quantization indexes outputted from quantization section 72 and the respective quantization parameters outputted from split multi-rate lattice VQ section 76 are multiplexed by multiplexing section 77 and sent to the decoder side as bit-stream information.
In decoder 80 illustrated in FIG. 4B, all the bit-stream information sent from the encoder side is demultiplexed by demultiplexing section 81 into the quantization indexes and the quantization parameters.
Split multi-rate lattice inverse VQ is applied to the demultiplexed quantization parameters by split multi-rate lattice inverse VQ section 82 so that the parameters are decoded into frequency domain residual signal S˜r(f), and decoded frequency domain residual signal S˜r(f) is transformed to time domain residual signal S˜r(n) by frequency-time transform section 83 using a frequency-time transform method such as IDFT or IMDCT.
The demultiplexed quantization indexes are de-quantized by de-quantization section 84 to obtain the LPC parameters.
Time domain residual signal S˜r(n) is processed using the LPC parameters by LPC synthesis filter section 85 to obtain time domain signal S˜(n).
FIG. 5 is a block diagram illustrating processing of split multi-rate lattice VQ. In FIG. 5, input spectrum S(f) is divided into some number of 8-dimensional blocks (or vectors) by block dividing section 91, and the divided 8-dimensional blocks are outputted to split multi-rate lattice VQ section 92.
Each of the divided 8-dimensional blocks is quantized by split multi-rate lattice VQ in split multi-rate lattice VQ section 92. In this quantization, first, a global gain is calculated according to the number of available bits and the energy level of the whole spectrum. Then, for each block, the ratio between the original spectrum and the global gain is obtained, and these ratios are quantized by different codebooks.
The obtained individual quantization parameters of split multi-rate lattice VQ are a quantization index of global gain, a codebook indication value for each block, and a code vector index for each block.
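The block division and per-block quantization of FIG. 5 can be sketched as follows. This is a simplified illustration under stated assumptions: the global gain is passed in as a given value instead of being derived from the available bits and the spectrum energy, the nearest RE8 point is found with the well-known re-rounding search over the two cosets 2D8 and 2D8 + (1, ..., 1), and the mapping of lattice points to codebook indication values and code vector indexes is omitted. The function names are hypothetical.

```python
import math

def nearest_d8(x):
    """Nearest point of D8 (integer vectors with an even component sum)."""
    f = [math.floor(v + 0.5) for v in x]
    if sum(f) % 2 != 0:
        # Re-round the component with the largest rounding error the other way.
        k = max(range(len(x)), key=lambda i: abs(x[i] - f[i]))
        f[k] += 1 if x[k] >= f[k] else -1
    return f

def nearest_re8(x):
    """Nearest RE8 point: the better of the candidates from the two cosets."""
    even = [2 * v for v in nearest_d8([v / 2 for v in x])]
    odd = [2 * v + 1 for v in nearest_d8([(v - 1) / 2 for v in x])]

    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))

    return tuple(even if dist(even) <= dist(odd) else odd)

def split_vq(spectrum, gain):
    """Divide into 8-dimensional blocks and quantize each block of S(f)/G."""
    blocks = [spectrum[i:i + 8] for i in range(0, len(spectrum), 8)]
    return [nearest_re8([v / gain for v in block]) for block in blocks]

def split_inverse_vq(points, gain):
    """Rescale the lattice points by the gain and concatenate the blocks."""
    return [gain * v for point in points for v in point]

if __name__ == "__main__":
    spectrum = [0.5] * 8 + [0.0] * 8   # length assumed to be a multiple of 8
    points = split_vq(spectrum, 0.5)
    print(points)
    print(split_inverse_vq(points, 0.5))
```

A larger gain pulls the scaled blocks toward the origin, so more blocks fall on the null vector or on low-rate codebooks; this is how the global gain can be used to fit the bit budget.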
FIG. 6 is an overview of the codebook list of split multi-rate lattice VQ adopted in AMR-WB+ (see NPL 6). In FIG. 6, codebooks Q0, Q2, Q3 and Q4 are base codebooks. When a certain lattice point is not included in these base codebooks, Voronoi extension (see NPL 7) is applied using only the Q3 or Q4 part of the base codebooks. For example, in the table, Q5 is the Voronoi extension of Q3, and Q6 is the Voronoi extension of Q4.
Each codebook consists of a certain number of code vectors, and a code vector index in the codebook is represented by a certain number of bits. This number of bits is obtained by equation 1 as follows:
$$N_{bits} = \log_2(N_{cv}) \qquad \text{(Equation 1)}$$
In equation 1, Nbits denotes the number of bits used to represent a code vector index, and Ncv denotes the number of code vectors in a codebook.
In codebook Q0, there is only one vector, the null vector, which means that the quantized value of the vector is 0. Therefore, there are no bits required for the code vector index.
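Equation 1 can be checked with a few lines of code. The codebook sizes used below other than the single-vector case are arbitrary powers of two chosen for illustration and are not read from FIG. 6; rounding up for sizes that are not powers of two is also an assumption of this sketch.

```python
import math

def index_bits(num_code_vectors):
    """Bits needed for a code vector index: log2 of the codebook size."""
    return math.ceil(math.log2(num_code_vectors))

if __name__ == "__main__":
    for size in (1, 256, 4096, 65536):
        print(size, index_bits(size))
```

A codebook holding only the null vector gives log2(1) = 0, which agrees with the statement that no bits are required for the code vector index of Q0.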
Split multi-rate lattice VQ generates a set of three kinds of quantization parameters: a global gain index, a codebook indication value, and a code vector index. There are two possible methods for forming a bit-stream from these parameters. The first bit-stream forming method is illustrated in FIG. 7, and the second bit-stream forming method is illustrated in FIG. 8. A case where an input spectrum is divided into 6 blocks (v0 to v5) is illustrated here.
In the first bit-stream forming method, global gain G is first quantized by a scalar quantizer (Q in FIG. 7). S(f)/G for each divided block is quantized by a multi-rate lattice vector quantizer (VQ in FIG. 7). As illustrated in FIG. 7, the quantized global gain index is arranged in the first region at the head of the bit-stream. Then, the codebook indication values (Cb0 to Cb5) are arranged in the second region in ascending order of block number, and the code vector indexes are arranged in the following third region, also in ascending order of block number.
In the second bit-stream forming method, global gain G is first quantized by a scalar quantizer (Q in FIG. 8). S(f)/G for each divided block is quantized by a multi-rate lattice vector quantizer (VQ in FIG. 8). As illustrated in FIG. 8, the quantized global gain index is arranged in the first region at the head of the bit-stream. Then, the set of a codebook indication value and a code vector index for each vector is arranged, one set per region, in the second to seventh regions following the first region.
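The difference between the two bit-stream forming methods is only the ordering of the fields, which can be sketched as follows. The field labels ("gain", "Cb", "Cv") and the function names are placeholders chosen for this example; actual bit widths and packing are not modeled.

```python
def layout_grouped(num_blocks):
    """First method: gain, then all codebook indications, then all indexes."""
    indications = [f"Cb{i}" for i in range(num_blocks)]
    indexes = [f"Cv{i}" for i in range(num_blocks)]
    return ["gain"] + indications + indexes

def layout_interleaved(num_blocks):
    """Second method: gain, then one (indication, index) pair per block."""
    fields = ["gain"]
    for i in range(num_blocks):
        fields += [f"Cb{i}", f"Cv{i}"]
    return fields

if __name__ == "__main__":
    print(layout_grouped(6))
    print(layout_interleaved(6))
```

Both layouts carry the same 13 fields for 6 blocks; only their positions in the stream differ.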