1. Field of the Invention
The present invention relates to an improved technique for digitally encoding a sound signal, in particular but not exclusively a speech signal, in view of transmitting and synthesizing this sound signal. More specifically, the present invention is concerned with a method and device for vector quantizing linear prediction parameters in variable bit rate linear prediction based coding.
2. Brief Description of the Prior Techniques
2.1 Speech Coding and Quantization of Linear Prediction (LP) Parameters:
Digital voice communication systems such as wireless systems use speech encoders to increase capacity while maintaining high voice quality. A speech encoder converts a speech signal into a digital bitstream which is transmitted over a communication channel or stored in a storage medium. The speech signal is digitized, that is, sampled and quantized with usually 16-bits per sample. The speech encoder has the role of representing these digital samples with a smaller number of bits while maintaining a good subjective speech quality. The speech decoder or synthesizer operates on the transmitted or stored bit stream and converts it back to a sound signal.
Digital speech coding methods based on linear prediction analysis have been very successful in low bit rate speech coding. In particular, code-excited linear prediction (CELP) coding is one of the best known techniques for achieving a good compromise between the subjective quality and bit rate. This coding technique is the basis of several speech coding standards both in wireless and wireline applications. In CELP coding, the sampled speech signal is processed in successive blocks of N samples usually called frames, where N is a predetermined number corresponding typically to 10–30 ms. A linear prediction (LP) filter A(z) is computed, encoded, and transmitted every frame. The computation of the LP filter A(z) typically needs a lookahead, which consists of a 5–15 ms speech segment from the subsequent frame. The N-sample frame is divided into smaller blocks called subframes. Usually the number of subframes is three or four resulting in 4–10 ms subframes. In each subframe, an excitation signal is usually obtained from two components, the past excitation and the innovative, fixed-codebook excitation. The component formed from the past excitation is often referred to as the adaptive codebook or pitch excitation. The parameters characterizing the excitation signal are coded and transmitted to the decoder, where the reconstructed excitation signal is used as the input of a LP synthesis filter.
The LP synthesis filter is given by
      H    ⁡          (      z      )        =            1              A        ⁡                  (          z          )                      =          1              1        +                              ∑                          i              =              1                        M                    ⁢                                    a              i                        ⁢                          z                              -                i                                                        where αi are linear prediction coefficients and M is the order of the LP analysis. The LP synthesis filter models the spectral envelope of the speech signal. At the decoder, the speech signal is reconstructed by filtering the decoded excitation through the LP synthesis filter.
The set of linear prediction coefficients αi are computed such that the prediction errore(n)=s(n)−{tilde over (s)}(n)  (1)is minimized, where s(n) is the input signal at time n and {tilde over (s)}(n) is the predicted signal based on the last M samples given by:
            s      ~        ⁡          (      n      )        =      -                  ∑                  i          =          1                M            ⁢                        a          i                ⁢                  s          ⁡                      (                          n              -              i                        )                              Thus the prediction error is given by:
      e    ⁡          (      n      )        =            s      ⁡              (        n        )              +                  ∑                  i          =          1                M            ⁢                        a          i                ⁢                  s          ⁡                      (                          n              -              i                        )                              This corresponds in the z-tranform domain to:E(z)=S(z)A(z)where A(z) is the LP filter of order M given by:
      A    ⁡          (      z      )        =      1    +                  ∑                  i          =          1                M            ⁢                        a          i                ⁢                  z                      -            i                              Typically, the linear prediction coefficients αi are computed by minimizing the mean-squared prediction error over a block of L samples, L being an integer usually equal to or larger than N (L usually corresponds to 20–30 ms). The computation of linear prediction coefficients is otherwise well known to those of ordinary skill in the art. An example of such computation is given in [ITU-T Recommendation G.722.2 “Wideband coding of speech at around 16 kbit/s using adaptive multi-rate wideband (AMR-WB)”, Geneva, 2002].
The linear prediction coefficients αi cannot be directly quantized for transmission to the decoder. The reason is that small quantization errors on the linear prediction coefficients can produce large spectral errors in the transfer function of the LP filter, and can even cause filter instabilities. Hence, a transformation is applied to the linear prediction coefficients αi prior to quantization. The transformation yields what is called a representation of the linear prediction coefficients αi. After receiving the quantized transformed linear prediction coefficients αi, the decoder can then apply the inverse transformation to obtain the quantized linear prediction coefficients. One widely used representation for the linear prediction coefficients αi is the line spectral frequencies (LSF) also known as line spectral pairs (LSP). Details of the computation of the Line Spectral Frequencies can be found in [ITU-T Recommendation G.729 “Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear prediction (CS-ACELP),” Geneva, March 1996].
A similar representation is the Immitance Spectral Frequencies (ISF), which has been used in the AMR-WB coding standard [ITU-T Recommendation G.722.2 “Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB)”, Geneva, 2002]. Other representations are also possible and have been used. Without loss of generality, the particular case of ISF representation will be considered in the following description.
The so obtained LP parameters (LSFs, ISFs, etc.), are quantized either with scalar quantization (SQ) or vector quantization (VQ). In scalar quantization, the LP parameters are quantized individually and usually 3 or 4 bits per parameter are required. In vector quantization, the LP parameters are grouped in a vector and quantized as an entity. A codebook, or a table, containing the set of quantized vectors is stored. The quantizer searches the codebook for the codebook entry that is closest to the input vector according to a certain distance measure. The index of the selected quantized vector is transmitted to the decoder. Vector quantization gives better performance than scalar quantization but at the expense of increased complexity and memory requirements.
Structured vector quantization is usually used to reduce the complexity and storage requirements of VQ. In split-VQ, the LP parameter vector is split into at least two subvectors which are quantized individually. In multistage VQ the quantized vector is the addition of entries from several codebooks. Both split VQ and multistage VQ result in reduced memory and complexity while maintaining good quantization performance. Furthermore, an interesting approach is to combine multistage and split VQ to further reduce the complexity and memory requirement. In reference [ITU-T Recommendation G.729 “Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear prediction (CS-ACELP),” Geneva, March 1996], the LP parameter vector is quantized in two stages where the second stage vector is split in two subvectors.
The LP parameters exhibit strong correlation between successive frames and this is usually exploited by the use of predictive quantization to improve the performance. In predictive vector quantization, a predicted LP parameter vector is computed based on information from past frames. Then the predicted vector is removed from the input vector and the prediction error is vector quantized. Two kinds of prediction are usually used: auto-regressive (AR) prediction and moving average (MA) prediction. In AR prediction the predicted vector is computed as a combination of quantized vectors from past frames. In MA prediction, the predicted vector is computed as a combination of the prediction error vectors from past frames. AR prediction yields better performance. However, AR prediction is not robust to frame loss conditions which are encountered in wireless and packet-based communication systems. In case of lost frames, the error propagates to consecutive frames since the prediction is based on previous corrupted frames.
2.2 Variable Bit-rate (VBR) Coding:
In several communications systems, for example wireless systems using code division multiple access (CDMA) technology, the use of source-controlled variable bit rate (VBR) speech coding significantly improves the capacity of the system. In source-controlled VBR coding, the encoder can operate at several bit rates, and a rate selection module is used to determine the bit rate used for coding each speech frame based on the nature of the speech frame, for example voiced, unvoiced, transient, background noise, etc. The goal is to attain the best speech quality at a given average bit rate, also referred to as average data rate (ADR). The encoder is also capable of operating in accordance with different modes of operation by tuning the rate selection module to attain different ADRs for the different modes, where the performance of the encoder improves with increasing ADR. This provides the encoder with a mechanism of trade-off between speech quality and system capacity. In CDMA systems, for example CDMA-one and CDMA2000, typically 4 bit rates are used and are referred to as full-rate (FR), half-rate (HR), quarter-rate (QR), and eighth-rate (ER). In this CDMA system, two sets of rates are supported and referred to as Rate Set I and Rate Set II. In Rate Set II, a variable-rate encoder with rate selection mechanism operates at source-coding bit rates of 13.3 (FR), 6.2 (HR), 2.7 (QR), and 1.0 (ER) kbit/s, corresponding to gross bit rates of 14.4, 7.2, 3.6, and 1.8 kbit/s (with some bits added for error detection).
A wideband codec known as adaptive multi-rate wideband (AMR-WB) speech codec was recently selected by the ITU-T (International Telecommunications Union—Telecommunication Standardization Sector) for several wideband speech telephony and services and by 3GPP (Third Generation Partnership Project) for GSM and W-CDMA (Wideband Code Division Multiple Access) third generation wireless systems. An AMR-WB codec consists of nine bit rates in the range from 6.6 to 23.85 kbit/s. Designing an AMR-WB-based source controlled VBR codec for CDMA2000 system has the advantage of enabling interoperation between CDMA2000 and other systems using an AMR-WB codec. The AMR-WB bit rate of 12.65 kbit/s is the closest rate that can fit in the 13.3 kbit/s full-rate of CDMA2000 Rate Set II. The rate of 12.65 kbit/s can be used as the common rate between a CDMA2000 wideband VBR codec and an AMR-WB codec to enable interoperability without transcoding, which degrades speech quality. Half-rate at 6.2 kbit/s has to be added to enable efficient operation in the Rate Set II framework. The resulting codec can operate in few CDMA2000-specific modes, and incorporates a mode that enables interoperability with systems using a AMR-WB codec.
Half-rate encoding is typically chosen in frames where the input speech signal is stationary. The bit savings, compared to full-rate, are achieved by updating encoding parameters less frequently or by using fewer bits to encode some of these encoding parameters. More specifically, in stationary voiced segments, the pitch information is encoded only once a frame, and fewer bits are used for representing the fixed codebook parameters and the linear prediction coefficients.
Since predictive VQ with MA prediction is typically applied to encode the linear prediction coefficients, an unnecessary increase in quantization noise can be observed in these linear prediction coefficients. MA prediction, as opposed to AR prediction, is used to increase the robustness to frame losses; however, in stationary frames the linear prediction coefficients evolve slowly so that using AR prediction in this particular case would have a smaller impact on error propagation in the case of lost frames. This can be seen by observing that, in the case of missing frames, most decoders apply a concealment procedure which essentially extrapolates the linear prediction coefficients of the last frame. If the missing frame is stationary voiced, this extrapolation produces values very similar to the actually transmitted, but not received, LP parameters. The reconstructed LP parameter vector is thus close to what would have been decoded if the frame had not been lost. In this specific case, therefore, using AR prediction in the quantization procedure of the linear prediction coefficients cannot have a very adverse effect on quantization error propagation.