Low-rate coding applications, such as digital speech, typically employ techniques such as Linear Predictive Coding (LPC) to model the spectra of short-term speech signals. Coding systems employing an LPC technique provide prediction residual signals for corrections to characteristics of a short-term model. One such coding system is a speech coding system known as Code Excited Linear Prediction (CELP), which produces high quality synthesized speech at low bit rates, that is, at bit rates of 4.8 to 9.6 kilobits per second (kbps). This class of speech coding, also known as vector-excited linear prediction or stochastic coding, is used in numerous speech communications and speech synthesis applications. CELP is also particularly applicable to digital speech encryption and digital radiotelephone communication systems, wherein speech quality, data rate, size, and cost are significant issues.
A CELP speech coder that implements an LPC coding technique typically employs long-term (pitch) and short-term (formant) predictors that model the characteristics of an input speech signal and that are incorporated in a set of time-varying linear filters. An excitation signal, or codevector, for the filters is chosen from a codebook of stored codevectors. For each frame of speech, the speech coder applies the codevector to the filters to generate a reconstructed speech signal, and compares the original input speech signal to the reconstructed signal to create an error signal. The error signal is then weighted by passing the error signal through a perceptual weighting filter having a response based on human auditory perception. An optimum excitation signal is then determined by selecting one or more codevectors that produce a weighted error signal with a minimum energy (error value) for the current frame. Typically the frame is partitioned into two or more contiguous subframes. The short-term predictor parameters are usually determined once per frame and are updated at each subframe by interpolating between the short-term predictor parameters for the current frame and the previous frame. The excitation signal parameters are typically determined for each subframe.
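The analysis-by-synthesis selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the coder of FIG. 1: the function names are hypothetical, the synthesis filter starts from a zero state, the long-term predictor and the perceptual weighting filter are omitted, and the gain is taken as the unquantized least-squares optimum. The sign convention A(z) = 1 - sum_k a_k z^-k is also an assumption.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis 1/A(z), assuming A(z) = 1 - sum_k lpc[k-1] * z^-k.

    Starts from a zero filter state (an assumption of this sketch).
    """
    out = []
    for n, x in enumerate(excitation):
        y = x
        for k, a in enumerate(lpc):
            if n - 1 - k >= 0:
                y += a * out[n - 1 - k]
        out.append(y)
    return out


def search_codebook(target, codebook, lpc):
    """Return (index, gain, error_energy) of the codevector whose synthesized,
    optimally scaled version has the smallest squared error against target."""
    best = None
    for index, codevector in enumerate(codebook):
        y = synthesize(codevector, lpc)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue  # an all-zero synthesized vector cannot be scaled to fit
        gain = sum(t * v for t, v in zip(target, y)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best
```

Each candidate is synthesized explicitly before its error is measured, which is why the search cost grows with the codebook size.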
For example, FIG. 1 is a block diagram of a CELP coder 100 of the prior art. In CELP coder 100, an input signal s(n) is applied to a linear predictive (LP) analyzer 101, where linear predictive coding is used to estimate a short-term spectral envelope. The resulting spectral coefficients (or linear prediction (LP) coefficients) are denoted by the transfer function A(z). The spectral coefficients are applied to an LP quantizer 102 that quantizes the spectral coefficients to produce quantized spectral coefficients Aq that are suitable for use in a multiplexer 109. The quantized spectral coefficients Aq are then conveyed to multiplexer 109, and the multiplexer produces a coded bitstream based on the quantized spectral coefficients and a set of excitation vector-related parameters L, βi's, I, and γ, that are determined by a squared error minimization/parameter quantization block 108. As a result, for each block of speech, a corresponding set of excitation vector-related parameters is produced, which includes multi-tap long-term predictor (LTP) parameters (lag L and multi-tap predictor coefficients βi's), and fixed codebook parameters (index I and scale factor γ).
The quantized spectral parameters are also conveyed locally to an LP synthesis filter 105 that has a corresponding transfer function 1/Aq(z). LP synthesis filter 105 also receives a combined excitation signal ex(n) and produces an estimate of the input signal ŝ(n) based on the quantized spectral coefficients Aq and the combined excitation signal ex(n). Combined excitation signal ex(n) is produced as follows. A fixed codebook (FCB) codevector, or excitation vector, c̃_I is selected from a fixed codebook (FCB) 103 based on a fixed codebook index parameter I. The FCB codevector c̃_I is then scaled based on the gain parameter γ and the scaled fixed codebook codevector is conveyed to a multi-tap long-term predictor (LTP) filter 104. Multi-tap LTP filter 104 has a corresponding transfer function
$$\frac{1}{1-\sum_{i=-K_1}^{K_2}\beta_i\,z^{-L+i}},\qquad K_1\ge 0,\; K_2\ge 0,\; K=1+K_1+K_2 \tag{1}$$
wherein K is the LTP filter order (typically between 1 and 3, inclusive) and the βi's and L are excitation vector-related parameters that are conveyed to the filter by squared error minimization/parameter quantization block 108. In the above definition of the LTP filter transfer function, L is an integer value specifying the delay in number of samples. This form of LTP filter transfer function is described in a paper by Bishnu S. Atal, “Predictive Coding of Speech at Low Bit Rates,” IEEE Transactions on Communications, VOL. COM-30, NO. 4, April 1982, pp. 600-614 (hereafter referred to as Atal) and in a paper by Ravi P. Ramachandran and Peter Kabal, “Pitch Prediction Filters in Speech Coding,” IEEE Transactions on Acoustics, Speech, and Signal Processing, VOL. 37, NO. 4, April 1989, pp. 467-478 (hereafter referred to as Ramachandran et al.). Filter 104 filters the scaled fixed codebook codevector received from FCB 103 to produce the combined excitation signal ex(n) and conveys the excitation signal to LP synthesis filter 105.
LP synthesis filter 105 conveys the input signal estimate ŝ(n) to a combiner 106. Combiner 106 also receives input signal s(n) and subtracts the estimate of the input signal ŝ(n) from the input signal s(n). The difference between input signal s(n) and input signal estimate ŝ(n) is applied to a perceptual error weighting filter 107, which filter produces a perceptually weighted error signal e(n) based on the difference between ŝ(n) and s(n) and a weighting function W(z). Perceptually weighted error signal e(n) is then conveyed to squared error minimization/parameter quantization block 108. Squared error minimization/parameter quantization block 108 uses the error signal e(n) to determine an error value E (typically $E=\sum_{n} e^{2}(n)$) and, subsequently, an optimal set of excitation vector-related parameters L, βi's, I, and γ that produce the best estimate ŝ(n) of the input signal s(n) based on the minimization of E. The quantized LP coefficients and the optimal set of parameters L, βi's, I, and γ are then conveyed over a communication channel to a receiving communication device, where a speech synthesizer uses the LP coefficients and excitation vector-related parameters to reconstruct the estimate of the input speech signal ŝ(n). An alternate use may involve efficient storage to an electronic or electromechanical device, such as a computer hard disk.
In a CELP coder such as coder 100, a synthesis function for generating the CELP coder combined excitation signal ex(n) is given by the following generalized difference equation:
$$\mathrm{ex}(n)=\gamma\,\tilde{c}_I(n)+\sum_{i=-K_1}^{K_2}\beta_i\,\mathrm{ex}(n-L+i),\qquad n=0,\ldots,N-1,\; K_1\ge 0,\; K_2\ge 0 \tag{1a}$$
where ex(n) is a synthetic combined excitation signal for a subframe, c̃_I(n) is a codevector, or excitation vector, selected from a codebook, such as FCB 103, I is an index parameter, or codeword, specifying the selected codevector, γ is the gain for scaling the codevector, ex(n−L+i) is a synthetic combined excitation signal delayed by L (integer resolution) samples relative to the (n+i)-th sample of the current subframe (for voiced speech L is typically related to the pitch period), βi's are the long term predictor (LTP) filter coefficients, and N is the number of samples in the subframe. When n−L+i<0, ex(n−L+i) contains the history of past synthetic excitation, constructed as shown in eqn. (1a). That is, for n−L+i<0, the expression ‘ex(n−L+i)’ corresponds to an excitation sample constructed prior to the current subframe, which excitation sample has been delayed and scaled pursuant to an LTP filter transfer function
$$\frac{1}{1-\sum_{i=-K_1}^{K_2}\beta_i\,z^{-L+i}},\qquad K_1\ge 0,\; K_2\ge 0,\; K=1+K_1+K_2 \tag{2}$$
The task of a typical CELP speech coder such as coder 100 is to select the parameters specifying the synthetic excitation, that is, the parameters L, βi's, I, γ in coder 100, given ex(n) for n<0 and the determined coefficients of short-term Linear Predictor (LP) filter 105, so that when the synthetic excitation sequence ex(n) for 0≤n<N is filtered through LP filter 105, the resulting synthesized speech signal ŝ(n) most closely approximates, according to a distortion criterion employed, the input speech signal s(n) to be coded for that subframe.
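The generalized difference equation (1a) can be evaluated directly once the parameters are known. The following Python sketch is illustrative only; the function name, the argument layout, and the requirement L > K2 (so that every tap refers to an already available sample) are assumptions of the sketch rather than statements from the cited papers.

```python
def combined_excitation(history, codevector, gamma, betas, lag, k1):
    """Evaluate eqn. (1a) for one subframe.

    history    -- past excitation ex(n) for n < 0, oldest sample first
    codevector -- fixed codebook vector c_I(n), n = 0..N-1
    gamma      -- fixed codebook gain
    betas      -- LTP coefficients [beta_{-K1}, ..., beta_{K2}]
    lag        -- integer delay L
    k1         -- K1, the number of taps with a negative index i
    """
    k2 = len(betas) - 1 - k1
    if lag <= k2:
        raise ValueError("sketch assumes L > K2 so each tap uses a past sample")
    if len(history) < lag + k1:
        raise ValueError("history must hold at least L + K1 samples")
    buf = list(history)
    start = len(buf)  # buffer position of ex(0)
    for n, c in enumerate(codevector):
        acc = gamma * c
        for j, beta in enumerate(betas):
            i = j - k1  # tap index i runs from -K1 to K2
            acc += beta * buf[start + n - lag + i]
        buf.append(acc)  # ex(n) joins the filter state for later samples
    return buf[start:]
```

With a single coefficient (`betas=[beta]`, `k1=0`) the same routine reduces to a 1st order integer-delay LTP filter.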
When the LTP filter order K>1, the LTP filter as defined in eqn. (1) is a multi-tap filter. A conventional integer-sample resolution delay multi-tap LTP filter, as described, seeks to predict a given sample as a weighted sum of K, usually adjacent, delayed samples, where the delay is confined to a range of expected pitch period values (typically between 20 and 147 samples at 8 kHz signal sampling rate). An integer-sample resolution delay (L) multi-tap LTP filter has the ability to implicitly model non-integer values of delay while simultaneously providing spectral shaping (Atal, Ramachandran et al.). A multi-tap LTP filter requires quantization of the K unique βi coefficients, in addition to L. If K=1, a 1st order LTP filter results, requiring quantization of only a single β0 coefficient and L. However, a 1st order LTP filter, using integer-sample resolution delay L, does not have the ability to implicitly model a non-integer delay value, other than rounding it to the nearest integer or an integer multiple of a non-integral delay. Neither does it provide spectral shaping. Nevertheless, 1st order LTP filter implementations have been commonly used, because only two parameters, L and β, need to be quantized, a consideration for many low-bit rate speech coder implementations.
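As a worked illustration of the implicit non-integer delay (a sketch that assumes linear-interpolation weights and a hypothetical fraction f; neither is taken from the cited papers), take K=2 with K1=0, K2=1 and set β0=β(1−f), β1=βf for 0≤f<1. The prediction term of eqn. (1a) then becomes

$$\sum_{i=0}^{1}\beta_i\,\mathrm{ex}(n-L+i)=\beta\left[(1-f)\,\mathrm{ex}(n-L)+f\,\mathrm{ex}(n-L+1)\right]\approx\beta\,\mathrm{ex}\bigl(n-(L-f)\bigr),$$

that is, an effective delay of L−f samples. For f=1/2 the two taps have the frequency response

$$\tfrac{1}{2}\left(1+e^{j\omega}\right)=e^{j\omega/2}\cos(\omega/2),$$

whose magnitude cos(ω/2) falls to zero at ω=π. In this example the coefficients that set the fractional delay also impose a low-pass spectral shape, so the two effects cannot be chosen independently.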
The introduction of the 1st order LTP filter, using a sub-sample resolution delay, significantly advanced the state-of-the-art of LTP filter design. This technique is described in U.S. Pat. No. 5,359,696, “Digital Speech Coder Having Improved Sub-sample Resolution Long-Term Predictor,” by Ira A. Gerson and Mark A. Jasiuk (hereafter referred to as Gerson et al.) and also in a textbook chapter by Peter Kroon and Bishnu S. Atal, “On Improving the Performance of Pitch Predictors in Speech Coding Systems,” Advances in Speech Coding, Kluwer Academic Publishers, 1991, Chapter 30, pp. 321-327 (hereafter referred to as Kroon et al.). Using this technique, the value of delay is explicitly represented with sub-sample resolution, redefined here as L̂. Samples delayed by L̂ may be obtained by using an interpolation filter. To compute samples delayed by values of L̂ having different fractional parts, the interpolation filter phase that provides the closest representation of the desired fractional part may be selected, and the sub-sample resolution delayed sample is then generated by filtering with the interpolation filter coefficients corresponding to the selected phase. Such a 1st order LTP filter, which explicitly uses a sub-sample resolution delay, is able to provide predicted samples with sub-sample resolution, but lacks the ability to provide spectral shaping. Nevertheless, it has been shown (Kroon et al.) that a 1st order LTP filter, with a sub-sample resolution delay, can more efficiently remove the long-term signal correlation than a conventional integer-sample resolution delay multi-tap LTP filter. 
Being a 1st order LTP filter, only two parameters need to be conveyed from the encoder to the decoder: β and L̂, resulting in improved quantization efficiency relative to an integer-resolution delay multi-tap LTP filter, which requires quantization of L and K unique βi coefficients. Consequently, the 1st order sub-sample resolution form of the LTP filter is the most widely used in current CELP-type speech coding algorithms. The LTP filter transfer function for this filter is given by
$$\frac{1}{1-\beta\,z^{-\hat{L}}} \tag{3}$$
with the corresponding difference equation given by:
$$\mathrm{ex}(n)=\gamma\,\tilde{c}_I(n)+\beta\,\mathrm{ex}(n-\hat{L}),\qquad 0\le n<N \tag{4}$$
Implicit in equations (3) and (4) is the use of an interpolation filter to compute samples pointed to by the sub-sample resolution delay L̂.
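How an interpolation filter phase can be selected from the fractional part of the delay may be sketched as follows. The sketch is hypothetical in its details: the Hann-windowed sinc kernel, the quarter-sample resolution, the filter length, and the function names are all assumptions, and the code presumes that every sample around the interpolation point is already available.

```python
import math


def interpolation_taps(frac, half_len=8):
    """Hann-windowed sinc weights for samples x[m + j], j = -half_len+1..half_len,
    approximating x(m + frac) for 0 <= frac < 1 (one phase of the interpolator)."""
    taps = []
    for j in range(-half_len + 1, half_len + 1):
        u = j - frac
        sinc = 1.0 if u == 0.0 else math.sin(math.pi * u) / (math.pi * u)
        window = 0.5 * (1.0 + math.cos(math.pi * u / half_len))
        taps.append(sinc * window)
    return taps


def delayed_sample(signal, n, lag_steps, resolution=4, half_len=8):
    """Approximate signal(n - lag_steps / resolution).

    lag_steps is the delay expressed in units of 1/resolution sample, so the
    fractional part of the delay directly selects the interpolation phase.
    """
    position = n * resolution - lag_steps    # target time in 1/resolution units
    m, phase = divmod(position, resolution)  # integer sample index and phase
    taps = interpolation_taps(phase / resolution, half_len)
    first = m - half_len + 1
    if first < 0 or m + half_len >= len(signal):
        raise IndexError("not enough samples around the interpolation point")
    return sum(h * signal[first + k] for k, h in enumerate(taps))
```

Because the taps depend only on the phase, a decoder could rebuild them from the received delay alone; only the delay and β would need to be transmitted.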
FIG. 2 shows the inherent differences between the multi-tap LTP (shown in FIG. 1) and the LTP with sub-sample resolution, as described above. In coder 200, LTP 204 requires only two parameters (β, L̂) from the error minimization/parameter quantization block 208, which subsequently conveys parameters L̂, β, I, γ to multiplexer 109.
Note that in describing the LTP filter, a generalized form of the LTP filter transfer function has been given. ex(n) for values of n<0 contains the LTP filter state. For values of L or L̂ that necessitate access to samples of ex(n) for n≥0 when evaluating ex(n) in eqn. (1a) or (4), a simplified and non-equivalent form of the LTP filter, called a virtual codebook or an adaptive codebook (ACB), is often used; it is described in more detail later. This technique is described in U.S. Pat. No. 4,910,781 by Richard H. Ketchum, Willem B. Kleijn, and Daniel J. Krasinski, titled “Code Excited Linear Predictive Vocoder Using Virtual Searching,” (hereafter referred to as Ketchum et al.). The term “LTP filter,” strictly speaking, refers to a direct implementation of eqn. (1a) or (4), but as used in this application it may also refer to an ACB implementation of the LTP filter. In the instances when this distinction is important to the description of the prior art and the current invention, it will explicitly be made.
The graphical representation of an ACB implementation can be seen in FIG. 3. When the value of the sub-sample resolution filter delay L̂ is greater than the subframe length N, FIGS. 2 and 3 are generally equivalent. In this case, the ACB memory 310 and LTP filter 204 memory contain essentially the same data. When the filter delay is less than the length of a subframe, however, the scaled FCB excitation and LTP filter memory are re-circulated through the LTP memory 204 and are subject to recursive scaling iterations by the β coefficient. In the ACB implementation 310, the ACB vector is circulated using a unity gain long-term filter of the form:
$$\mathrm{ex}(n)=\mathrm{ex}(n-\hat{L}),\qquad 0\le n<N \tag{4a}$$
and then letting c_0(n)=ex(n), 0≤n<N, which is subsequently scaled by a single, non-recursive instance of the β coefficient.
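The difference between the direct LTP recursion and the ACB form can be seen in a short Python sketch. For brevity it assumes an integer delay (no interpolation filter), a delay of at least one sample that does not exceed the history length, and hypothetical function names; `scaled_fcb` stands for the already scaled fixed codebook contribution.

```python
def ltp_direct(history, scaled_fcb, beta, lag):
    """Direct recursion: ex(n) = scaled_fcb(n) + beta * ex(n - lag)."""
    buf = list(history)
    start = len(buf)  # buffer position of ex(0)
    for n, c in enumerate(scaled_fcb):
        buf.append(c + beta * buf[start + n - lag])
    return buf[start:]


def acb_excitation(history, scaled_fcb, beta, lag):
    """ACB form: build c0(n) with the unity gain recirculation of eqn. (4a),
    then apply beta once, non-recursively."""
    buf = list(history)
    start = len(buf)
    for n in range(len(scaled_fcb)):
        buf.append(buf[start + n - lag])  # ex(n) = ex(n - lag), unity gain
    c0 = buf[start:]
    return [c + beta * v for c, v in zip(scaled_fcb, c0)]
```

With a six-sample history and a four-sample subframe, a delay of 5 gives identical outputs from both functions, whereas a delay of 2 does not: the direct form recirculates samples that already contain β and the FCB contribution, while the ACB form repeats the unscaled history.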
Considering the two methods of implementing an LTP filter discussed above, namely an integer-resolution delay multi-tap LTP filter and a 1st order sub-sample resolution delay LTP filter, each capable of being implemented directly (100, 200) or via the ACB method (300), the following observations can be made:
The conventional multi-tap predictor performs two tasks simultaneously: spectral shaping and implicit modeling of a non-integer delay through generating a predicted sample as a weighted sum of samples used for the prediction (Atal, Ramachandran et al.). In the conventional multi-tap LTP filter, the two tasks (spectral shaping and the implicit modeling of non-integer delay) are not efficiently modeled together. For example, a 3rd order multi-tap LTP filter, if no spectral shaping for a given subframe is required, would implicitly model the delay with non-integer resolution. However, the order of such a filter is not sufficiently high to provide a high quality interpolated sample value.
The 1st order sub-sample resolution LTP filter, on the other hand, can explicitly use a fractional part of the delay to select a phase of an interpolating filter of arbitrary order and thus very high quality. This method, where the sub-sample resolution delay is explicitly defined and used, provides a very efficient way of representing interpolation filter coefficients. Those coefficients do not need to be explicitly quantized and transmitted, but may instead be inferred from the delay received, where that delay is specified with sub-sample resolution. While such a filter does not have the ability to introduce spectral shaping, for voiced (quasi-periodic) speech it has been found that the effect of defining the delay with sub-sample resolution is more important than the ability to introduce spectral shaping (Kroon et al.). These are some of the reasons why a 1st order LTP filter, with sub-sample resolution delay, can be more efficient than a conventional multi-tap LTP filter, and is widely used in numerous industry standards.
While a sub-sample resolution 1st order LTP filter provides a very efficient model for an LTP filter, it may be desirable to provide a mechanism to do spectral shaping, a property which a sub-sample resolution 1st order LTP filter lacks. The speech signal harmonic structure tends to weaken at higher frequencies. This effect becomes more pronounced for wideband speech coding systems, characterized by increased signal bandwidth (relative to narrow-band signals). In wideband speech coding systems, a signal bandwidth of up to 8 kHz may be achieved (given a 16 kHz sampling frequency) compared to the 4 kHz maximum achievable bandwidth for narrow-band speech coding systems (given an 8 kHz sampling frequency). One method of adding spectral shaping is described in PCT publication WO 00/25298 by Bruno Bessette, Redwan Salami, and Roch Lefebvre, titled “Pitch Search in Coding Wideband Signals,” (hereafter referred to as Bessette et al.). This approach, as depicted in FIG. 4, stipulates provision of at least two spectral shaping filters (420) to select from (one of which may have a unity transfer function), and requires that the LTP vector be explicitly filtered by the spectral shaping filter being evaluated. An alternate implementation of this approach is also described, whereby at least two distinct interpolation filters are provided, each having distinct spectral shaping. In either of those two implementations, the filtered version of the LTP vector is then used to generate a distortion metric, which is evaluated (408) to select which of the at least two spectral shaping filters to use (421), in conjunction with the LTP filter parameters. Although this technique does provide the means to vary spectral shaping, it requires that a spectrally shaped version of the LTP vector be explicitly generated prior to the computation of the distortion metric corresponding to that LTP vector and spectral shaping filter combination. 
If a large set of spectral shaping filters is provided to select from, this may result in an appreciable increase in complexity due to the filtering operations. Also, the information related to the selected filter, such as an index m, needs to be quantized and conveyed from the encoder (via multiplexer 109) to the decoder.
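The cost of selecting among explicit shaping filters can be made concrete with a small Python sketch. It is a simplified stand-in rather than the procedure of Bessette et al.: the shaping filters are assumed to be causal FIR filters with zero initial state, the LP synthesis and perceptual weighting stages are left out, the gain is the unquantized least-squares optimum, and all names are hypothetical.

```python
def fir(x, h):
    """Causal FIR filtering of x by taps h, starting from a zero state."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def select_shaping_filter(target, ltp_vector, filters):
    """Return (m, gain, error) for the candidate filter whose shaped,
    optimally scaled LTP vector is closest to target in squared error."""
    best = None
    for m, h in enumerate(filters):
        shaped = fir(ltp_vector, h)  # explicit filtering for every candidate
        energy = sum(v * v for v in shaped)
        if energy == 0.0:
            continue
        gain = sum(t * v for t, v in zip(target, shaped)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, shaped))
        if best is None or error < best[2]:
            best = (m, gain, error)
    return best
```

Every candidate requires its own filtering pass before its distortion can be computed, so the work grows linearly with the number of filters, and the winning index m must still be conveyed to the decoder.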
Therefore, a need exists for a method and apparatus for speech coding that is capable of efficiently modeling (with low complexity) the non-integral values of delay as well as having an ability to provide spectral shaping.