Traditionally, all parametric speech coding methods make use of the redundancy inherent in the speech signal to reduce the amount of information that must be sent and to estimate the parameters of speech samples of a signal at short intervals. This redundancy primarily arises from the repetition of speech wave shapes at a quasi-periodic rate and from the slowly changing spectral envelope of the speech signal.
The redundancy of speech waveforms may be considered with respect to several different types of speech signal, such as voiced and unvoiced. For voiced speech, the speech signal is essentially periodic; however, this periodicity may vary over the duration of a speech segment, and the shape of the periodic wave usually changes gradually from segment to segment. Low bit rate speech coding could greatly benefit from exploiting such periodicity. The voiced speech period is also called pitch, and pitch prediction is often named Long-Term Prediction (LTP). Unvoiced speech, by contrast, is more like random noise and is less predictable.
In either case, parametric coding may be used to reduce the redundancy of the speech segments by separating the excitation component of the speech signal from the spectral envelope component. The slowly changing spectral envelope can be represented by Linear Prediction Coding (LPC), also known as Short-Term Prediction (STP). Low bit rate speech coding could also benefit from exploiting such Short-Term Prediction. The coding advantage arises from the slow rate at which the parameters change: the parameters rarely differ significantly from the values they held a few milliseconds earlier. Accordingly, at a sampling rate of 8 kHz, 12.8 kHz or 16 kHz, the nominal frame duration of a speech coding algorithm is in the range of ten to thirty milliseconds, with a frame duration of twenty milliseconds being most common. In more recent well-known standards such as G.723.1, G.729, G.718, EFR, SMV, AMR, VMR-WB or AMR-WB, the Code Excited Linear Prediction technique ("CELP") has been adopted, which is commonly understood as a technical combination of Coded Excitation, Long-Term Prediction and Short-Term Prediction. CELP speech coding is a very popular algorithmic principle in the speech compression area, although the details of CELP differ significantly between codecs.
FIG. 1 illustrates a conventional CELP encoder in which the weighted error 109 between synthesized speech 102 and original speech 101 is minimized, often by using a so-called analysis-by-synthesis approach. W(z) is an error weighting filter 110, 1/B(z) is a long-term linear prediction filter 105, and 1/A(z) is a short-term linear prediction filter 103. The coded excitation 108, which is also called fixed codebook excitation, is scaled by gain Gc 106 before going through the linear filters. The short-term linear filter 103 is obtained by analyzing the original signal 101 and is represented by a set of coefficients:
$$A(z) = 1 + \sum_{i=1}^{P} a_i \cdot z^{-i}, \qquad i = 1, 2, \ldots, P. \tag{1}$$
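As a minimal sketch of how the short-term filter acts on a signal, the following Python fragment applies A(z) to obtain a prediction residual and then 1/A(z) to rebuild the signal. It assumes the usual convention A(z) = 1 + a_1 z^-1 + ... + a_P z^-P with zero initial filter state; the function names and the coefficient values are hypothetical and chosen only for illustration.

```python
def lpc_residual(signal, a):
    """Filter `signal` through A(z) = 1 + sum_i a[i-1] * z^-i (analysis)."""
    out = []
    for n, x in enumerate(signal):
        acc = x
        for i, ai in enumerate(a, start=1):
            if n >= i:
                acc += ai * signal[n - i]
        out.append(acc)
    return out


def lpc_synthesis(excitation, a):
    """Filter `excitation` through 1/A(z) (synthesis), zero initial state."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for i, ai in enumerate(a, start=1):
            if n >= i:
                acc -= ai * out[n - i]
        out.append(acc)
    return out


# Illustrative (assumed) second-order coefficients, P = 2.
a = [-0.9, 0.2]
speech = [1.0, 0.5, -0.25, 0.75, 0.0, -1.0]

residual = lpc_residual(speech, a)   # excitation-like signal
rebuilt = lpc_synthesis(residual, a)  # passes the residual back through 1/A(z)
```

Without quantization, the synthesis filter undoes the analysis filter, so `rebuilt` matches `speech` up to rounding error.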
The weighting filter 110 is derived from the above short-term prediction filter. A typical form of the weighting filter is:
$$W(z) = \frac{A(z/\alpha)}{A(z/\beta)}, \tag{2}$$
where β<α, 0<β<1, 0<α≤1. In the standard codec ITU-T G.718, the perceptual weighting filter has the following form:
$$W(z) = A(z/\gamma_1)\, H_{de\text{-}emph}(z) = A(z/\gamma_1) / (1 - \beta_1 z^{-1}), \tag{3}$$
where
$$H_{de\text{-}emph}(z) = \frac{1}{1 - \beta_1 z^{-1}} \tag{4}$$
and β1 is equal to 0.68.
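The two weighting-filter forms above can be sketched in Python as follows. The sketch relies on the identity that A(z/γ) has coefficients a_i·γ^i, which follows from substituting z/γ for z in A(z) = 1 + a_1 z^-1 + ... + a_P z^-P (an assumed convention). All function names are hypothetical, and the value 0.92 used for gamma1 in the usage lines is only illustrative; it is not taken from the text above.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): each a_i is scaled by gamma**i."""
    return [ai * gamma ** i for i, ai in enumerate(a, start=1)]


def apply_A(signal, a):
    """FIR filtering through A(z) = 1 + sum_i a_i z^-i, zero initial state."""
    return [
        x + sum(ai * signal[n - i] for i, ai in enumerate(a, start=1) if n >= i)
        for n, x in enumerate(signal)
    ]


def apply_inverse_A(signal, a):
    """IIR filtering through 1/A(z), zero initial state."""
    out = []
    for n, x in enumerate(signal):
        out.append(x - sum(ai * out[n - i] for i, ai in enumerate(a, start=1) if n >= i))
    return out


def weight_general(signal, a, alpha, beta):
    """W(z) = A(z/alpha) / A(z/beta), the typical form."""
    numerator_out = apply_A(signal, bandwidth_expand(a, alpha))
    return apply_inverse_A(numerator_out, bandwidth_expand(a, beta))


def weight_deemph(signal, a, gamma1, beta1=0.68):
    """W(z) = A(z/gamma1) / (1 - beta1 * z^-1), the de-emphasis form."""
    numerator_out = apply_A(signal, bandwidth_expand(a, gamma1))
    # 1 - beta1*z^-1 is a first-order A(z) with a_1 = -beta1.
    return apply_inverse_A(numerator_out, [-beta1])


a = [-0.9, 0.2]                      # illustrative coefficients
x = [1.0, 0.0, 0.0, 0.0]             # unit impulse
w1 = weight_general(x, a, alpha=1.0, beta=0.8)
w2 = weight_deemph(x, a, gamma1=0.92)  # gamma1 value is an assumption
```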
The long-term prediction 105 depends on pitch and pitch gain. A pitch may be estimated, for example, from the original signal, residual signal, or weighted original signal. The long-term prediction function in principle may be expressed as
$$B(z) = 1 - \beta \cdot z^{-Pitch}. \tag{5}$$
Here β denotes the long-term prediction coefficient and is distinct from the β of equation (2).
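To illustrate equation (5), the sketch below applies B(z) and its inverse 1/B(z) to a signal with an integer pitch lag. It assumes zero filter memory before the first sample; the function and parameter names (`ltp_analysis`, `ltp_synthesis`, `pitch`, `beta`) are hypothetical.

```python
def ltp_analysis(signal, pitch, beta):
    """Filter through B(z) = 1 - beta * z^-pitch (removes the periodic part)."""
    return [
        x - beta * (signal[n - pitch] if n >= pitch else 0.0)
        for n, x in enumerate(signal)
    ]


def ltp_synthesis(excitation, pitch, beta):
    """Filter through 1/B(z): each output sample adds a scaled copy
    of the output one pitch period earlier."""
    out = []
    for n, e in enumerate(excitation):
        past = out[n - pitch] if n >= pitch else 0.0
        out.append(e + beta * past)
    return out


impulse = [1.0] + [0.0] * 9
periodic = ltp_synthesis(impulse, pitch=4, beta=0.5)
# periodic == [1.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.25, 0.0]
recovered = ltp_analysis(periodic, pitch=4, beta=0.5)
```

A single pulse fed through 1/B(z) becomes a pulse train repeating every `pitch` samples and decaying by `beta` per period, which is the periodic structure that long-term prediction models in voiced speech.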
The coded excitation 108 normally comprises a pulse-like signal or a noise-like signal, which is either mathematically constructed or stored in a codebook. Finally, the coded excitation index, quantized gain index, quantized long-term prediction parameter index, and quantized short-term prediction parameter index are transmitted to the decoder.
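The analysis-by-synthesis idea of FIG. 1 can be sketched as an exhaustive search: each candidate excitation is synthesized, scaled by its best gain, and compared with a target. This is a simplified illustration only. It omits the long-term filter and the weighting filter (the target is assumed to be already in the domain where the error is measured), uses a tiny made-up codebook, and all names are hypothetical; practical codecs use much faster structured searches.

```python
def synthesize(excitation, a):
    """Pass `excitation` through 1/A(z), A(z) = 1 + sum_i a_i z^-i."""
    out = []
    for n, e in enumerate(excitation):
        out.append(e - sum(ai * out[n - i] for i, ai in enumerate(a, start=1) if n >= i))
    return out


def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the entry whose scaled synthesis
    is closest to `target` in the squared-error sense."""
    best = None
    for index, code in enumerate(codebook):
        y = synthesize(code, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue  # an all-zero entry cannot match anything
        # Optimal gain for this entry in the least-squares sense.
        gain = sum(t * v for t, v in zip(target, y)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


codebook = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [1.0, 0.0, -1.0, 0.0]]     # made-up pulse-like entries
a = [-0.5]                              # illustrative first-order A(z)
target = [2.0 * v for v in synthesize(codebook[2], a)]
index, gain, error = search_codebook(target, codebook, a)
```

Only `index` and a quantized version of `gain` would need to be transmitted; the decoder holds the same codebook and can regenerate the excitation.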
FIG. 2 illustrates an initial decoder that adds a post-processing block 207 after synthesized speech 206. The decoder is a combination of several blocks: coded excitation 201, excitation gain 202, long-term prediction 203, short-term prediction 205 and post-processing 207. Every block except post-processing block 207 has the same definition as described in the encoder of FIG. 1. Post-processing block 207 may also include short-term post-processing and long-term post-processing.
FIG. 3 shows a basic CELP encoder that realizes the long-term linear prediction by using an adaptive codebook 307 containing a past synthesized excitation 304, or by repeating a past excitation pitch cycle at the pitch period. The pitch lag may be encoded as an integer value when it is large, and as a more precise fractional value when it is small. The periodic information of pitch is employed to generate the adaptive component of the excitation. This excitation component is then scaled by gain Gp 305 (also called pitch gain). The second excitation component is generated by coded-excitation block 308 and is scaled by gain Gc 306. Gc is also referred to as fixed codebook gain, since the coded excitation often comes from a fixed codebook. The two scaled excitation components are added together before going through the short-term linear prediction filter 303. The two gains (Gp and Gc) are quantized and then sent to a decoder.
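A minimal sketch of how an adaptive codebook vector might be built from the past excitation is given below. It handles integer pitch lags only; fractional lags would need interpolation of the past excitation, which is not shown. The function name and arguments are hypothetical.

```python
def adaptive_codebook_vector(past_excitation, pitch_lag, subframe_len):
    """Copy the excitation from `pitch_lag` samples back. When the lag is
    shorter than the subframe, the copied pitch cycle is repeated."""
    if not 1 <= pitch_lag <= len(past_excitation):
        raise ValueError("pitch_lag must lie within the stored past excitation")
    buf = list(past_excitation)
    start = len(buf) - pitch_lag
    for n in range(subframe_len):
        # For n >= pitch_lag this reads samples appended earlier in this
        # loop, which is what repeats the pitch cycle.
        buf.append(buf[start + n])
    return buf[len(past_excitation):]


past = [0.0, 0.0, 1.0, 2.0, 3.0]
short_lag = adaptive_codebook_vector(past, pitch_lag=3, subframe_len=8)
# short_lag == [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0]
long_lag = adaptive_codebook_vector(past, pitch_lag=5, subframe_len=4)
# long_lag == [0.0, 0.0, 1.0, 2.0]
```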
FIG. 4 illustrates a conventional decoder corresponding to the encoder in FIG. 3, which adds a post-processing block 408 after a synthesized speech 407. This decoder is similar to FIG. 2 with the addition of adaptive codebook 307. The decoder is a combination of several blocks: coded excitation 402, adaptive codebook 401, short-term prediction 406, and post-processing 408. Every block except post-processing block 408 has the same definition as described in the encoder of FIG. 3. Post-processing block 408 may further include short-term post-processing and long-term post-processing.
Long-Term Prediction plays a very important role in voiced speech coding because voiced speech has a strong periodicity. The adjacent pitch cycles of voiced speech are similar to each other, which means mathematically that the pitch gain Gp in the following excitation expression is high or close to 1:
$$e(n) = G_p \cdot e_p(n) + G_c \cdot e_c(n), \tag{6}$$
where ep(n) is one subframe of a sample series indexed by n, coming from the adaptive codebook 307, which comprises the past excitation 304; ep(n) may be adaptively low-pass filtered, as the low frequency area is often more periodic or more harmonic than the high frequency area; ec(n) is from the coded excitation codebook 308 (also called fixed codebook) and is the current excitation contribution; and ec(n) may also be enhanced using high-pass filtering enhancement, pitch enhancement, dispersion enhancement, formant enhancement, and the like. For voiced speech, the contribution of ep(n) from the adaptive codebook may be dominant and the pitch gain Gp 305 may be a value of about 1. The excitation is usually updated for each subframe. A typical frame size is 20 milliseconds and a typical subframe size is 5 milliseconds.
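Equation (6) and the frame and subframe durations can be made concrete with a short sketch. The helper names are hypothetical and the sample values are made up; the sampling rates are the ones mentioned earlier in this section.

```python
def total_excitation(ep, ec, gp, gc):
    """e(n) = Gp * ep(n) + Gc * ec(n) for one subframe."""
    if len(ep) != len(ec):
        raise ValueError("both contributions must cover the same subframe")
    return [gp * p + gc * c for p, c in zip(ep, ec)]


def samples_per(duration_ms, sample_rate_hz):
    """Number of samples in a frame or subframe of the given duration."""
    return sample_rate_hz * duration_ms // 1000


# Frame and subframe sizes for a 20 ms frame and a 5 ms subframe.
sizes = {
    rate: (samples_per(20, rate), samples_per(5, rate))
    for rate in (8000, 12800, 16000)
}
# sizes == {8000: (160, 40), 12800: (256, 64), 16000: (320, 80)}

# Voiced-like case: adaptive contribution dominant, Gp close to 1.
e = total_excitation(ep=[1.0, 2.0], ec=[0.5, -0.5], gp=1.0, gc=2.0)
# e == [2.0, 1.0]
```

With these durations each frame contains four subframes, so the excitation would be updated four times per frame.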