Traditionally, all parametric speech coding methods make use of the redundancy inherent in the speech signal to reduce the amount of information that must be sent and to estimate the parameters of the speech samples of a signal at short intervals. This redundancy arises primarily from the repetition of speech wave shapes at a quasi-periodic rate and from the slowly changing spectral envelope of the speech signal.
The redundancy of speech waveforms may be considered with respect to several different types of speech signal, such as voiced and unvoiced. For voiced speech, the speech signal is essentially periodic; however, this periodicity may vary over the duration of a speech segment, and the shape of the periodic wave usually changes gradually from segment to segment. Low bit rate speech coding can benefit greatly from exploiting such periodicity. The period of voiced speech is also called the pitch, and pitch prediction is often called Long-Term Prediction. Unvoiced speech, by contrast, is more like random noise and is less predictable.
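The quasi-periodicity described above can be illustrated with a minimal open-loop pitch search based on normalized correlation. This is only a sketch under stated assumptions, not the method of any particular codec: the function names `make_voiced` and `estimate_pitch_lag`, the synthetic test signal, and the lag range are all hypothetical choices made for illustration.

```python
import math


def make_voiced(period, length):
    """Synthetic 'voiced' signal (assumption): a fundamental plus one harmonic."""
    return [
        math.sin(2 * math.pi * n / period) + 0.5 * math.sin(4 * math.pi * n / period)
        for n in range(length)
    ]


def estimate_pitch_lag(signal, min_lag, max_lag):
    """Return (lag, score) for the lag whose delayed copy best matches the signal.

    The score is the normalized correlation between signal[n] and
    signal[n - lag]; it approaches 1.0 when the signal repeats every `lag` samples.
    """
    best_lag, best_score = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        cross = 0.0
        energy_now = 0.0
        energy_past = 0.0
        for n in range(lag, len(signal)):
            cross += signal[n] * signal[n - lag]
            energy_now += signal[n] ** 2
            energy_past += signal[n - lag] ** 2
        denom = math.sqrt(energy_now * energy_past)
        score = cross / denom if denom > 0.0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score


if __name__ == "__main__":
    # 50-sample period at 8 kHz corresponds to a 160 Hz pitch (illustrative).
    lag, score = estimate_pitch_lag(make_voiced(50, 320), 20, 90)
    print(lag, round(score, 3))
```

Note that integer multiples of the true period correlate equally well for a perfectly periodic signal, so a practical search usually restricts the lag range or favors shorter lags; the sketch simply keeps the range below two periods.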
In either case, parametric coding may be used to reduce the redundancy of the speech segments by separating the excitation component of the speech from the spectral envelope component. The slowly changing spectral envelope can be represented by Linear Prediction (also called Short-Term Prediction). Low bit rate speech coding can also benefit greatly from exploiting such Short-Term Prediction. The coding advantage arises from the slow rate at which the parameters change: it is rare for the parameters to differ significantly from the values they held a few milliseconds earlier. Accordingly, at a sampling rate of 8 kilohertz (kHz) or 16 kHz, speech coding algorithms typically use a nominal frame duration in the range of ten to thirty milliseconds, with twenty milliseconds being the most common choice. In more recent well-known standards such as G.723.1, G.729, enhanced full rate (EFR) and adaptive multi-rate (AMR), the Code Excited Linear Prediction (CELP) technique has been adopted; CELP is commonly understood as a technical combination of Code-Excitation, Long-Term Prediction and Short-Term Prediction. CELP speech coding is a very popular algorithmic principle in the area of speech compression.
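As an illustration of Short-Term Prediction and of the frame sizes mentioned above, the following sketch computes the number of samples per frame and derives linear prediction coefficients from a frame's autocorrelation using the Levinson-Durbin recursion. It is a minimal example under stated assumptions: the function names are hypothetical, the predictor convention is x̂(n) = Σ a_k·x(n−k), and refinements that real codecs apply (windowing, lag windowing, bandwidth expansion) are omitted.

```python
def samples_per_frame(sample_rate_hz, frame_ms):
    """Number of samples in one frame, e.g. 20 ms at 8 kHz."""
    return sample_rate_hz * frame_ms // 1000


def autocorrelation(frame, order):
    """Autocorrelation values r[0..order] of one frame (no windowing here)."""
    return [
        sum(frame[n] * frame[n - k] for n in range(k, len(frame)))
        for k in range(order + 1)
    ]


def levinson_durbin(r, order):
    """Solve the normal equations for the predictor x_hat(n) = sum_k a[k] * x(n-k).

    Returns (coefficients a[1..order], residual prediction error energy).
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate frame (e.g. silence); stop early
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this order
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err


if __name__ == "__main__":
    print(samples_per_frame(8000, 20), samples_per_frame(16000, 20))
    # Autocorrelation of an ideal first-order process with coefficient 0.5.
    print(levinson_durbin([1.0, 0.5, 0.25], 2))
```

Because the spectral envelope changes slowly, one such coefficient set per frame is normally sufficient, which is where the coding advantage comes from.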
The CELP algorithm is often based on an analysis-by-synthesis approach, which is also called a closed-loop approach. In an initial CELP encoder, a weighted coding error between a synthesized speech and an original speech is minimized by using the analysis-by-synthesis approach. The weighted coding error is generated by filtering a coding error with a weighting filter W(z). The synthesized speech is produced by passing an excitation through a Short-Term Prediction (STP) filter, which is often denoted 1/A(z); the STP filter is also called the Linear Prediction Coding (LPC) filter or synthesis filter. One component of the excitation is called the Long-Term Prediction (LTP) component. The Long-Term Prediction can be realized by using an adaptive codebook (AC) containing a past synthesized excitation; pitch periodic information is employed to generate the adaptive codebook component of the excitation. The LTP filter can be denoted 1/B(z), and the LTP excitation component is scaled by at least one gain Gp. There is at least a second excitation component. In CELP, the second excitation component is called the code-excitation, also called the fixed codebook excitation, which is scaled by a gain Gc. The name fixed codebook comes from the fact that the second excitation is produced from a fixed codebook in the initial CELP codec. In general, it is not always necessary to generate the second excitation from a fixed codebook; in many recent CELP coders there is actually no real fixed codebook. In a decoder, a post-processing block is often applied after the synthesized speech, which could include long-term post-processing and/or short-term post-processing.
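The closed-loop principle can be sketched as follows: each candidate excitation is passed through the synthesis filter 1/A(z), the best gain is computed, and the candidate with the smallest error against the target is kept. This is a simplified illustration, not the search of any specific standard. The function names are hypothetical, the convention A(z) = 1 − Σ a_k·z^(−k) is an assumption, and the weighting filter W(z) is omitted, so the target is assumed to be already in the domain in which the error is measured.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis filter 1/A(z) with A(z) = 1 - sum_k lpc[k-1] * z^-k.

    Starts from zero filter memory (an assumption made for simplicity).
    """
    history = [0.0] * len(lpc)  # most recent output sample first
    out = []
    for e in excitation:
        s = e + sum(a * h for a, h in zip(lpc, history))
        out.append(s)
        history = [s] + history[:-1]
    return out


def search_codebook(target, codebook, lpc):
    """Analysis-by-synthesis search over candidate code vectors.

    Returns (index, gain, squared_error) of the best candidate, or None
    if every candidate synthesizes to silence.
    """
    best = None
    for index, code in enumerate(codebook):
        synthesized = synthesize(code, lpc)
        energy = sum(v * v for v in synthesized)
        if energy == 0.0:
            continue
        correlation = sum(t * v for t, v in zip(target, synthesized))
        gain = correlation / energy  # least-squares optimal gain
        error = sum((t - gain * v) ** 2 for t, v in zip(target, synthesized))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


if __name__ == "__main__":
    lpc = [0.5]
    codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, -1.0, 0.0, 0.0]]
    target = synthesize([0.0, 2.0, 0.0, 0.0], lpc)
    print(search_codebook(target, codebook, lpc))
```

In a full encoder the same loop would be run on perceptually weighted signals and would be applied first to the adaptive codebook contribution and then to the second excitation component.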
Long-Term Prediction plays an important role in voiced speech coding because voiced speech has strong periodicity. The adjacent pitch cycles of voiced speech are similar to each other, which means mathematically that the pitch gain Gp in the excitation expression, e(n)=Gp·ep(n)+Gc·ec(n), is very high. Here ep(n) is one subframe of samples indexed by n, coming from the adaptive codebook, which consists of the past excitation; ec(n) is generated from the code-excitation codebook (fixed codebook) or produced without using any fixed codebook, and this second excitation component is the current excitation contribution. For voiced speech, the contribution of ep(n) can be dominant and the pitch gain Gp is around a value of 1. The excitation is usually updated for each subframe. A typical frame size is 20 milliseconds and a typical subframe size is 5 milliseconds. If a previous bit-stream packet is lost and the pitch gain Gp is high, the incorrect estimate of the previous synthesized excitation can cause error propagation for quite a long time after the decoder has again received a correct bit-stream packet. Part of the reason for this error propagation is that the phase relationship between ep(n) and ec(n) has been changed by the previous bit-stream packet loss. One simple solution to this issue is to completely cut (remove) the pitch contribution between frames, which means that the pitch gain Gp is set to zero in the encoder. Although this kind of solution solves the error propagation problem, it sacrifices too much quality when there is no bit-stream packet loss, or it requires a much higher bit rate to achieve the same quality. The invention explained in the following provides a compromise solution.
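The effect of the pitch gain on error propagation can be illustrated with a toy decoder that builds e(n)=Gp·ep(n)+Gc·ec(n) subframe by subframe and compares a clean run against a run in which one subframe is lost. This is only a sketch under stated assumptions: all function names are hypothetical, the lost subframe is concealed crudely by zeroing its second excitation component, the pitch lag equals the subframe length (40 samples, i.e. 5 milliseconds at 8 kHz), and the code vectors are single pulses chosen for illustration.

```python
def adaptive_codebook_vector(past, lag, length):
    """ep(n): copy `length` samples starting one pitch lag back in the past excitation."""
    if lag < length or lag > len(past):
        raise ValueError("this sketch needs length <= lag <= len(past)")
    start = len(past) - lag
    return past[start:start + length]


def build_excitation(past, lag, gp, gc, code):
    """e(n) = gp * ep(n) + gc * ec(n) for one subframe."""
    ep = adaptive_codebook_vector(past, lag, len(code))
    return [gp * p + gc * c for p, c in zip(ep, code)]


def decode(codes, lag, gp, gc, lost=frozenset()):
    """Decode all subframes; indices in `lost` get a zeroed ec(n) (crude concealment)."""
    past = [0.0] * lag
    out = []
    for i, code in enumerate(codes):
        if i in lost:
            code = [0.0] * len(code)
        e = build_excitation(past, lag, gp, gc, code)
        past = (past + e)[-lag:]  # update the adaptive codebook memory
        out.append(e)
    return out


def error_energy_per_subframe(gp, gc=1.0, lag=40, length=40, count=8, lost_index=2):
    """Squared excitation error per subframe between a clean and a damaged decode."""
    codes = []
    for i in range(count):
        code = [0.0] * length
        code[(7 * i) % length] = 1.0 if i % 2 == 0 else -1.0  # one pulse per subframe
        codes.append(code)
    clean = decode(codes, lag, gp, gc)
    damaged = decode(codes, lag, gp, gc, lost={lost_index})
    return [
        sum((a - b) ** 2 for a, b in zip(x, y))
        for x, y in zip(clean, damaged)
    ]


if __name__ == "__main__":
    print([round(v, 3) for v in error_energy_per_subframe(0.9)])
    print([round(v, 3) for v in error_energy_per_subframe(0.0)])
```

With a high pitch gain the error introduced in the lost subframe decays only slowly through the correctly received subframes that follow, whereas with Gp set to zero it disappears as soon as a correct subframe arrives, at the cost of discarding the Long-Term Prediction gain entirely.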
A common problem of parametric speech coding is that some parameters may be very sensitive to packet loss or bit errors occurring during transmission from an encoder to a decoder. If a transmission channel may be in a very bad condition, it is worthwhile to design a speech coder that achieves a good compromise between speech coding quality under good channel conditions and speech coding quality under bad channel conditions.