1. Field of the Invention
The present invention is generally in the field of signal coding. In particular, the present invention is in the field of speech coding and specifically relates to improving packet loss concealment performance.
2. Background Art
Traditionally, all parametric speech coding methods make use of the redundancy inherent in the speech signal to reduce the amount of information that must be sent and to estimate the parameters of speech samples of a signal at short intervals. This redundancy primarily arises from the repetition of speech wave shapes at a quasi-periodic rate and from the slowly changing spectral envelope of the speech signal.
The redundancy of speech waveforms may be considered with respect to several different types of speech signal, such as voiced and unvoiced. For voiced speech, the speech signal is essentially periodic; however, this periodicity may vary over the duration of a speech segment, and the shape of the periodic wave usually changes gradually from segment to segment. Low bit rate speech coding can benefit greatly from exploiting such periodicity. The voiced speech period is also called pitch, and pitch prediction is often named Long-Term Prediction. Unvoiced speech, by contrast, is more like random noise and has a smaller amount of periodicity.
In either case, parametric coding may be used to reduce the redundancy of the speech segments by separating the excitation component of the speech from the spectral envelope component. The slowly changing spectral envelope can be represented by Linear Prediction (also called Short-Term Prediction). Low bit rate speech coding can also benefit greatly from exploiting such Short-Term Prediction. The coding advantage arises from the slow rate at which the parameters change: the parameters rarely differ significantly from the values they held a few milliseconds earlier. Accordingly, at a sampling rate of 8 kHz or 16 kHz, speech coding algorithms use a nominal frame duration in the range of ten to thirty milliseconds. A frame duration of twenty milliseconds seems to be the most common choice. In more recent well-known standards such as G.723, G.729, EFR or AMR, the Code Excited Linear Prediction Technique (“CELP”) has been adopted; CELP is commonly understood as a technical combination of Coded Excitation, Long-Term Prediction and Short-Term Prediction. CELP speech coding is a very popular algorithmic principle in the speech compression area.
FIG. 1 shows the initial CELP encoder where the weighted error 109 between the synthesized speech 102 and the original speech 101 is minimized by using a so-called analysis-by-synthesis approach. W(z) is the weighting filter 110. 1/B(z) is a long-term linear prediction filter 105; 1/A(z) is a short-term linear prediction filter 103. The code-excitation 108, which is also called fixed codebook excitation, is scaled by a gain Gc 107 before going through the linear filters.
FIG. 2 shows the initial decoder which adds the post-processing block 207 after the synthesized speech.
FIG. 3 shows the basic CELP encoder, which realizes the long-term linear prediction by using an adaptive codebook 307 containing the past synthesized excitation 304. The periodic information of pitch is employed to generate the adaptive component of the excitation. This excitation component is then scaled by a gain Gp 305 (also called pitch gain). The two scaled excitation components are added together before going through the short-term linear prediction filter 303. The two gains (Gp and Gc) need to be quantized and then sent to the decoder.
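As an illustration of how an adaptive codebook can generate the adaptive excitation component from the past synthesized excitation, the following is a minimal sketch only. The function name, the integer pitch lag, and the list-based history buffer are assumptions made for this example; practical codecs commonly use fractional lags with interpolation, which is not shown here.

```python
def adaptive_codebook_vector(past_excitation, pitch_lag, subframe_len):
    """Return one subframe of adaptive excitation ep(n).

    past_excitation: earlier excitation samples, most recent sample last.
    pitch_lag: pitch period in samples (assumed to be an integer here).
    subframe_len: number of samples to generate.
    """
    if pitch_lag <= 0 or pitch_lag > len(past_excitation):
        raise ValueError("pitch lag must lie within the stored history")
    buf = list(past_excitation)
    start = len(buf)
    for n in range(subframe_len):
        # Each new sample repeats the sample one pitch period earlier.
        # When the lag is shorter than the subframe, samples generated
        # earlier in this same loop are reused, extending the periodicity.
        buf.append(buf[start + n - pitch_lag])
    return buf[start:]


if __name__ == "__main__":
    history = [1, 2, 3, 4]
    print(adaptive_codebook_vector(history, pitch_lag=2, subframe_len=5))
```

With a lag of 2 and the history shown, the generated subframe simply repeats the last two stored samples.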
FIG. 4 shows the basic decoder, corresponding to the encoder in FIG. 3, which adds the post-processing block 408 after the synthesized speech.
The total excitation to the short-term linear filter 303 is a combination of two components: one from the adaptive codebook 307 and the other from the fixed codebook 308. For strongly voiced speech, the adaptive codebook contribution plays an important role because the adjacent pitch cycles of voiced speech are similar to each other, which means mathematically that the pitch gain Gp is very high (around a value of 1). The fixed codebook contribution is needed for both voiced and unvoiced speech. The combined excitation can be expressed as
e(n) = Gp·ep(n) + Gc·ec(n)  (1)
where ep(n) is one subframe of a sample series indexed by n, coming from the adaptive codebook 307, which consists of the past excitation 304; ec(n) is from the coded excitation codebook 308 (also called fixed codebook), which is the current excitation contribution. For voiced speech, the contribution of ep(n) from the adaptive codebook can be significant, with the pitch gain Gp 305 around a value of 1. The excitation is usually updated for each subframe. A typical frame size is 20 milliseconds and a typical subframe size is 5 milliseconds.
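The combination in equation (1) and the per-subframe update of the excitation history can be sketched as follows. This is an illustrative example under stated assumptions: the function names, the list-based buffers, and the fixed history length are hypothetical and are not taken from any particular codec.

```python
def samples_per_block(sample_rate_hz, duration_ms):
    """Number of samples in a frame or subframe of the given duration."""
    return sample_rate_hz * duration_ms // 1000


def combine_excitation(ep, ec, gp, gc):
    """Equation (1): e(n) = Gp*ep(n) + Gc*ec(n) for one subframe."""
    if len(ep) != len(ec):
        raise ValueError("ep and ec must cover the same subframe")
    return [gp * a + gc * b for a, b in zip(ep, ec)]


def update_history(past_excitation, e, max_len):
    """Append the new subframe and keep only the newest max_len samples."""
    return (list(past_excitation) + list(e))[-max_len:]


if __name__ == "__main__":
    # A 20 ms frame and a 5 ms subframe at 8 kHz sampling.
    print(samples_per_block(8000, 20), samples_per_block(8000, 5))
    e = combine_excitation([1.0, 2.0], [0.5, -0.5], gp=1.0, gc=2.0)
    print(e)
    print(update_history([1, 2, 3], [4, 5], max_len=4))
```

At 8 kHz, the typical sizes mentioned above correspond to 160 samples per frame and 40 samples per subframe.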
The form of the excitation from the fixed codebook 308 has a long history. Three major factors influence the design of the coded excitation generation: the first is the perceptual quality, the second is the computational complexity, and the third is the memory size required. The very first excitation model consisted of random noise excitation. Noise excitation can produce good quality for unvoiced speech but may not be good enough for voiced speech. Another well-known excitation model is pulse-like excitation, such as Multi-Pulse Excitation, in which the position and the magnitude of every possible pulse need to be coded and sent to the decoder. Pulse excitation can produce good quality for voiced speech. A variant pulse excitation model is called the ACELP excitation model or Binary excitation model, in which each pulse position index needs to be sent to the decoder; however, all the magnitudes are set to a constant value of 1, so that only the signs (+1 or −1) need to be sent to the decoder. This is currently the most popular excitation model and is used in several international standards.
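To make the sign-only pulse model concrete, the following sketch builds a fixed codebook vector ec(n) from a list of pulse positions and signs. It is an illustration only: the function name and the free choice of positions are assumptions, and the track structure that real ACELP codebooks impose on the allowed positions is not modeled.

```python
def pulse_excitation(positions, signs, subframe_len):
    """Build ec(n) from unit-magnitude pulses.

    positions: sample index of each pulse within the subframe.
    signs: +1 or -1 for each pulse (the only magnitude information sent).
    """
    if len(positions) != len(signs):
        raise ValueError("one sign is required per pulse position")
    ec = [0] * subframe_len
    for pos, sign in zip(positions, signs):
        if sign not in (1, -1):
            raise ValueError("pulse sign must be +1 or -1")
        if not 0 <= pos < subframe_len:
            raise ValueError("pulse position outside the subframe")
        ec[pos] += sign  # pulses sharing a position add together
    return ec


if __name__ == "__main__":
    print(pulse_excitation([0, 3], [1, -1], subframe_len=5))
```

Only the position indices and one sign bit per pulse describe the vector, which is why this model is economical in bits compared with coding a magnitude for every pulse.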
Gain quantization systems can be classified as Scalar Quantization (SQ) or Vector Quantization (VQ); as direct or indirect quantization; and as predictive or non-predictive quantization. A system may also be any combination of these approaches. Scalar Quantization (SQ) means that each parameter is quantized independently (one by one). Vector Quantization (VQ) quantizes the parameters together as a group, which usually requires a pre-stored codebook table; the best quantized parameter vector is selected from the table in order to profit from the correlation between parameters. A direct quantization system quantizes the two gains (Gp 305 and Gc 306) directly. An indirect quantization system transforms the two parameters into another group of parameters and then quantizes the transformed parameters; the quantization indexes are sent to the decoder, where the parameters are transformed back into the direct domain (the original form). Predictive quantization uses the previously quantized parameters to predict the current parameter(s) and quantizes only the unpredictable portion. The prediction can help reduce the number of bits needed to quantize the parameters, but it can introduce error propagation if a bit-stream packet is lost during transmission.
The present invention proposes a transformed quantization system that can quickly recover the correct excitation energy after packet loss and significantly reduce error propagation.
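The error propagation caused by predictive quantization can be illustrated with a deliberately simple sketch. The first-order predictor, the uniform scalar quantizer, the concealment rule (treating the lost residual as zero), and all names below are assumptions chosen for this example; they do not represent the quantizer of any particular standard.

```python
def encode_predictive(values, alpha, step):
    """Quantize only the prediction residual of each value."""
    indices = []
    prev = 0.0  # previous quantized value, shared state with the decoder
    for x in values:
        pred = alpha * prev
        idx = round((x - pred) / step)
        indices.append(idx)
        prev = pred + idx * step
    return indices


def decode_predictive(indices, alpha, step, lost=()):
    """Decode; frames listed in `lost` are concealed with a zero residual."""
    out = []
    prev = 0.0
    for i, idx in enumerate(indices):
        pred = alpha * prev
        prev = pred if i in lost else pred + idx * step
        out.append(prev)
    return out


if __name__ == "__main__":
    gains = [10.0] * 6
    idx = encode_predictive(gains, alpha=0.5, step=1.0)
    clean = decode_predictive(idx, alpha=0.5, step=1.0)
    lossy = decode_predictive(idx, alpha=0.5, step=1.0, lost={1})
    # The mismatch persists after the lost frame and shrinks only gradually.
    print([c - l for c, l in zip(clean, lossy)])
```

In this sketch the decoder state diverges from the encoder state at the lost frame, and every later frame that is received correctly still decodes to a wrong value; the mismatch decays only by the factor alpha per frame.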