One example of a CELP coder is the system covered by ITU-T Recommendation G.729, which is designed for speech signals in the telephone band from 300 hertz (Hz) to 3400 Hz sampled at 8 kHz and transmitted at a fixed bit rate of 8 kilobits per second (kbps) using 10 millisecond (ms) frames. The operation of this coder is described in detail in the paper by R. Salami, C. Laflamme, J. P. Adoul, A. Kataoka, S. Hayashi, T. Moriya, C. Lamblin, D. Massaloux, S. Proust, P. Kroon and Y. Shoham, “Design and description of CS-ACELP: a toll quality 8 kbps speech coder”, IEEE Trans. on Speech and Audio Processing, Vol. 6, No. 2, March 1998, pp. 116-130.
FIG. 1(a) is a high-level view of a G.729 coder. This figure shows high-pass preprocessing filtering 101 for eliminating signals at frequencies below 50 Hz. The filtered speech signal S(n) is then analyzed by the block 102 to determine a linear prediction coding (LPC) filter Â(z) that is sent to the multiplexer 104 in the form of an index that indexes the quantized vector (QV) in a dictionary.
The signal obtained by filtering the original signal S(n) through the filter Â(z), which is referred to as the excitation signal, is processed by the block 103 to extract from it the parameters listed in the table in FIG. 2. Those parameters are then coded and sent to the multiplexer MUX 104.
FIG. 1(b) shows in detail the operation of the excitation coding block 103. As can be seen in the figure, the excitation signal is coded in three steps:
    in a first step, long-term prediction (LTP) filtering is effected by the blocks 106, 107, 111; the LTP filter of the G.729 coder is a first order filter; the adaptive excitation period P, which is also known as the “pitch” period, is expressed as an integer value P0, complemented where appropriate by a fractional value P0_fractional; the period P and the adaptive excitation gain gp, also known as the “pitch” gain, are determined by analysis by synthesis to minimize the error between the target excitation signal from the block 105 and the synthesized signal given by x(n)=gp·x(n−P), n being the sample index;
    then, in a second step, the residual difference between these two signals is modeled, firstly, by a fixed code c(n), also known as an innovator code, extracted from an ACELP innovator dictionary 108 with four pulses of amplitude ±1, and, secondly, by a fixed excitation gain gc 109; the fixed code c(n) and the gain gc are determined by minimizing at 111′ the error between the residual signal from the preceding LTP stage and the signal gc·c(n);
    finally, in a final step, the resulting parameters, namely the pitch period P, the fixed code c(n), the pitch gain gp, and the fixed excitation gain gc, are coded and sent to the multiplexer 104.
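The closed-loop search of the first step can be sketched as follows. This is a simplified illustration and not the G.729 procedure: it works on plain lists of samples, omits perceptual weighting and fractional lags, and the names `adaptive_vector` and `search_ltp` as well as the gain ceiling `g_max` are assumptions introduced here. For each candidate lag P, the gain that minimizes the squared error between the target and gp·x(n−P) is the correlation of the two signals divided by the energy of x(n−P).

```python
def adaptive_vector(past, lag, n):
    """Return n samples of x(n - lag), repeating the period when lag < n."""
    buf = list(past)
    out = []
    for _ in range(n):
        sample = buf[-lag]      # sample located `lag` positions in the past
        out.append(sample)
        buf.append(sample)      # lets short lags re-read what was just produced
    return out


def search_ltp(target, past, lags, g_max=1.2):
    """Pick the (lag, gain, error) triple with the smallest squared error.

    g_max is an assumed ceiling; a value above 1 lets the gain follow attacks.
    """
    best = None
    for lag in lags:
        y = adaptive_vector(past, lag, len(target))
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        corr = sum(t * v for t, v in zip(target, y))
        gain = min(max(corr / energy, 0.0), g_max)   # least-squares gain, clamped
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or err < best[2]:
            best = (lag, gain, err)
    return best


# Hypothetical data: a signal of period 40 whose amplitude grows by 10 %.
past = [(k % 40) - 20 for k in range(120)]
target = [1.1 * ((k % 40) - 20) for k in range(120, 160)]
lag, gain, err = search_ltp(target, past, range(20, 81))
```

On this toy signal the search returns the true period with a gain slightly above 1, which is the situation discussed further on in connection with attacks.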
FIG. 1(c) shows how a standard G.729 decoder reconstructs the speech signal from data received by the demultiplexer 112 from the multiplexer 104. The excitation signal is reconstituted in the form of 5 ms sub-frames by adding two contributions:
    a first contribution that results from decoding (115) the pitch period P and decoding (118) the pitch gain gp to reconstitute at the output of the blocks 116, 117 the adaptive excitation LTP signal x(n)=gp·x(n−P);
    a second contribution that results from decoding (113) the fixed excitation signal c(n) scaled by the gain gc decoded by the block 118 to reconstitute the fixed excitation signal gc·c(n);
    these two contributions are then added to give the decoded excitation signal x(n)=gp·x(n−P)+gc·c(n).
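The reconstruction x(n)=gp·x(n−P)+gc·c(n) can be illustrated with a minimal sketch. The function name `decode_excitation` and its argument layout are assumptions made for this illustration; integer lags only are handled, and the parameter decoding performed by the blocks 113, 115 and 118 is taken as already done.

```python
def decode_excitation(past, lag, gp, code, gc):
    """Return one sub-frame of excitation x(n) = gp*x(n - lag) + gc*c(n).

    past : previously decoded excitation samples, most recent last
    code : fixed code c(n) for the sub-frame
    """
    buf = list(past)
    start = len(buf)
    for c in code:
        # adaptive (LTP) contribution plus fixed contribution
        buf.append(gp * buf[-lag] + gc * c)
    return buf[start:]


# Hypothetical values chosen so that the arithmetic can be checked by hand.
subframe = decode_excitation(past=[1.0, 2.0], lag=2, gp=0.5,
                             code=[1, 0, -1, 0], gc=2.0)
# subframe == [2.5, 1.0, -0.75, 0.5]
```

Because each output sample is fed back into the buffer, an error in the stored past excitation is carried into every later sub-frame, scaled by gp at each pass.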
The decoded excitation signal is shaped by an LPC synthesis filter 120, the coefficients of which are decoded by the block 119 in the LSF (line spectral frequency) domain, and interpolated at the 5 ms sub-frame level. To improve quality and to conceal certain coding artifacts, the reconstructed signal is then processed by an adaptive post-filter 121 and by a high-pass post-processing filter 122. The FIG. 1(c) decoder therefore relies on the source-filter model to synthesize the signal.
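The shaping of the excitation by the synthesis filter can be sketched as an all-pole recursion. The sketch assumes the sign convention A(z) = 1 + Σ a_k·z^(−k), so that s(n) = x(n) − Σ a_k·s(n−k); the function name `lpc_synthesis` is hypothetical, and the LSF decoding, sub-frame interpolation and post-filtering mentioned above are left out.

```python
def lpc_synthesis(excitation, a, memory=None):
    """Filter the excitation through 1/A(z), with A(z) = 1 + sum_k a[k-1] z^-k.

    a      : coefficients a_1 .. a_M (assumed sign convention, see lead-in)
    memory : last M output samples of the previous sub-frame, most recent last
    """
    order = len(a)
    hist = list(memory) if memory is not None else [0.0] * order
    out = []
    for x in excitation:
        # hist[-1 - k] is s(n - 1 - k), which is weighted by a_(k+1)
        s = x - sum(a[k] * hist[-1 - k] for k in range(order))
        out.append(s)
        hist.append(s)
    return out


# First order toy filter: the impulse response decays as 0.5**n.
response = lpc_synthesis([1.0, 0.0, 0.0], a=[-0.5])
# response == [1.0, 0.5, 0.25]
```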
With the excitation signal coming from the long-term prediction (LTP) filter, and with the aim of generating an excitation signal capable of rapidly tracking the attack of the signal, CELP coders generally allow the choice of a pitch gain gp greater than 1. Consequently, the decoder is locally unstable, because the recursion x(n)=gp·x(n−P) then amplifies its own past output. However, this instability is controlled by the analysis-by-synthesis model, which continuously minimizes the difference between the LTP excitation signal and the original target signal.
In the event of transmission errors or loss of frames, such instability can lead to serious deterioration caused by the mismatch between the coder and the decoder. Under these circumstances, a pitch gain value gp that is not received in a frame is generally replaced by the value of gp in the preceding frame. The variable nature of the speech signal, which alternates voiced periods with a pitch gain close to 1 and non-voiced periods with a pitch gain less than 1, generally limits the potential problems linked to this local instability. Nevertheless, for some signals, in particular voiced signals, transmission errors in periodic stationary areas can cause serious deterioration if, for example, the replacement gain gp is higher than the real gain and the frame concerned is followed by high-gain frames, as occurs during the attack of a signal. This situation then leads quickly to saturation of the LTP filter by a cumulative effect linked to the recursive character of long-term predictive filtering.
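The cumulative effect can be made concrete with a small simulation. It assumes that, after an error, the coder and the decoder apply the same lag and gains, so that the difference d(n) between their excitation signals obeys d(n)=gp·d(n−P); the function name `mismatch_peaks` and the single-pulse initial error are assumptions chosen for illustration.

```python
def mismatch_peaks(gains, lag=40, subframe=40, initial_error=1.0):
    """Peak of the coder/decoder excitation mismatch after each sub-frame.

    The mismatch starts as one pulse in the LTP memory and is then fed
    through d(n) = gp * d(n - lag), one gain value per sub-frame.
    """
    d = [0.0] * (lag - 1) + [initial_error]
    peaks = []
    for gp in gains:
        for _ in range(subframe):
            d.append(gp * d[-lag])
        peaks.append(max(abs(v) for v in d[-subframe:]))
    return peaks


growing = mismatch_peaks([1.2] * 10)   # sustained gain above 1
decaying = mismatch_peaks([0.9] * 10)  # gain below 1
# growing[-1] is about 6.19 (1.2**10); decaying[-1] is about 0.35 (0.9**10)
```

With a gain held at 1.2, the mismatch is multiplied by roughly six after ten sub-frames, whereas with a gain of 0.9 it fades away, which is why only sustained high-gain passages are critical.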
A first solution to this problem is to limit the pitch gain gp to 1, but this constraint has the effect of degrading the performance of the CELP coders during the attack of a signal.
Other solutions propose to limit the pitch gain gp to a value less than or equal to 1 only if this is deemed necessary. In particular:
    The method described in U.S. Pat. No. 5,960,386 can be divided into a number of stages executed in the coder. First of all, there is a procedure for detecting possible instability using the pitch gain previously calculated and an average of preceding pitch gains. If there is no risk of instability, the pitch gain previously calculated is retained. Otherwise, an iterative pitch gain control procedure adapts this gain to eliminate the risk of instability.
    A procedure for detecting instabilities in the coder is described in U.S. Pat. Nos. 5,893,060 and 5,987,406. It uses LSP parameters to determine the presence of resonance in the spectrum, calculates the duration of the resonance, expressed as a number of frames, and evaluates the possibility of instability as a function of the pitch gain value. If instability is detected, the value of the pitch gain is saturated at a threshold and the search for the gain vector in the vector quantization of the pitch gains is modified so that the vector chosen has a pitch gain value below the threshold.
    The above-mentioned paper by R. Salami and U.S. Pat. No. 5,708,757 describe a procedure for detecting possible saturation or for calculating the associated pitch gain value present in the standard G.729 coder. This method, known as “taming”, takes into account the maximum potential error of the decoder in the excitation calculation. If this error exceeds a certain threshold when the pitch gain is greater than 1, corresponding to an unstable filter, the gain is modified to take a value less than 1 in order to stabilize the filter. The idea is therefore to detect, in the coder, areas in which the accumulation of preceding transmission errors can cause saturation of the long-term filter that is locally unstable, in particular during long strongly-voiced passages. These passages are detected by examining the output of a second long-term filter with constant excitation that simulates the maximum potential error. An identical technique is referred to in ITU-T Recommendation G.723.1, where the coder uses a fifth order long-term predictor for which the pitch gain is a vector of 5 coefficients applied to 5 consecutive samples from the past. These gain vectors can be quantized by vector quantization. Although the stability of a first order long-term filter, like that of the G.729 coder, is very easy to verify by comparing the single gain coefficient with the value 1, this verification is much more complicated for a higher order long-term filter. The stability of a long-term filter using a gain set also depends on the nature of the signal, for example the pitch. Thus the same gain set can be stable in one situation but unstable in another. This makes it difficult to estimate error propagation, because the nature of the potential error may not be known to the coder, and it is not a simple matter to detect potentially unstable areas or to determine the attenuation to be applied to re-stabilize the filter. The solution implemented in Recommendation G.723.1 is to find, through a learning process, an equivalent average first order gain for each possible gain vector of the coder. These values are stored in a table. This equivalent first order filter is then used to estimate the maximum potential cumulative error in the long-term filter, to identify the unstable areas in which the gain must be limited because that cumulative error is high, and to calculate the gain to be applied to stabilize the filter.
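The principle of monitoring a second long-term filter driven by a constant excitation can be sketched as follows. This is a loose illustration of the idea only, not the procedure of the G.729 or G.723.1 recommendations: the class name `TamingMonitor`, the threshold of 50, the fallback gain of 0.95 and the buffer length are all hypothetical values chosen for the example.

```python
class TamingMonitor:
    """Coder-side estimate of the worst-case error stored in the LTP memory."""

    def __init__(self, max_lag=143, threshold=50.0, safe_gain=0.95):
        self.max_lag = max_lag        # assumed length of the error memory
        self.threshold = threshold    # hypothetical triggering threshold
        self.safe_gain = safe_gain    # hypothetical stabilizing gain (< 1)
        self.err = [0.0] * max_lag

    def choose_gain(self, lag, gain, subframe=40):
        """Return the gain to use for this sub-frame, limited if needed."""
        # Largest potential error among the last `lag` samples, which covers
        # every past sample the predictor can read during this sub-frame.
        worst = max(self.err[-lag:])
        if gain > 1.0 and worst > self.threshold:
            gain = self.safe_gain
        self._update(lag, gain, subframe)
        return gain

    def _update(self, lag, gain, subframe):
        # Second long-term filter with a constant excitation of 1.
        for _ in range(subframe):
            self.err.append(1.0 + gain * self.err[-lag])
        del self.err[:-self.max_lag]   # keep only the most recent samples


monitor = TamingMonitor()
# A long, strongly voiced passage in which the coder keeps requesting 1.2.
gains = [monitor.choose_gain(40, 1.2) for _ in range(40)]
```

In this run the requested gain is accepted for the first fourteen sub-frames and is limited from the fifteenth onward, once the simulated error has crossed the threshold. The simulated error then stays bounded instead of growing geometrically.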
However, the solutions proposed by these known techniques to avoid the risk of saturation of the LTP filters in the presence of losses or transmission errors cause the following problems:
    Because the decision to modify the gain gp associated with long-term prediction is made a priori in the coder, it is not possible, after frames have been lost, to control completely the state of the decoder and its behavior, which by hypothesis are unknown to the coder. Consequently, the existing techniques can continue to cause audio deterioration on decoding in the event of transmission errors despite the decision taken by the coder to modify the gain.
    The limitation to 1 of the pitch gain gp associated with the techniques described above can lead to slight deterioration of quality, for example in attack phases, which normally generate gains greater than 1. The triggering threshold chosen is a compromise between quality and security. A low threshold would trigger limitation too often, causing unnecessary deterioration, especially in the absence of transmission errors. Conversely, a higher threshold would not guarantee sufficient protection in the event of high error rates.