In wideband, the bandwidth of the speech signal lies between 50 and 7000 Hz. Successive speech sequences, sampled at a predetermined sampling frequency, for example 16 kHz, are processed in a CELP-type coding device using coded-sequence-excited linear prediction (for example, ACELP: “algebraic-code-excited linear prediction”), well known to the person skilled in the art, and described in particular in Recommendation ITU-T G.729, version 3/96, entitled “Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear prediction (CS-ACELP)”. The main characteristics and operation of such a coder will now be briefly described while referring to FIG. 1, the person skilled in the art being able to refer for all useful purposes, for further details, to the above-mentioned Recommendation G.729.
The prediction coder CD, of the CELP type, is based on the model of code-excited linear predictive coding. The coder operates on voice super-frames corresponding, for example, to 20 ms of signal and each comprising 320 samples. The extraction of the linear prediction parameters, i.e. the coefficients of the linear prediction filter, also referred to as the short-term synthesis filter 1/A(z), is performed for each speech super-frame. Furthermore, each super-frame is subdivided into frames of 5 ms comprising 80 samples. At each frame, the voice signal is analyzed to extract therefrom the parameters of the CELP prediction model (i.e. in particular, a long-term excitation digital word vi extracted from an adaptive coded directory LTD, also dubbed “adaptive long-term dictionary”, an associated long-term gain Ga, a short-term excitation word cj extracted from a fixed coded directory STD, also dubbed “short-term dictionary”, and an associated short-term gain Gc).
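The per-super-frame extraction of the linear prediction coefficients can be sketched as follows. This is a minimal illustration using the classical autocorrelation method with the Levinson-Durbin recursion; the pure-Python form and the omission of windowing and bandwidth expansion are simplifications, not the recommendation's actual procedure.

```python
def autocorrelate(x, order):
    """r[k] = sum_n x[n] * x[n-k], for k = 0..order."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order."""
    a = [1.0] + [0.0] * order
    err = r[0]                               # prediction error energy
    for m in range(1, order + 1):
        acc = sum(a[j] * r[m - j] for j in range(m))
        k = -acc / err                       # reflection coefficient
        new_a = a[:]
        for j in range(1, m):
            new_a[j] = a[j] + k * a[m - j]
        new_a[m] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err
```

The recursion yields the coefficients of A(z) for the super-frame; the residual error energy decreases at each order as long as the reflection coefficients remain inside the unit interval.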
These parameters are thereafter coded and transmitted. At reception, these parameters serve, in a decoder, to recover the excitation parameters and the predictive filter parameters. The speech is then reconstructed by filtering this excitation stream in a short-term synthesis filter.
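The decoder-side reconstruction just described can be illustrated by the following sketch: the excitation stream is the gain-scaled sum of the long-term and short-term words, filtered through the short-term synthesis filter 1/A(z). The sign convention A(z) = 1 + a[1] z^-1 + ... is an assumption for illustration.

```python
def synthesize(v, c, ga, gc, a):
    """Reconstruct speech by filtering the excitation e = Ga*v + Gc*c
    through 1/A(z), with A(z) = 1 + a[1] z^-1 + ... (assumed convention)."""
    excitation = [ga * vi + gc * ci for vi, ci in zip(v, c)]
    out = []
    for n, e in enumerate(excitation):
        s = e - sum(a[k] * out[n - k]
                    for k in range(1, len(a)) if n - k >= 0)
        out.append(s)
    return out
```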
Whereas the adaptive dictionary LTD contains digital words representative of tonal lags representative of past excitations, the short-term dictionary STD is based on a fixed structure, for example of the stochastic type or of the algebraic type, using a model involving an interleaved permutation of Dirac pulses. In the case of an algebraic structure, the coded directory contains innovative excitations, also referred to as algebraic or short-term excitations; each vector contains a certain number of nonzero pulses, for example four, each of which may have the amplitude +1 or −1 at predetermined positions.
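An algebraic codevector of the kind just described can be illustrated as follows. The track layout (four interleaved position tracks with a stride of 4 over an 80-sample frame) is an assumption for illustration; the actual admissible positions depend on the codec.

```python
FRAME = 80                                             # samples per 5 ms frame
TRACKS = [list(range(t, FRAME, 4)) for t in range(4)]  # interleaved position tracks

def build_codevector(positions, signs):
    """Place four signed unit pulses, pulse k restricted to TRACKS[k]."""
    c = [0] * FRAME
    for k, (p, s) in enumerate(zip(positions, signs)):
        assert p in TRACKS[k] and s in (+1, -1)
        c[p] = s
    return c
```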
The processing means of the coder CD functionally includes first extraction means MEXT 1 intended to extract the long-term excitation word, and second extraction means MEXT 2 intended to extract the short-term excitation word. Functionally, these means are embodied for example in software fashion within a processor.
These extraction means comprise a predictive filter PF having a transfer function equal to 1/A(z), as well as a perceptual weighting filter PWF having a transfer function W(z). The perceptual weighting filter is applied to the signal in order to model the perception of the ear. Furthermore, the extraction means comprise means MSEM intended to perform a minimization of a mean square error. The linear prediction synthesis filter PF models the spectral envelope of the signal. The linear predictive analysis is performed every super-frame, in such a way as to determine the linear predictive filtering coefficients. The latter are converted into line spectrum pairs (LSP: “Line Spectrum Pairs”) and quantized by two-stage predictive vector quantization.
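The perceptual weighting filter W(z) mentioned above is commonly derived from A(z) by bandwidth expansion; the form W(z) = A(z/γ1)/A(z/γ2) and the factor values used below are assumptions typical of CELP coders, not taken from the text.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): a[k] -> gamma**k * a[k]."""
    return [coef * gamma ** k for k, coef in enumerate(a)]

def weighting_filter(a, g1=0.94, g2=0.6):
    """Numerator and denominator coefficient lists of W(z) = A(z/g1) / A(z/g2)."""
    return bandwidth_expand(a, g1), bandwidth_expand(a, g2)
```

Choosing g1 close to 1 and g2 smaller de-emphasizes the error near the formant peaks, where the ear tolerates more quantization noise.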
Each 20 ms speech super-frame is divided into four frames of 5 ms each containing 80 samples. The quantized LSP parameters are transmitted to the decoder once per super-frame whereas the long-term and short-term parameters are transmitted at each frame. The quantized and nonquantized coefficients of the linear prediction filter are used for the most recent frame of a super-frame, while the other three frames of the same super-frame use an interpolation of these coefficients. The open-loop tonal lag is estimated, for example, every two frames on the basis of the perceptually weighted voice signal. Next, the following operations are repeated at each frame.
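The per-frame interpolation just described can be sketched as follows; the linear weights are an assumption for illustration, as the text does not specify the interpolation rule.

```python
def interpolate_lsp(prev_lsp, curr_lsp, n_frames=4):
    """Return one LSP vector per frame: the last frame uses the current
    super-frame's LSPs, earlier frames a linear mix with the previous ones."""
    frames = []
    for f in range(1, n_frames + 1):
        w = f / n_frames                     # weight of the current LSPs
        frames.append([(1 - w) * p + w * c
                       for p, c in zip(prev_lsp, curr_lsp)])
    return frames
```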
The long-term target signal XLT is calculated by filtering the sampled speech signal s(n) through the perceptual weighting filter PWF. The zero-input response of the weighted synthesis filter PF, PWF is thereafter subtracted from the weighted voice signal so as to obtain a new long-term target signal. The impulse response of the weighted synthesis filter is calculated. A closed-loop tonal analysis using minimization of the mean square error is thereafter performed so as to determine the long-term excitation word vi and the associated gain Ga, using the target signal and the impulse response, by searching around the value of the open-loop tonal lag.
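The closed-loop tonal analysis can be sketched as follows: each candidate lag around the open-loop estimate selects a vector of past excitation, which is filtered by the impulse response h of the weighted synthesis filter, and the lag and gain minimizing the mean square error against the target are retained. The sketch assumes every candidate lag is at least one frame long, ignoring the periodic extension needed for shorter lags.

```python
def convolve(v, h):
    """Filter v by the impulse response h (zero initial state)."""
    return [sum(v[n - k] * h[k] for k in range(min(n + 1, len(h))))
            for n in range(len(v))]

def pitch_search(target, past_exc, h, t_open, delta=3):
    """Closed-loop search of the adaptive dictionary around the open-loop lag."""
    best = None
    for lag in range(t_open - delta, t_open + delta + 1):
        if lag < len(target):
            continue                         # sketch: skip lags shorter than the frame
        v = past_exc[-lag:][:len(target)]    # candidate long-term word v_i
        y = convolve(v, h)                   # filtered adaptive contribution
        num = sum(t * yi for t, yi in zip(target, y))
        den = sum(yi * yi for yi in y)
        if den > 0:
            gain = num / den                 # optimal long-term gain Ga
            err = sum(t * t for t in target) - num * gain
            if best is None or err < best[2]:
                best = (lag, gain, err)
    return best
```

For each lag the optimal gain has a closed form, so only the lag needs to be searched explicitly.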
The long-term target signal is thereafter updated by subtraction of the filtered contribution y of the adaptive coded directory LTD and this new short-term target signal XST is used during the exploration of the fixed coded directory STD to determine the short-term excitation word cj and the associated gain Gc. Here again, this closed-loop search is performed by minimization of the mean square error. Finally, the adaptive long-term dictionary LTD as well as the memories of the filters PF and PWF, are updated via the long-term and short-term excitation words thus determined.
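The target update and the fixed-dictionary search just described can be sketched as follows; the exhaustive loop over explicit codevectors stands in for the structured search that an algebraic dictionary permits in practice.

```python
def update_target(x_lt, y, ga):
    """Short-term target: x_ST[n] = x_LT[n] - Ga * y[n]."""
    return [x - ga * yi for x, yi in zip(x_lt, y)]

def codebook_search(x_st, codevectors, h):
    """Pick the word c_j and gain Gc minimizing the mean square error."""
    best = None
    for j, c in enumerate(codevectors):
        z = [sum(c[n - k] * h[k] for k in range(min(n + 1, len(h))))
             for n in range(len(c))]         # filtered codevector
        num = sum(x * zi for x, zi in zip(x_st, z))
        den = sum(zi * zi for zi in z)
        if den > 0:
            gain = num / den                 # optimal short-term gain Gc
            err = sum(x * x for x in x_st) - num * gain
            if best is None or err < best[2]:
                best = (j, gain, err)
    return best
```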
The quality of a CELP algorithm depends strongly on the richness of the short-term excitation dictionary STD, for example an algebraic excitation dictionary. Whereas the effectiveness of such an algorithm is unquestionable for narrowband signals (300-3400 Hz), problems arise in respect of wideband signals.
It has been observed that even with a very rich dictionary, the speech encoding algorithm produces two types of problems:
1) totally inadequate overall quality of reconstructed speech (the reconstructed speech lacks presence, the energy level is highly variable, the timbre of the voice is hardly recognizable, etc.); and
2) a reconstructed signal corrupted by three kinds of noise:
    a harmonic noise at high frequency (comb-like noise),
    a considerable high-frequency noise, such as a quantization noise, and
    a noise at low frequency (rumbling noise), such as a straw broom struck on the ground at regular intervals.
An improvement in the overall quality of the speech could be obtained by partial or total elimination of such noise.