Briefly, it will be recalled that a speech signal can be predicted from its recent past (for example from 8 to 12 samples at 8 kHz) using parameters assessed over short windows (10 to 20 ms in this example). These short-term predictive parameters, representing the vocal tract transfer function (for example for pronouncing consonants), are obtained by linear prediction coding (LPC) methods. A longer-term correlation is also used to determine the periodicities of voiced sounds (for example the vowels) resulting from the vibration of the vocal cords. This involves determining at least the fundamental frequency of the voiced signal, which typically varies from 60 Hz (low voice) to 600 Hz (high voice) according to the speaker. A long-term prediction (LTP) analysis is then used to determine the LTP parameters of a long-term predictor, in particular the inverse of the fundamental frequency, often called the “pitch period”. The number of samples in a pitch period is then given by the relationship Fe/F0 (or its integer part), where Fe is the sampling rate and F0 is the fundamental frequency.
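The relationship above can be sketched as follows (a minimal illustration; the function name is hypothetical):

```python
def pitch_period_samples(fe_hz: int, f0_hz: float) -> int:
    """Number of samples in a pitch period: the integer part of Fe / F0."""
    return int(fe_hz // f0_hz)

# At an 8 kHz sampling rate, a 100 Hz fundamental gives an 80-sample
# pitch period; a low 60 Hz voice gives 133 samples (integer part of 133.33).
```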
It will be recalled therefore that the long-term prediction LTP parameters, including the pitch period, represent the fundamental vibration of the speech signal (when it is voiced), while the short-term prediction LPC parameters represent the spectral envelope of this signal.
The set of these LPC and LTP parameters thus resulting from a speech coding is transmitted by blocks to a homologous decoder via one or more telecommunications networks so that the original speech can then be reconstructed.
Within the framework of the communication of such signals by blocks, the loss of one or more consecutive blocks can occur. By the term “block” is meant a succession of signal data which can be for example a frame in mobile radiocommunication, or also a packet for example in communication over internet protocol (IP) or others.
In mobile radiocommunication for example, most predictive synthesis coding techniques, in particular coding of the “code excited linear predictive” (CELP) type, propose solutions for the recovery of erased frames. The decoder is informed of the occurrence of an erased frame, for example by the transmission of frame erasure information originating from the channel decoder. The recovery of erased frames aims to extrapolate the parameters of the erased frame from one or more previous frames regarded as valid. Certain parameters manipulated or coded by the predictive coders have a high correlation between frames. Typically, this involves long-term prediction LTP parameters, for the voiced sounds for example, and short-term prediction LPC parameters. Due to this correlation, it is much more advantageous to reuse the parameters of the last valid frame in order to synthesize the erased frame than to use random, or even erroneous, parameters.
In standard fashion, for generating CELP excitation, the parameters of the erased frame are obtained as follows.
The LPC parameters of a frame to be reconstructed are obtained from the LPC parameters of the last valid frame, either by simply copying the parameters or by introducing a certain damping (a technique used, for example, in the G.723.1 standardized coder). Then, a voicing or a non-voicing is detected in the speech signal in order to determine a degree of harmonicity of the signal at the erased frame.
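The copy-with-damping step can be sketched as below. This is a hedged illustration, not the exact G.723.1 procedure: the per-order damping shown (a bandwidth-expansion-style factor g^i applied to coefficient a_i) is one common way to flatten the extrapolated spectral envelope.

```python
def extrapolate_lpc(last_valid_lpc, damping=0.99):
    """Copy the last valid frame's LPC coefficients, damping a_i by g**i
    so the spectral envelope of the reconstructed frame is flattened."""
    return [a * damping ** (i + 1) for i, a in enumerate(last_valid_lpc)]
```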
If the signal is non-voiced, an excitation signal can be randomly generated (by taking a code word from the past excitation, by slight damping of the gain of the past excitation, by random selection in the past excitation, or by using further transmitted codes which can be totally erroneous).
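One of the options above, random selection in the past excitation with slight gain damping, can be sketched as follows (function and parameter names are hypothetical):

```python
import random

def unvoiced_excitation(past_excitation, frame_len, gain_damping=0.98, seed=0):
    """Generate a non-voiced excitation by randomly selecting samples from
    the past excitation, with a slight damping of the gain."""
    rng = random.Random(seed)  # seeded here only to make the sketch reproducible
    return [gain_damping * rng.choice(past_excitation) for _ in range(frame_len)]
```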
If the signal is voiced, the pitch period (also called the “LTP delay”) is generally that calculated for the previous frame, optionally with a slight “jitter” (an increase in the value of the LTP delay for consecutive erased frames, the LTP gain being taken very close to or equal to 1). The excitation signal is therefore limited to the long-term prediction carried out from a past excitation.
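A minimal sketch of this long-term prediction from the past excitation, assuming the past excitation buffer is at least one LTP delay long (names are illustrative, not from any standard):

```python
def voiced_excitation(past_excitation, pitch_period, frame_len,
                      erased_count=0, jitter=1, ltp_gain=1.0):
    """Excitation limited to long-term prediction: e(n) = g * e(n - T),
    with the delay T slightly increased for consecutive erased frames."""
    delay = pitch_period + erased_count * jitter  # "jitter" grows the LTP delay
    buf = list(past_excitation)                   # assumes len(buf) >= delay
    for _ in range(frame_len):
        buf.append(ltp_gain * buf[-delay])
    return buf[len(past_excitation):]
```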
The means of concealment of the erased frames, at decoding, are generally strongly linked to the structure of the decoder and can be common to modules of this decoder, such as for example the signal synthesis module. These means also use intermediate signals available within the decoder, such as for example the past excitation signal stored during the processing of the valid frames preceding the erased frames.
Certain techniques used to conceal the errors produced by packets lost during the transport of data coded according to a time-type coding frequently rely on waveform substitution techniques. Such techniques aim to reconstitute the signal by selecting portions of the decoded signal before the lost period, and do not implement synthesis models. Smoothing techniques are also used to avoid the artefacts produced by the concatenation of different signals.
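The smoothing mentioned above is often realized as a simple cross-fade at the junction between the substituted portion and the following decoded signal. The sketch below is one plausible form of such smoothing, not a technique claimed by the source:

```python
def crossfade(tail, head):
    """Linearly cross-fade two equal-length segments to smooth the junction
    produced by concatenating different signal portions."""
    n = len(tail)
    return [(1 - (i + 0.5) / n) * tail[i] + ((i + 0.5) / n) * head[i]
            for i in range(n)]
```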
For the decoders operating on signals coded by transform coding, the techniques for reconstructing erased frames generally rely on the structure of the coding used. Certain techniques aim to regenerate the lost transformed coefficients from the values taken by these coefficients before the erasure.
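A very simple instance of regenerating lost transform coefficients from their pre-erasure values is to copy the last valid coefficients with a damping factor (a hypothetical sketch; actual schemes may be considerably more elaborate):

```python
def regenerate_coeffs(last_valid_coeffs, damping=0.9):
    """Replace the lost transform coefficients by the last valid ones,
    attenuated by a damping factor."""
    return [damping * c for c in last_valid_coeffs]
```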
Other techniques for concealment of the erased frames have been developed jointly with the channel coding. They make use of information provided by the channel decoder, for example information relating to the degree of reliability of the parameters received. It is noted here that conversely, the subject of the present invention does not presuppose the existence of a channel coder.
In Combescure et al.:
“A 16, 24, 32 kbit/s Wideband Speech Codec Based on ATCELP”, P. Combescure, J. Schnitzler, K. Fischer, R. Kirchherr, C. Lamblin, A. Le Guyader, D. Massaloux, C. Quinquis, J. Stegmann, P. Vary, ICASSP (1998) Conference Proceedings,
a proposal was made for the use of an erased-frame concealment method equivalent to that used in CELP coders for a transform coder.
The drawbacks of this method were the introduction of audible spectral distortions (“synthetic” voice, unwanted resonances, etc.). These drawbacks were due in particular to the use of poorly-controlled long-term synthesis filters (single harmonic component in voiced sounds, use of portions of the past residual signal in non-voiced sounds). Moreover, the energy control is carried out here at the excitation signal level and the energy target of this signal is kept constant for the whole duration of the erasure, which also generates troublesome audible artefacts.
In FR-2.813.722, a technique is proposed for concealment of the erased frames which does not generate greater distortion at higher error rates and/or for longer erased intervals. This technique aims to avoid the excess periodicity for the voiced sounds and to improve control of the generation of the unvoiced excitation. To this end, the excitation signal (if voiced) is regarded as the sum of two signals:
a highly harmonic component whose band is limited to the low frequencies of the total spectrum, and
another, less harmonic, component limited to the higher frequencies.
The highly harmonic component is obtained by LTP filtering. The second component is also obtained by an LTP filtering, made non-periodic by the random modification of its fundamental period.
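The two-component idea can be sketched as follows. This is a hedged, simplified illustration of the principle only: in the actual technique each component is band-limited (low frequencies for the harmonic part, high frequencies for the less harmonic part), whereas here the two contributions are simply averaged; all names are hypothetical.

```python
import random

def ltp_component(past_excitation, period, frame_len):
    """Long-term prediction from the past excitation with a given period
    (assumes len(past_excitation) >= period)."""
    buf = list(past_excitation)
    for _ in range(frame_len):
        buf.append(buf[-period])
    return buf[len(past_excitation):]

def two_component_excitation(past_excitation, period, frame_len,
                             max_jitter=2, seed=0):
    """Sum of a highly harmonic component (fixed LTP period) and a less
    harmonic component whose LTP period is randomly modified."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    harmonic = ltp_component(past_excitation, period, frame_len)
    jittered = max(1, period + rng.randint(-max_jitter, max_jitter))
    noisy = ltp_component(past_excitation, jittered, frame_len)
    return [0.5 * h + 0.5 * x for h, x in zip(harmonic, noisy)]
```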