The present invention relates to the processing of digital audio signals (particularly speech signals).
It relates to a coding/decoding system suitable for the transmission/reception of such signals. More particularly, the present invention relates to a processing on reception which makes it possible to improve the quality of the decoded signals when data blocks are lost.
Different techniques exist for digitally converting and compressing a digital audio signal. The most common techniques are:
        waveform encoding methods such as pulse code modulation (PCM) and adaptive differential pulse code modulation (ADPCM),
        analysis-by-synthesis coding methods such as code excited linear prediction (CELP) coding, and
        sub-band perceptual coding methods or transform coding.
These techniques process the input signal sequentially, sample by sample (PCM or ADPCM) or by blocks of samples called “frames” (CELP and transform coding). Briefly, it will be recalled that a speech signal can be predicted from its recent past (for example from 8 to 12 samples at 8 kHz) using parameters assessed over short windows (10 to 20 ms in this example). These short-term predictive parameters, representing the vocal tract transfer function (for example for pronouncing consonants), are obtained by linear prediction coding (LPC) methods. There is also a longer-term correlation associated with the quasi-periodicities of speech (for example voiced sounds such as the vowels), which are due to the vibration of the vocal cords. Exploiting this correlation involves determining at least the fundamental frequency of the voice signal, which typically varies from 60 Hz (low voice) to 600 Hz (high voice) according to the speaker. A long-term prediction (LTP) analysis is then used to determine the LTP parameters of a long-term predictor, in particular the inverse of the fundamental frequency, often called “pitch period”. The number of samples in a pitch period is then defined by the relationship Fe/F0 (or its integer part), where:
        Fe is the sampling rate, and
        F0 is the fundamental frequency.
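By way of illustration only, the relationship Fe/F0 can be evaluated for the extreme fundamental frequencies mentioned above. The following is a minimal sketch; the function name `pitch_period_samples` is hypothetical and does not come from any recommendation.

```python
def pitch_period_samples(fe_hz: float, f0_hz: float) -> int:
    """Return the integer part of Fe/F0, i.e. the pitch period in samples.

    fe_hz: sampling rate Fe in Hz.
    f0_hz: fundamental frequency F0 in Hz.
    (Hypothetical helper, introduced for illustration only.)
    """
    if fe_hz <= 0 or f0_hz <= 0:
        raise ValueError("Fe and F0 must be strictly positive")
    return int(fe_hz // f0_hz)


if __name__ == "__main__":
    # Range of fundamental frequencies quoted in the text: 60 Hz to 600 Hz.
    for fe in (8000, 16000):
        for f0 in (60, 600):
            print(f"Fe={fe} Hz, F0={f0} Hz -> {pitch_period_samples(fe, f0)} samples")
```

At a sampling rate of 8 kHz, a pitch period therefore spans roughly 13 samples (high voice) to 133 samples (low voice).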
It will be recalled therefore that the long-term prediction LTP parameters, including the pitch period, represent the fundamental vibration of the speech signal (when voiced), while the short-term prediction LPC parameters represent the spectral envelope of this signal.
In certain coders, the set of these LPC and LTP parameters thus resulting from a speech coding can be transmitted by blocks to a homologous decoder via one or more telecommunications networks so that the original speech can then be reconstructed.
However, reference will be made hereafter (by way of example) to the G.722 coding system at 48, 56 and 64 kbit/s, standardized by the ITU-T for the wideband transmission of speech signals (which are sampled at 16 kHz). The G.722 coder has an ADPCM coding scheme in two sub-bands obtained by a quadrature mirror filter bank (QMF). For further details, reference can usefully be made to the text of the G.722 recommendation.
FIG. 1 of the state of the art shows the coding and decoding structure according to the G.722 recommendation. Blocks 100 to 103 represent the transmission QMF filter bank (spectral separation into high 102 and low 100 frequencies and sub-sampling 101 and 103), applied to the input signal Si. The next blocks 104 and 105 correspond respectively to the low-band and high-band ADPCM coders. The low-band output of the ADPCM coder is specified by a mode value of 0, 1, or 2, indicating respectively a 6, 5 or 4-bit output per sample, while the high-band output of the ADPCM coder is fixed (two bits per sample). Within the decoder are the equivalent ADPCM decoding blocks (blocks 106 and 107), the outputs of which are combined in the QMF reception filter bank (over-sampling 108 and 110, inverse filters 109 and 111, and merging of the high and low frequency bands 112) in order to generate the synthesis signal So.
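As a simple check of the figures given above, the three bit rates can be recovered from the number of bits per sample in each sub-band. The sketch below assumes that each QMF branch is sub-sampled by a factor of two, so that each sub-band runs at 8 kHz for a 16 kHz input; the mode numbering (0, 1, 2) follows the text above, and the names used are hypothetical.

```python
INPUT_RATE_HZ = 16000                  # wideband input sampling rate
SUBBAND_RATE_HZ = INPUT_RATE_HZ // 2   # assumption: sub-sampling by two per band

# Bits per sample in the low band for each mode value, as listed in the text.
LOW_BAND_BITS = {0: 6, 1: 5, 2: 4}
HIGH_BAND_BITS = 2                     # fixed: two bits per sample


def g722_bitrate(mode: int) -> int:
    """Total bit rate in bit/s for the given mode value (hypothetical helper)."""
    bits_per_sample_pair = LOW_BAND_BITS[mode] + HIGH_BAND_BITS
    return SUBBAND_RATE_HZ * bits_per_sample_pair


if __name__ == "__main__":
    for mode in sorted(LOW_BAND_BITS):
        print(f"mode {mode}: {g722_bitrate(mode) // 1000} kbit/s")
```

Under these assumptions, the 6, 5 and 4-bit low-band outputs correspond to the 64, 56 and 48 kbit/s rates respectively.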
A general problem examined here relates to correcting the loss of blocks on decoding.
In fact, the bitstream output from the coding is generally formatted in binary blocks for transmission over many network types. These are called for example “internet protocol (IP) packets” for blocks transmitted via the Internet network, “frames” for blocks transmitted over asynchronous transfer mode (ATM) networks, or others. The blocks transmitted after coding can be lost for various reasons:
        if a network router is overloaded and dumps its queue,
        if the block is received with a delay (therefore not taken into account) during a continuous-flow decoding in real time, or
        if a received block is corrupted (for example if its CRC parity code is not verified).
When a loss of one or more consecutive blocks occurs, the decoder must reconstruct the signal without information on the lost or erroneous blocks. It relies on the information previously decoded from the valid blocks received. This problem, called “correction of lost blocks” (or also, hereafter, “correction of erased frames”) is in fact more general than simply extrapolating missing information, as the loss of frames often causes a loss of synchronization between coder and decoder, in particular when the latter are predictive, as well as problems of continuity between the extrapolated information and the decoded information after a loss. The correction of erased frames therefore also encompasses status information restoration and re-convergence techniques and others.
Annex I of the ITU-T G.711 recommendation describes a correction of erased frames suitable for PCM coding. As PCM coding is not predictive, the correction of frame losses therefore simply amounts to extrapolating the missing information and ensuring the continuity between a reconstructed frame and the correctly received frames, following a loss. The extrapolation is implemented by repetition of the past signal in a manner synchronous with the fundamental frequency (or inversely, “pitch period”), i.e. simply by repeating the pitch periods. The continuity is ensured by a smoothing or cross-fading between received samples and extrapolated samples.
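The principle of extrapolation by pitch-period repetition followed by cross-fading can be sketched as follows. This is a deliberately simplified illustration and not the procedure of Annex I of the G.711 recommendation: the function names are hypothetical, the pitch period is assumed to be known already, and the linear cross-fade weights are an assumption made for the example.

```python
from typing import List, Sequence


def extrapolate_by_pitch_repetition(
    history: Sequence[float], pitch_period: int, n_samples: int
) -> List[float]:
    """Fill a lost frame by cyclically repeating the last pitch period.

    history: previously decoded samples (most recent sample last).
    pitch_period: pitch period in samples, assumed already estimated.
    n_samples: number of samples to extrapolate.
    """
    if not 0 < pitch_period <= len(history):
        raise ValueError("pitch_period must be in 1..len(history)")
    last_period = history[-pitch_period:]
    return [last_period[i % pitch_period] for i in range(n_samples)]


def cross_fade(extrapolated: Sequence[float], received: Sequence[float]) -> List[float]:
    """Blend extrapolated samples into received samples.

    Assumption: linear weights rising from the extrapolated signal
    towards the received signal over the overlap.
    """
    n = min(len(extrapolated), len(received))
    out = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight applied to the received samples
        out.append((1.0 - w) * extrapolated[i] + w * received[i])
    return out


if __name__ == "__main__":
    past = [0, 1, 2, 3, 1, 2, 3, 4]  # toy signal with a 4-sample period
    print(extrapolate_by_pitch_repetition(past, pitch_period=4, n_samples=10))
    print(cross_fade([0, 0, 0], [4, 4, 4]))
```

In this sketch, the extrapolated frame is simply a cyclic copy of the last pitch period, which makes explicit why the method presupposes that the signal preceding the loss is representative of the lost frame.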
In the document:
“A packet loss concealment method using pitch waveform repetition and internal state update on the decoded speech for the sub-band ADPCM wideband speech codec”, M. Serizawa and Y. Nozawa, IEEE Speech Coding Workshop, pages 68-70 (2002), a correction of erased frames was proposed for the G.722 standardized coder/decoder by extrapolating a lost frame using a pitch-period repetition algorithm (repetition which can be similar to that described in Annex I of the G.711 recommendation). In order to update G.722 coder states (filter memory and pitch adaptation memory), the frame thus extrapolated is divided into two sub-bands which are re-encoded by ADPCM coding.
However, such techniques for the correction of frame losses by repetition of pitch periods can only operate correctly if the past signal is stationary or at least cyclostationary. They therefore rely on the implicit hypothesis that the signal associated with the lost frame (that must be extrapolated) is “similar” to the signal decoded up to the frame loss. In the case of the speech signal, this stationarity hypothesis is only strictly valid for sounds such as a portion of vowels to be repeated. For example, a vowel “a” can be repeated several times (which gives “aaaa, etc.”) without causing hearing discomfort. However, a speech signal also comprises sounds called “transitories” (non-stationary sounds, typically including the attacks (beginnings) of vowels and the sounds called “plosives”, which correspond to the short consonants such as “p”, “b”, “d”, “t”, “k”). Thus, if for example a frame is lost immediately after the sound “t”, a correction of the loss by simple repetition will generate a burst of “t”s (“t-t-t-t-t”) when several successive frames are lost (for example five consecutive losses), which is very unpleasant to the ear.
FIGS. 2a and 2b illustrate this acoustic effect in the case of a wideband signal encoded by a coder according to the G.722 recommendation. More particularly, FIG. 2a shows a speech signal decoded on an ideal channel (without frame loss). In the example shown, this signal corresponds to the French word “temps”, divided into two French phonemes: /t/ then /an/. The vertical dotted lines show the boundaries between frames. The length of the frames under consideration here is of the order of 10 ms. FIG. 2b shows the signal decoded according to a technique similar to that of Serizawa et al. cited above, when a loss of frames immediately follows the phoneme /t/. FIG. 2b clearly shows the problem of repetition of the past signal. It is noted that the phoneme /t/ is repeated in the extrapolated frame. It is also present in the next frame(s) because, in the example shown, the extrapolation is slightly extended after a loss in order to carry out a cross-fading with the decoding under normal conditions (i.e. in the presence of useful data in the received signal).
The problem of repetition of plosives has apparently never been mentioned in the known prior art.