Various techniques exist for converting an audio signal into digital form and compressing the resulting digital signal. The most common techniques are:
- waveform coding schemes, such as PCM (for "Pulse Code Modulation") coding and ADPCM (for "Adaptive Differential Pulse Code Modulation") coding;
- parametric coding schemes based on analysis by synthesis, such as CELP (for "Code Excited Linear Prediction") coding; and
- sub-band or transform-based perceptual coding schemes.
These techniques process the input signal either sequentially, sample by sample (PCM or ADPCM), or in blocks of samples termed "frames" (CELP and transform-based coding). For all these coders, the coded values are thereafter transformed into a binary train which is transmitted over a transmission channel.
Depending on the quality of this channel and the type of transport, disturbances may affect the transmitted signal and produce errors in the binary train received by the decoder. These errors may arise in an isolated manner in the binary train but very frequently occur in bursts; an entire packet of bits, corresponding to a complete portion of the signal, is then erroneous or not received. This type of problem is encountered, for example, with transmissions over mobile networks. It is also encountered in transmissions over packet networks, and in particular over networks of Internet type.
When the transmission system or the modules responsible for reception make it possible to detect that the data received are highly erroneous (for example on mobile networks), or that a block of data has not been received or is corrupted by binary errors (the case of packet transmission systems, for example), procedures for concealing the errors are then implemented.
The current frame to be decoded is then declared erased (“bad frame”). These procedures make it possible to extrapolate at the decoder the samples of the missing signal on the basis of the signals and data emanating from the previous frames.
Such techniques have been implemented mainly in the case of parametric and predictive coders (techniques of recovery/concealment of erased frames). They make it possible to greatly limit the subjective degradation of the signal perceived at the decoder in the presence of erased frames. These algorithms rely on the technique used for the coder and the decoder, and in fact constitute an extension of the decoder. The objective of devices for concealing erased frames is to extrapolate the parameters of the erased frame on the basis of the last previous frame(s) considered to be valid.
Certain parameters manipulated or coded by predictive coders exhibit a high inter-frame correlation. This is the case for the LPC (for "Linear Predictive Coding") parameters, which represent the spectral envelope, and the LTP (for "Long Term Prediction") parameters, which represent the periodicity of the signal (for voiced sounds, for example). On account of this correlation, it is much more advantageous to reuse the parameters of the last valid frame to synthesize the erased frame than to use erroneous or random parameters.
Within the context of a CELP decoding, the parameters of the erased frame are conventionally obtained as follows.
The LPC parameters of a frame to be reconstructed are obtained on the basis of the LPC parameters of the last valid frame, either by simply copying the parameters or by introducing a certain damping (a technique used, for example, in the G.723.1 standardized coder). Thereafter, voicing or non-voicing of the speech signal is detected so as to determine a degree of harmonicity of the signal at the level of the erased frame.
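Such an extrapolation of the spectral envelope can be sketched as follows. The LSF representation, the damping factor and the long-term mean used here are illustrative assumptions for the example, not values taken from any particular standard.

```python
import numpy as np

# Hedged sketch: conceal the LPC envelope of an erased frame by reusing the
# last valid frame's parameters (here in LSF form), pulled slightly toward a
# long-term mean so the spectrum flattens over consecutive erased frames.
# ALPHA and the vectors below are illustrative values only.
ALPHA = 0.9  # per-frame damping factor (assumption)

def conceal_lsf(last_valid_lsf, lsf_mean):
    """Extrapolated LSF parameters for an erased frame."""
    return ALPHA * np.asarray(last_valid_lsf) + (1.0 - ALPHA) * np.asarray(lsf_mean)

# usage: repeated application over consecutive erased frames drifts the
# envelope toward the mean
lsf = conceal_lsf([0.2, 0.5, 0.9], [0.3, 0.6, 0.8])
```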
If the signal is unvoiced, an excitation signal can be generated in a random manner (by randomly drawing a code word, by slightly damping the gain of the past excitation, by random selection within the past excitation, or else by using transmitted codes which may be totally erroneous).
If the signal is voiced, the pitch period (also called the "LTP lag") is generally the one calculated for the previous frame, optionally with a slight "jitter" (an increase in the value of the LTP lag for consecutive erroneous frames), the LTP gain being taken very close to 1 or equal to 1. The excitation signal is therefore limited to the long-term prediction performed on the basis of a past excitation.
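The two branches above can be sketched together as follows. The gain values and the one-sample jitter step are illustrative assumptions, not values from any particular coder.

```python
import numpy as np

def conceal_excitation(past_exc, frame_len, voiced, pitch_lag, n_erased, rng):
    """Sketch of excitation extrapolation for the n_erased-th consecutive
    erased frame. Gains and jitter step are illustrative assumptions."""
    if voiced:
        # Long-term prediction from the past excitation: LTP gain taken
        # close to 1, lag increased by one sample per consecutive erased
        # frame ("jitter").
        lag = pitch_lag + (n_erased - 1)
        gain = 0.98
        buf = list(past_exc)
        for _ in range(frame_len):
            buf.append(gain * buf[-lag])
        return np.array(buf[-frame_len:])
    # Unvoiced: random selection within the past excitation, damped gain.
    gain = 0.98 ** n_erased
    idx = rng.integers(0, len(past_exc), size=frame_len)
    return gain * np.asarray(past_exc)[idx]

# usage: voiced extrapolation of one frame from a short past excitation
exc = conceal_excitation([1.0, 2.0, 3.0, 4.0], frame_len=4, voiced=True,
                         pitch_lag=2, n_erased=1, rng=np.random.default_rng(0))
```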
The complexity of calculating this type of extrapolation of erased frames is generally comparable with that of a decoding of a valid frame (or “good frame”): the parameters estimated on the basis of the past, and optionally slightly modified, are used in place of the decoding and inverse quantization of the parameters, and then the reconstructed signal is synthesized in the same manner as for a valid frame using the parameters thus obtained.
In a hierarchical coding structure, using a technique of CELP type for core coding and a transform-based coding for coding the error signal, it may be beneficial to use the time shift generated by this hierarchical decoding system for erased frame concealment.
FIG. 1a illustrates the hierarchical coding of the CELP frames C0 to C5 and the transforms M1 to M5 applied to these frames.
During the transmission of these frames to a corresponding decoder, the hatched frames C3 and C4 and the transforms M3 and M4 are erased.
Thus, at the decoder, with reference to FIG. 1b, the line referenced 10 corresponds to the reception of the frames, the line referenced 11 corresponds to the CELP synthesis and the line referenced 12 corresponds to the total synthesis after MDCT transform.
It may be noted that during the reception of frame 1 (CELP coding C1 and transform-based coding M1), the decoder synthesizes the CELP frame C1 which will be used to calculate the total synthesis signal for the following frame, and calculates the total synthesis signal for the current frame O1 (line 12) on the basis of the CELP synthesis C0, of the transform M0 and of the transform M1. This additional delay in the total synthesis is well known within the context of transform-based coding.
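This delay can be illustrated with a toy overlap-add scheme; the transform step itself is omitted, and the sine window and frame length below are arbitrary choices for the illustration, not the actual codec windows.

```python
import numpy as np

# Toy illustration of the one-frame delay of 50%-overlap windowed synthesis.
# A sine analysis window times a sine synthesis window gives sin^2, and
# sin^2(t) + cos^2(t) = 1, so interior samples reconstruct exactly, but only
# once BOTH windows covering them have been processed, hence the extra frame
# of delay in the total synthesis.
L = 4                                         # frame length (illustrative)
w = np.sin(np.pi * (np.arange(2 * L) + 0.5) / (2 * L))

x = np.arange(1.0, 4 * L + 1.0)               # 4 frames of input
out = np.zeros_like(x)
for k in range(len(x) // L - 1):              # windows of length 2L, hop L
    blk = w * x[k * L:k * L + 2 * L]          # analysis windowing
    out[k * L:k * L + 2 * L] += w * blk       # synthesis windowing + overlap-add

# out[L:3L] equals x[L:3L]: each output frame also needs the next window,
# so the total synthesis lags the received frames by one frame
```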
In this case, in the presence of errors in the binary train, the decoder operates as follows.
Upon the first error in the binary train, the decoder contains in memory the CELP synthesis of the previous frame. Thus in FIG. 1b, when frame 3 (C3+M3) is erroneous, the decoder uses the CELP synthesis C2 decoded at the previous frame.
The replacement of the erroneous frame (C3) is necessary so as to generate the following output (O4); to do this, a technique for concealing erased frames, also called FEC (for "Frame Erasure Concealment"), is used, as described for example in the document entitled "Method of packet errors cancellation suitable for any speech and sound compression scheme" by B. Kovesi and D. Massaloux, ISIVC 2004.
This time shift between the detection of an erroneous frame and the need to synthesize the corresponding signal makes it possible to use techniques for transmitting error correction information for the previous CELP frame, as described in "Efficient frame erasure concealment in predictive speech codecs using glottal pulse resynchronisation" by T. Vaillancourt et al., published at ICASSP 2007.
In this document, a valid frame comprises information about the previous frame for improving the concealment of the erased frames and the resynchronization between the erased frames and the valid frames.
Thus, in FIG. 1b, upon reception of frame 5 (C5+M5) after the detection of two erroneous frames (frame 3 and 4), the decoder receives, in the binary train of frame 5, information about the nature of the previous frame (for example classification indication, information about the spectral envelope). Classification information is understood to mean information about voicing, non-voicing, the presence of attacks, etc.
This type of information in the binary train is described, for example, in the document "Wideband Speech Coding Advances in VMR-WB Standard" by M. Jelinek and R. Salami, published in IEEE Transactions on Audio, Speech and Language Processing, May 2007.
Thus, the decoder synthesizes the previous erroneous frame (frame 4) using a technique for concealing erased frames which benefits from the information received with frame 5, before synthesizing the CELP signal C5.
Moreover, hierarchical coding techniques have been developed for decreasing the time shift between the two coding stages. Thus, there exist low-delay transforms which decrease the time shift to half a frame. Such is the case, for example, with the use of a window called "Low-Overlap", presented in "Real-Time Implementation of the MPEG-4 Low-Delay Advanced Audio Coding Algorithm (AAC-LD) on Motorola's DSP56300" by J. Hilpert et al., published at the 108th AES Convention in February 2000.
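The delay argument can be sketched with an amplitude-complementary overlap-add window. This is only a schematic construction for the illustration; it is not the actual AAC-LD window, and the MDCT machinery is omitted.

```python
import numpy as np

# Schematic "Low-Overlap" style window of length 2L for overlap-add at hop L:
# zero parts at both ends, short sin^2 edges of length ov, flat top. The
# sizes are arbitrary; only the delay argument matters here.
L, ov = 8, 2
z = (L - ov) // 2                             # leading/trailing zero parts
rise = np.sin(np.pi * (np.arange(ov) + 0.5) / (2 * ov)) ** 2
w = np.concatenate([np.zeros(z), rise,
                    np.ones(2 * L - 2 * z - 2 * ov),
                    1.0 - rise, np.zeros(z)])

x = np.arange(1.0, 5 * L + 1.0)
out = np.zeros_like(x)
for k in range(4):                            # windows of length 2L, hop L
    out[k * L:k * L + 2 * L] += w * x[k * L:k * L + 2 * L]

# Interior samples reconstruct exactly, yet the window's non-zero support
# ends z samples early: the look-ahead needed beyond the current frame is
# only L - z = (L + ov) / 2 samples, i.e. close to half a frame when the
# overlap region ov is small.
```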
In these low-delay transform techniques, the time shift being less than one frame, it is no longer possible to benefit from the information of the valid current frame to generate the missing samples of an erased frame, as in the previously described techniques. The quality of the signal in the case of erroneous frames is therefore lower.
There therefore exists a requirement to improve the quality of the concealment of erased frames in a low-delay hierarchical decoding system without however introducing additional time delay.
The present invention improves the situation.