Various techniques exist for converting an audio-frequency signal into digital form and compressing it. The most common are:
waveform coding schemes, such as PCM (for “Pulse Code Modulation”) coding and ADPCM (for “Adaptive Differential Pulse Code Modulation”) coding,
parametric coding schemes based on analysis by synthesis such as CELP (for “Code Excited Linear Prediction”) coding, and
sub-band or transform-based perceptual coding schemes.
These techniques process the input signal sequentially, either sample by sample (PCM or ADPCM) or in blocks of samples called “frames” (CELP and transform coding). For all these coders, the coded values are then converted into a bitstream, which is transmitted over a transmission channel.
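As an illustration of frame-based processing, the sketch below cuts a sampled signal into fixed-length frames. The 8 kHz sampling rate and 5 ms frame length are example values, and `split_into_frames` is a hypothetical helper, not part of any codec. A sample-by-sample coder such as PCM corresponds to a frame of a single sample.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms):
    """Split samples into consecutive, non-overlapping frames.

    Trailing samples that do not fill a whole frame are dropped; a real
    coder would instead keep them buffered for the next call.
    """
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# 100 ms of signal at 8 kHz gives 800 samples, i.e. 20 frames of 5 ms (40 samples).
signal = list(range(800))
frames = split_into_frames(signal, 8000, 5)
print(len(frames), len(frames[0]))  # 20 40
```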
Depending on the quality of this channel and the type of transport, disturbances may affect the transmitted signal and produce errors in the bitstream received by the decoder. These errors may occur in isolation, but very frequently they occur in bursts: an entire packet of bits corresponding to a complete portion of the signal is then erroneous or not received. This type of problem arises, for example, in transmissions over mobile networks. It also arises in transmissions over packet networks, in particular Internet-type networks.
When the transmission system or the receiving modules can detect that the received data are highly erroneous (for example on mobile networks), or that a block of data has not been received or is corrupted by bit errors (for example in packet transmission systems), error concealment procedures are applied.
The current frame to be decoded is then declared erased (“bad frame”). These procedures allow the decoder to extrapolate the samples of the missing signal from the signals and data of the previous frames.
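The decoder-side control flow can be sketched as follows, assuming the transport layer marks each erased frame with `None`. `decode_frame` and `conceal_frame` are hypothetical callbacks standing in for a real codec's decoder and concealment routine; the usage example plugs in plain frame repetition, the crudest possible extrapolation, purely to exercise the loop.

```python
def decode_stream(packets, decode_frame, conceal_frame):
    """Decode a packet sequence, extrapolating frames declared erased.

    packets: sequence in which None marks an erased ("bad") frame.
    decode_frame(packet) -> frame: normal decoding of a valid frame.
    conceal_frame(history) -> frame: extrapolation from the frames output so far.
    """
    output = []
    for packet in packets:
        if packet is None:
            frame = conceal_frame(output)  # bad frame: extrapolate from the past
        else:
            frame = decode_frame(packet)   # good frame: normal decoding
        output.append(frame)
    return output


# Toy codec: "decoding" returns the payload; concealment repeats the last frame
# (or outputs silence if nothing has been decoded yet).
packets = [[1, 2], [3, 4], None, [5, 6]]
result = decode_stream(packets, lambda p: p, lambda h: h[-1] if h else [0, 0])
print(result)  # [[1, 2], [3, 4], [3, 4], [5, 6]]
```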
Such techniques (recovery or concealment of erased frames) have been implemented mainly for parametric and predictive coders. They greatly limit the subjective degradation perceived at the decoder in the presence of erased frames. These algorithms depend on the technique used by the coder and decoder, and in fact constitute an extension of the decoder. The purpose of an erased-frame concealment device is to extrapolate the parameters of the erased frame from the last previous frame(s) considered valid.
Certain parameters manipulated or coded by predictive coders exhibit a high inter-frame correlation. This is the case for the LPC (for “Linear Predictive Coding”) parameters, which represent the spectral envelope, and for the LTP (for “Long Term Prediction”) parameters, which represent the periodicity of the signal (for voiced sounds, for example). Because of this correlation, it is much better to reuse the parameters of the last valid frame to synthesize the erased frame than to use erroneous or random parameters.
For CELP coding, the parameters of the erased frame are conventionally obtained as follows.
The LPC parameters of the frame to be reconstructed are obtained from the LPC parameters of the last valid frame, either by simply copying them or by introducing some damping (a technique used, for example, in the G.723.1 standardized coder). The speech signal is then classified as voiced or unvoiced, so as to determine the degree of harmonicity of the signal in the erased frame.
If the signal is unvoiced, an excitation signal can be generated randomly (by drawing a codeword from the past excitation, by slightly damping the gain of the past excitation, by random selection from the past excitation, or by using the transmitted codes even though they may be totally erroneous).
If the signal is voiced, the pitch period (also called the “LTP lag”) is generally the one calculated for the previous frame, optionally with a slight “jitter” (the LTP lag is increased over consecutive erased frames, with the LTP gain set very close to or equal to 1). The excitation signal is therefore limited to the long-term prediction performed from the past excitation.
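The three extrapolation rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of G.723.1 or of any other standard: the LPC damping is implemented here as bandwidth expansion (multiplying a_i by gamma**i, which pulls the poles of 1/A(z) toward the origin, so a stable filter stays stable), the unvoiced excitation is white noise scaled to the level of the past excitation, and the voiced excitation repeats the past excitation with period `lag`. The factors `gamma` and `attenuation` are illustrative values.

```python
import math
import random


def extrapolate_lpc(last_lpc, n_erased, gamma=0.98):
    """LPC coefficients [a_1..a_p] for the n-th consecutive erased frame.

    gamma = 1 is a plain copy; gamma < 1 damps a_i by gamma**i per erased frame.
    """
    g = gamma ** n_erased
    return [a * g ** (i + 1) for i, a in enumerate(last_lpc)]


def unvoiced_excitation(past_exc, frame_len, attenuation=0.9, seed=None):
    """White noise scaled to the past excitation's RMS level, then attenuated."""
    rng = random.Random(seed)
    past_rms = math.sqrt(sum(x * x for x in past_exc) / len(past_exc))
    noise = [rng.gauss(0.0, 1.0) for _ in range(frame_len)]
    noise_rms = math.sqrt(sum(x * x for x in noise) / frame_len)
    scale = attenuation * past_rms / noise_rms if noise_rms > 0 else 0.0
    return [scale * x for x in noise]


def voiced_excitation(past_exc, frame_len, lag, ltp_gain=1.0):
    """Long-term prediction e[n] = ltp_gain * e[n - lag] from the past excitation."""
    if not 1 <= lag <= len(past_exc):
        raise ValueError("lag must lie in 1..len(past_exc)")
    buf = list(past_exc)
    start = len(buf)
    for n in range(frame_len):
        # When lag < frame_len this reuses samples generated in this frame.
        buf.append(ltp_gain * buf[start + n - lag])
    return buf[start:]


print(extrapolate_lpc([1.0, 1.0], 1, gamma=0.5))     # [0.5, 0.25]
print(voiced_excitation([1.0, 2.0, 3.0], 7, lag=3))  # [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0]
```

To add jitter, a caller could pass `lag + k` for the k-th consecutive erased frame; how the increase is applied in practice is an assumption here.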
The computational complexity of this type of extrapolation is generally comparable to that of decoding a valid frame (“good frame”): the parameters estimated from the past, optionally slightly modified, replace the decoding and inverse quantization of the parameters, and the signal is then synthesized with these parameters in the same way as for a valid frame.
Other types of coding do not allow an erased frame to be extrapolated by extending the decoder with parameters estimated from the past. This is the case, for example, for time-domain PCM coding, which codes the signal sample by sample without resorting to a speech prediction model: no parameter is directly available to the decoder for performing the extrapolation.
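To make the contrast concrete, here is G.711 μ-law expansion as it is commonly implemented in reference-style code, scaled to 16-bit sample values. Each 8-bit code is mapped to one sample on its own, and nothing in the decoder describes a spectral envelope or a pitch period that a concealment routine could reuse. The function name is illustrative; the ITU-T G.711 text remains the normative reference.

```python
def ulaw_decode(code):
    """Expand one 8-bit G.711 mu-law code to a linear sample (16-bit scale)."""
    code = ~code & 0xFF             # mu-law codes are transmitted bit-inverted
    sign = code & 0x80
    exponent = (code >> 4) & 0x07   # segment number
    mantissa = code & 0x0F          # step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84  # remove the bias of 132
    return -magnitude if sign else magnitude


# Decoding is a stateless per-sample mapping: no LPC or LTP parameters exist.
print([ulaw_decode(c) for c in (0x00, 0x7F, 0x80, 0xFF)])  # [-32124, 0, 32124, 0]
```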
To extrapolate erased frames with the same performance as for parametric coders, the erased-frame concealment algorithm must therefore first estimate the extrapolation parameters itself from the past decoded signal. This typically requires short-term (LPC) and long-term (LTP) correlation analyses and, optionally, a classification of the signal (voiced, unvoiced, plosive, etc.), which considerably increases the calculation load. These analyses are described, for example, in “Method of packet errors cancellation suitable for any speech and sound compression scheme” by B. Kovesi and D. Massaloux, ISIVC-2004, International Symposium on Image/Video Communications over fixed and mobile networks, July 2004. In the technique described there, the method for concealing an erased frame consists of a first, analysis part and a second, extrapolation part that produces the missing samples of the signal corresponding to the erased frame.
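The two correlation analyses mentioned above can be sketched with textbook methods: short-term analysis as autocorrelation followed by the Levinson-Durbin recursion, and long-term analysis as a search for the lag that maximizes the normalized correlation between the past signal and a delayed copy of itself. This is not the algorithm of the cited paper: it omits analysis windowing, lag windowing and fractional-lag refinement, and the search range of 20 to 143 samples (about 56 to 400 Hz at 8 kHz) is an assumed value. It does show where the extra load comes from, since every quantity is a sum over the analysed history.

```python
import math


def autocorr(x, max_lag):
    """r[k] = sum_n x[n] * x[n - k] for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Coefficients a_1..a_p of A(z) = 1 + sum_i a_i z^-i from autocorrelation r."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # degenerate (e.g. silent) input: stop early
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a[1:]


def estimate_ltp_lag(x, min_lag=20, max_lag=143):
    """Lag in [min_lag, max_lag] maximizing the normalized correlation of x."""
    best_lag, best_score = min_lag, -math.inf
    for lag in range(min_lag, max_lag + 1):
        cur, old = x[lag:], x[:len(x) - lag]
        num = sum(c * o for c, o in zip(cur, old))
        den = math.sqrt(sum(c * c for c in cur) * sum(o * o for o in old))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score


history = [float(n % 40) for n in range(320)]  # 40 ms sawtooth, period 40 samples
lpc = levinson_durbin(autocorr(history, 10), 10)
print(estimate_ltp_lag(history))  # (40, 1.0)
```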
It should be noted that, for consecutive erasures, these analyses need to be performed only once, during the first erased frame; the parameters thus estimated (optionally slightly attenuated according to the length of the erasure) are then used throughout the extrapolation.
In other words, the additional calculation load due to the analyses of the past signal is the same whether the erased frame lasts 5 ms or 40 ms.
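The bookkeeping implied by performing the analyses only once per erasure can be sketched as a small state object. `analyze` is a hypothetical callback standing in for the LPC, LTP and classification analyses, and the per-frame attenuation factor of 0.9 is an assumed value; the point is that the analysis cost is paid once, on the first erased frame of a burst, however long the burst lasts.

```python
class ConcealmentState:
    """Caches extrapolation parameters across a burst of erased frames."""

    def __init__(self, analyze, attenuation_per_frame=0.9):
        self.analyze = analyze  # hypothetical costly analysis of the past signal
        self.attenuation_per_frame = attenuation_per_frame
        self.params = None
        self.erased_count = 0
        self.analysis_calls = 0

    def on_good_frame(self):
        # A valid frame ends the burst; the next erasure needs a fresh analysis.
        self.params = None
        self.erased_count = 0

    def on_erased_frame(self, past_signal):
        if self.params is None:  # first erased frame of the burst
            self.params = self.analyze(past_signal)
            self.analysis_calls += 1
        gain = self.attenuation_per_frame ** self.erased_count
        self.erased_count += 1
        return self.params, gain


# A 40 ms erasure made of eight 5 ms frames triggers a single analysis.
state = ConcealmentState(lambda past: {"ltp_lag": 40})
gains = [state.on_erased_frame([])[1] for _ in range(8)]
print(state.analysis_calls)  # 1
```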
However, to dimension the hardware platform (for example a DSP, for “Digital Signal Processor”), the most unfavorable case, that is to say the maximum complexity, must be taken into account. This worst case arises with short frames.
Indeed, the analyses of the past signal (LPC, LTP, classification) require a given number of operations per frame, independently of the frame size, whereas complexity is measured in operations per second. Since the number of operations per second equals the number of operations per frame divided by the frame length, it is inversely proportional to the frame length: the shorter the frame, the higher the complexity.
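The inverse proportionality can be checked with a one-line model. The figure of 20,000 weighted operations per analysis is a hypothetical value chosen for round numbers, not a measurement of any codec.

```python
def wmops(weighted_ops_per_frame, frame_ms):
    """Weighted million operations per second for a fixed per-frame cost."""
    frames_per_second = 1000.0 / frame_ms
    return weighted_ops_per_frame * frames_per_second / 1e6


for frame_ms in (20, 10, 5):
    print(frame_ms, "ms ->", wmops(20_000, frame_ms), "WMOPS")
# 20 ms -> 1.0 WMOPS
# 10 ms -> 2.0 WMOPS
# 5 ms -> 4.0 WMOPS
```

Halving the frame length doubles the per-second cost of a fixed per-frame analysis, which is why short frames set the worst case for dimensioning.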
The mean complexity is also an important parameter, since it affects the energy consumption of the processor and thus the battery life of the equipment in which it is housed, such as a mobile terminal.
In certain cases, this calculation load remains reasonable and comparable to that of normal decoding. For example, for the G.722 standardized coder, a low-complexity erased-frame concealment algorithm has been standardized in ITU-T Recommendation G.722 Appendix IV. The complexity of extrapolating an erased 10 ms frame is then 3 WMOPS (for “Weighted Million Operations Per Second”), which is substantially the same as the complexity of decoding a valid frame.
This no longer holds if the G.722 coder processes shorter frames, for example 5 ms frames.
Moreover, the complexity of such an erased-frame concealment algorithm may be prohibitive for very low complexity coders, such as the ITU-T G.711 (PCM) coder and its extensions, for example the G.711 WB coder undergoing standardization, in particular for decoding the low band, which is sampled at 8 kHz and coded by a G.711 coder followed by an enhancement layer.
Indeed, the complexity of PCM coding/decoding is of the order of 0.3 WMOPS, whereas that of an effective erased-frame concealment algorithm is typically of the order of 3 WMOPS for 10 ms frames.