High-quality audio transmission typically uses transform-based coding schemes. The input audio signal is usually processed in time blocks, called frames, of a certain size, e.g. 20 ms. A frame is transformed by a suitable transform, such as the Modified Discrete Cosine Transform (MDCT), and the transform coefficients are then quantized and transmitted over the network.
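As an illustration of the transform-and-quantize step, the following is a minimal sketch and not the scheme of any particular codec. It assumes a 16 kHz sampling rate, a direct-form MDCT without windowing or overlap handling, and a uniform scalar quantizer with a hypothetical step size; the names `mdct`, `quantize` and `dequantize` are introduced here for illustration only.

```python
import math

SAMPLE_RATE = 16000                          # assumed sampling rate in Hz
FRAME_MS = 20                                # frame size mentioned in the text
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000   # samples per frame


def mdct(block):
    """Direct-form MDCT: 2N time samples in, N coefficients out.

    Windowing and overlap-add are deliberately omitted in this sketch.
    """
    n_coefs = len(block) // 2
    return [
        sum(
            x * math.cos(math.pi / n_coefs * (n + 0.5 + n_coefs / 2) * (k + 0.5))
            for n, x in enumerate(block)
        )
        for k in range(n_coefs)
    ]


def quantize(coefs, step=0.5):
    """Uniform scalar quantizer (assumed): coefficient -> integer index."""
    return [round(c / step) for c in coefs]


def dequantize(indices, step=0.5):
    """Reconstruct coefficient values from the transmitted indices."""
    return [i * step for i in indices]


# A short block of 16 samples yields 8 coefficients.
block = [math.sin(2 * math.pi * 3 * n / 16) for n in range(16)]
indices = quantize(mdct(block))
reconstructed = dequantize(indices)
```

Only the indices would be packetized and transmitted; the decoder recovers approximate coefficients from them.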
However, when an audio codec is operated in a communication system that includes wireless or packet networks, a frame could get lost in transmission, or arrive too late to be used in a real-time scenario. A similar problem arises when the data within a frame has been corrupted and the codec is set to discard such corrupted frames. These events are called frame erasure or packet loss. When one occurs, the decoder typically invokes certain algorithms to avoid or reduce the degradation in audio quality caused by the frame erasure. Such algorithms are called frame erasure (or error) concealment algorithms (FEC) or packet loss concealment algorithms (PLC).
FIG. 1 illustrates an audio signal input to an encoder 10. A transform to a frequency domain is performed in step S1, a quantization is performed in step S2, and a packetization and transmission of the quantized frequency coefficients (represented by indices) is performed in step S3. After transmission, the packets are received by a decoder 12 in step S4, and the frequency coefficients are reconstructed in step S5, wherein a frame erasure (or error) concealment algorithm is performed, as indicated by an FEC unit 14. The reconstructed frequency coefficients are inverse transformed to the time domain in step S6. Thus, FIG. 1 is a system overview in which transmission errors are handled at the audio decoder 12 in the process of parameter/waveform reconstruction, and a frame erasure concealment algorithm performs a reconstruction of lost or corrupt frames.
The purpose of error concealment is to synthesize parts of the audio signal that do not arrive at the decoder, do not arrive on time, or are corrupt. When additional delay can be tolerated and/or additional bits are available, one could use various powerful FEC concepts based e.g. on interpolating a lost frame between two good frames or on transmitting essential side information.
However, in a real-time conversational scenario it is typically not possible to introduce additional delay, and rarely possible to increase the bit budget or the computational complexity of the algorithm. Three exemplary FEC approaches for a real-time scenario are the following: (1) Muting, wherein missing spectral coefficients are set to zero; (2) Repetition, wherein coefficients from the last good frame are repeated; and (3) Noise injection, wherein missing spectral coefficients are the output of a random noise generator.
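The three approaches can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and scaling the injected noise to the RMS level of the last good frame is an assumption made here, since the text only states that the coefficients come from a random noise generator.

```python
import math
import random


def conceal_mute(last_good):
    """(1) Muting: every missing spectral coefficient is set to zero."""
    return [0.0] * len(last_good)


def conceal_repeat(last_good):
    """(2) Repetition: reuse the coefficients of the last good frame."""
    return list(last_good)


def conceal_noise(last_good, rng=None):
    """(3) Noise injection: coefficients drawn from a random generator.

    Assumption: Gaussian noise matched to the RMS of the last good frame
    (which must be non-empty). Other noise models are equally possible.
    """
    rng = rng or random.Random(0)
    rms = math.sqrt(sum(c * c for c in last_good) / len(last_good))
    return [rng.gauss(0.0, rms) for _ in last_good]


last_good_frame = [4.0, -2.0, 1.0, 0.5]
muted = conceal_mute(last_good_frame)
repeated = conceal_repeat(last_good_frame)
noisy = conceal_noise(last_good_frame, random.Random(1))
```

None of the three needs a future frame or extra transmitted bits, which is why they suit a real-time scenario.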
An example of an FEC algorithm that is commonly used by transform-based codecs is a frame-repeat algorithm, which uses the repetition approach and repeats the transform coefficients of the previously received frame, sometimes with a scaling factor, for example as described in [1]. The repeated transform coefficients are then used to reconstruct the audio signal for the lost frame. Frame-repeat algorithms and algorithms for inserting noise or silence are attractive because they have low computational complexity and do not require any extra bits to be transmitted or any extra delay. However, the error concealment may degrade the reconstructed signal. For example, a muting-based FEC scheme could create large energy discontinuities and a poor perceived quality, and the use of a noise injection algorithm could have a negative perceptual impact, especially when applied to a region with prominent tonal components.
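A frame-repeat decoder loop with a scaling factor might look like the sketch below. It is not the algorithm of [1]: the function name, the use of `None` to mark an erased frame, the scaling factor of 0.5, and the choice to let the attenuation compound over consecutive losses are all assumptions made for illustration.

```python
def decode_with_frame_repeat(received, frame_len, scale=0.5):
    """Return one coefficient frame per input frame.

    received  -- list of coefficient lists; None marks an erased frame
    frame_len -- number of coefficients per frame
    scale     -- hypothetical attenuation applied to a repeated frame
    """
    previous = [0.0] * frame_len   # nothing to repeat before the first frame
    output = []
    for frame in received:
        if frame is not None:
            current = list(frame)                      # good frame: use as is
        else:
            current = [scale * c for c in previous]    # erased: scaled repeat
        output.append(current)
        previous = current   # consecutive losses are attenuated further
    return output


def energy(frame):
    """Sum of squared coefficients, useful for spotting discontinuities."""
    return sum(c * c for c in frame)


received = [[4.0, -2.0], None, None, [1.0, 1.0]]
decoded = decode_with_frame_repeat(received, frame_len=2)
energies = [energy(f) for f in decoded]
```

With muting instead of scaled repetition, the energy of the two erased frames would drop straight to zero, which is the kind of large energy discontinuity noted above.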
Another approach, described in [2], involves transmission of side information for reconstruction of erroneous frames by interpolation. A drawback of this method is that it requires extra bandwidth for the side information. For MDCT coefficients without side information available, amplitudes are estimated by interpolation, whereas signs are estimated by using a probabilistic model that requires a large number of past frames (50 are suggested), which may not be available in practice.
A rather complex interpolation method with multiplicative corrections for reconstruction of lost frames is described in [3].
A further drawback of interpolation-based frame error concealment methods is that they introduce extra delay (the frame after the erroneous frame has to be received before any interpolation may be attempted), which may not be acceptable in real-time applications such as conversational applications.
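The delay cost can be made concrete with a simple sketch. It is not the method of [2] or [3]: it assumes plain linear interpolation of coefficient magnitudes between the neighbouring good frames, and it copies each sign from the previous frame purely as a placeholder. The function names are hypothetical.

```python
import math


def interpolate_lost_frame(prev_good, next_good):
    """Estimate an erased frame from the good frames on either side.

    Magnitudes are the mean of the neighbouring magnitudes; the sign is
    copied from the previous frame (an assumption for illustration only).
    """
    return [
        math.copysign(0.5 * (abs(p) + abs(n)), p)
        for p, n in zip(prev_good, next_good)
    ]


def added_delay_ms(frame_ms=20, lookahead_frames=1):
    """Extra delay from waiting for frames that follow the erased one."""
    return frame_ms * lookahead_frames


estimate = interpolate_lost_frame([2.0, -4.0], [4.0, 2.0])
delay = added_delay_ms()
```

Because `next_good` must be available before `interpolate_lost_frame` can run, the decoder has to hold back its output by at least one frame, i.e. 20 ms for the frame size used as an example earlier.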