A media encoder is a device, circuitry or computer program that is capable of analyzing an information stream such as an audio, video or image data stream and outputting an information stream representing the media in an encoded form. The resulting information is often used for transmission, storage and/or encryption purposes. A decoder, on the other hand, is a device, circuitry or computer program that is capable of inverting the encoder operation, in that it receives the encoded information stream and outputs a decoded media stream.
In most state-of-the-art audio and video encoders, each frame of the input signal is analyzed in the frequency domain. The result of this analysis is quantized and encoded and then transmitted or stored depending on the application. At the receiving side, or when using the stored encoded signal, a decoding procedure followed by a synthesis procedure restores the signal in the time domain.
Codecs are often employed for compression/decompression of information such as audio and video data for efficient transmission over bandwidth-limited communication channels.
The most common audio and video codecs are sub-band codecs and transform codecs. A sub-band codec is based around a filter bank, and a transform codec is normally based around a time-to-frequency domain transform such as the DCT (Discrete Cosine Transform). However, these two types of codecs can be regarded as mathematically equivalent. In a sense they are based on the same principle, where a transform codec can be seen as a sub-band codec with a large number of sub-bands.
A common characteristic of these codecs is that they operate on blocks of samples, known as frames. The coding coefficients resulting from a transform analysis or a sub-band analysis of each frame are quantized according to a dynamic bit allocation, which may vary from frame to frame. The decoder, upon reception of the bit-stream, computes the bit allocations and decodes the encoded coefficients.
In packet-based communications, the quantized coding coefficients and/or parameters may be grouped in packets. A packet may contain data relevant to several frames, to a single frame, or to only part of a frame.
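The frame-based analysis, quantization and synthesis chain described above can be illustrated with a minimal sketch. It is an illustration only and makes several simplifying assumptions that are not taken from the text: non-overlapping frames, an orthonormal DCT computed in its direct form, and a single fixed quantizer step (the hypothetical parameter STEP) in place of a dynamic bit allocation. All function names are hypothetical.

```python
import math

def dct(frame):
    """Orthonormal DCT-II of one frame (direct O(N^2) form, for clarity)."""
    n = len(frame)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                for i, x in enumerate(frame))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct(coeffs):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        s = coeffs[0] * math.sqrt(1.0 / n)
        s += sum(c * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                 for k, c in enumerate(coeffs) if k > 0)
        out.append(s)
    return out

def encode_frame(frame, step):
    """Analyze one frame and quantize each coefficient to an integer index."""
    return [round(c / step) for c in dct(frame)]

def decode_frame(indices, step):
    """Dequantize the indices and synthesize the time-domain frame."""
    return idct([q * step for q in indices])

FRAME_LEN = 8   # assumed frame length in samples
STEP = 0.05     # assumed fixed quantizer step (no dynamic bit allocation)

signal = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
frames = [signal[i:i + FRAME_LEN] for i in range(0, len(signal), FRAME_LEN)]

decoded = []
for f in frames:
    decoded.extend(decode_frame(encode_frame(f, STEP), STEP))

# Largest reconstruction error caused by quantization alone.
max_error = max(abs(a - b) for a, b in zip(signal, decoded))
```

In this sketch the only loss is the rounding in encode_frame; a practical codec would instead spend more bits on some coefficients than on others and would signal or derive that allocation at the decoder.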
Under adverse channel conditions, the encoded/compressed information from the coder may get lost or arrive at the decoding side with errors. In general, transmission of audio, video and other relevant data under adverse channel conditions has become one of the most challenging problems today. In order to alleviate the effect of errors introduced by packet losses or corrupted data during transmission, so-called error concealment is often employed to reduce the degradation of the quality of audio, video or other data represented by the coding coefficients.
Error concealment schemes typically rely on producing a replacement for the quantized coding coefficient(s) of a lost or, more generally speaking, erroneous packet that is similar to the original. This is possible since information such as audio, and in particular speech, exhibits large amounts of short-term self-similarity. As such, these techniques work optimally for relatively small loss rates (10%) and for small packets (4-40 ms).
A technique known in the field of information transmission over unreliable channels is multiple description coding. The coder generates several different descriptions of the same audio signal and the decoder is able to produce a useful reconstruction of the original audio signal with any subset of the encoded descriptions. This technique assumes that errors or losses occur independently for each description. This would mean that each description would be transmitted on its own channel or that the descriptions share the same channel but are displaced in time with respect to each other. In this case the probability that the decoder receives valid data at each moment is high. The loss of one description can therefore be bridged by the availability of another description of the same signal. The method obviously increases the overall delay between the transmitter and the receiver. Furthermore, either the data rate has to be increased or some quality has to be sacrificed in order to allow the increase in redundancy.
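As an illustration of the multiple description idea, the following minimal sketch splits a signal into two descriptions made of the even-indexed and the odd-indexed samples. This particular split, the linear interpolation used when one description is missing, and all function names are assumptions chosen for brevity; they are not taken from the text. The sketch assumes at least two samples and that each description travels independently.

```python
def make_descriptions(samples):
    """Split the signal into two descriptions: even-indexed and odd-indexed samples."""
    return samples[0::2], samples[1::2]

def reconstruct(even, odd, n):
    """Rebuild n samples from whichever descriptions arrived (None means lost)."""
    if even is None and odd is None:
        raise ValueError("both descriptions lost; nothing to reconstruct from")
    if even is not None and odd is not None:
        out = [0.0] * n
        out[0::2] = even
        out[1::2] = odd
        return out
    # Only one description arrived: place its samples, then interpolate the gaps.
    known, offset = (even, 0) if even is not None else (odd, 1)
    out = [None] * n
    for j, value in enumerate(known):
        out[offset + 2 * j] = value
    for i in range(n):
        if out[i] is None:
            left = out[i - 1] if i > 0 else None
            right = out[i + 1] if i + 1 < n else None
            neighbours = [v for v in (left, right) if v is not None]
            out[i] = sum(neighbours) / len(neighbours)
    return out

samples = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
even, odd = make_descriptions(samples)

both = reconstruct(even, odd, len(samples))        # exact reconstruction
only_even = reconstruct(even, None, len(samples))  # degraded but usable
only_odd = reconstruct(None, odd, len(samples))    # degraded but usable
```

With both descriptions the signal is recovered exactly; with only one, the output is an approximation, which reflects the trade-off between redundancy and quality noted above.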
In the case of block or frame oriented transform codecs, the estimation of missing signal intervals can be done in the time domain, i.e. at the output of the decoder, or in the frequency domain, i.e. internally to the decoder.
In the time domain, several error concealment techniques are already known in the prior art. Rudimentary techniques such as muting-based methods repair losses by muting the output signal for as long as the data is erroneous. The erroneous data is replaced by a zero signal. Although very simple, this method leads to very unpleasant effects due to the perceived discontinuities it introduces with sudden drops in signal energy.
The method of repetition is very similar to the muting technique, but instead of replacing the data by a zero signal when erroneous data occur, it repeats a part of the data that was last received. This method performs better than muting at the expense of increased memory consumption. The performance of this method is however limited and some quite annoying artifacts occur. For instance, if the last received frame was a drumbeat, then the latter is repeated, which may lead to a double drumbeat where only one drumbeat was expected. Other artifacts may occur if, for instance, the repetition period is short, which introduces a buzzy sound due to a comb filtering effect.
Other more sophisticated techniques aim at interpolating the audio signal by, for example, waveform substitution, pitch-based waveform replication or time scale modification. These techniques perform much better than the previously described rudimentary techniques. However, they are considerably more complex. Moreover, the amount of delay that is required to perform the interpolation is, in many cases, unacceptable.
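A minimal sketch of muting-based concealment is given below. It assumes, for illustration only, that the decoder output is a list of fixed-length frames in which a lost or erroneous frame is marked as None; the function name is hypothetical.

```python
def conceal_by_muting(frames, frame_len):
    """Replace every lost frame (marked None) with a zero signal."""
    out = []
    for frame in frames:
        out.extend(frame if frame is not None else [0.0] * frame_len)
    return out

# The second frame is lost in this example.
received = [[0.5, 0.6], None, [0.7, 0.8]]
output = conceal_by_muting(received, frame_len=2)
```

The abrupt jump from 0.6 to 0.0 and back to 0.7 in the output corresponds to the sudden energy drop and discontinuity described above.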
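Repetition can be sketched in the same way. As an assumption made only for this illustration, lost frames are marked as None, whole frames are repeated, and silence is used if the very first frame is lost; the function name is hypothetical.

```python
def conceal_by_repetition(frames, frame_len):
    """Replace every lost frame (marked None) with the last correctly received frame."""
    out = []
    last_good = [0.0] * frame_len  # fallback if no frame has been received yet
    for frame in frames:
        if frame is not None:
            last_good = frame      # the extra memory this method requires
        out.extend(last_good)
    return out

# Two consecutive frames are lost in this example.
received = [[0.5, 0.6], None, None, [0.7, 0.8]]
output = conceal_by_repetition(received, frame_len=2)
```

The buffer last_good is the additional memory cost mentioned above, and the periodic repetition of the same short frame over a long loss is what gives rise to the comb filtering effect.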
Techniques well known in the literature of audio restoration, e.g. [1], [2], [3], offer useful insights, and in fact deal with similar problems.
Error concealment in the frequency domain has been considered in [4], [5]. In the case of the DCT, it is found that a simple concealment technique is to clip large DCT coefficients.
In [6], a data substitution approach is employed with a hearing-adjusted choice of the spectral energy. More particularly, a pattern is found in the intact audio data prior to the occurrence of erroneous data. When this pattern is found, replacement data is determined based on this pattern.
In [7], a frequency-domain error concealment technique is described. The described technique is quite general and applies to transform coders. It uses prediction in order to restore the lost or erroneous coefficients. The prediction of an erroneous bin/frequency channel coefficient is based on the past coefficients of the same bin/channel, and may thus consider how the phase in a bin/frequency channel evolves over time in an attempt to preserve the so-called horizontal phase coherence. In some cases, this technique may provide quite satisfactory results.
However, the error concealment technique proposed in [7] generally results in a loss of so-called vertical phase coherence, which may lead to frame discontinuities and perceived artifacts.
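The general idea of predicting a lost coefficient from the past coefficients of the same bin can be sketched as follows. This is not the method of [7], whose details are not given here; it is an assumed, simplified predictor for complex-valued transform coefficients that keeps the last magnitude of each bin and advances its phase by the phase increment observed between the two most recent frames. All names are hypothetical.

```python
import cmath

def predict_bin(prev2, prev1):
    """Extrapolate one bin: keep the last magnitude, continue the last phase step."""
    if prev2 == 0 or prev1 == 0:
        return prev1               # no usable phase history for this bin
    rotation = prev1 / prev2       # complex ratio between the last two frames
    rotation /= abs(rotation)      # keep only the phase increment
    return prev1 * rotation

def conceal_frame(history):
    """Predict a lost frame from the two most recent frames, bin by bin."""
    prev2, prev1 = history[-2], history[-1]
    return [predict_bin(a, b) for a, b in zip(prev2, prev1)]

# Two bins, each with a constant magnitude and a constant phase advance per frame.
def frame_at(t):
    return [cmath.rect(1.0, 0.3 + 0.5 * t), cmath.rect(0.5, -1.0 + 1.2 * t)]

history = [frame_at(0), frame_at(1)]
predicted = conceal_frame(history)   # estimate of the lost frame at t = 2
expected = frame_at(2)
```

Each bin is extrapolated independently of its neighbours, so the phase evolution over time within a bin is continued, but nothing in the sketch constrains the phase relationship across bins within the predicted frame.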
In [8], Wiese et al. describe a technique for error concealment that is based on switching between several masking strategies, which include at least muting a sub-band and repeating or estimating the sub-band.