An encoder is a device, circuitry or computer program that is capable of analyzing a signal, such as an audio signal, and outputting a signal in an encoded form. The resulting signal is often used for transmission, storage and/or encryption purposes. A decoder, on the other hand, is a device, circuitry or computer program that is capable of inverting the encoder operation, in that it receives the encoded signal and outputs a decoded signal.
In most state-of-the-art encoders, such as audio encoders, each frame of the input signal is analyzed in the frequency domain. The result of this analysis is quantized and encoded, and then transmitted or stored depending on the application. At the receiving side (or when using the stored encoded signal), a corresponding decoding procedure followed by a synthesis procedure makes it possible to restore the signal in the time domain.
Codecs are often employed for compression/decompression of information such as audio and video data for efficient transmission over bandwidth-limited communication channels.
In particular, there is strong market demand for transmitting and storing audio signals at low bit rates while maintaining high audio quality. For example, where transmission resources or storage are limited, low bit rate operation is an essential cost factor. This is typically the case in streaming and messaging applications in mobile communication systems.
A general example of an audio transmission system using audio encoding and decoding is schematically illustrated in FIG. 1. The overall system basically comprises an audio encoder 10 and a transmission module (TX) 20 on the transmitting side, and a receiving module (RX) 30 and an audio decoder 40 on the receiving side.
It is commonly acknowledged that special care has to be taken when dealing with non-stationary signals, in audio coding applications in particular and in signal compression in general. In audio coding, an artifact known as pre-echo distortion can arise in so-called transform coders.
Transform coders or more generally transform codecs (coder-decoder) are normally based around a time-to-frequency domain transform such as a DCT (Discrete Cosine Transform), a Modified Discrete Cosine Transform (MDCT) or another lapped transform. A common characteristic of transform codecs is that they operate on overlapped blocks of samples: overlapped frames. The coding coefficients resulting from a transform analysis or an equivalent sub-band analysis of each frame are normally quantized and stored or transmitted to the receiving side as a bit-stream. The decoder, upon reception of the bit-stream, performs dequantization and inverse transformation in order to reconstruct the signal frames.
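The overlapped-frame analysis and synthesis described above can be sketched as follows. This is a minimal illustration, assuming an MDCT with a sine window and 50% overlap, and with quantization omitted so that only the transform round trip is shown. The function names (sine_window, mdct, imdct, code_and_reconstruct) and the test signal are hypothetical and not taken from any particular codec.

```python
import math

def sine_window(n_half):
    """Sine window of length 2*n_half; satisfies w[n]^2 + w[n + n_half]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * n_half)) for n in range(2 * n_half)]

def mdct(frame, window):
    """Windowed MDCT: 2N time samples -> N coefficients."""
    n_half = len(frame) // 2
    return [
        sum(window[n] * frame[n]
            * math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
            for n in range(2 * n_half))
        for k in range(n_half)
    ]

def imdct(coeffs, window):
    """Windowed inverse MDCT: N coefficients -> 2N time-aliased samples."""
    n_half = len(coeffs)
    return [
        window[n] * (2.0 / n_half)
        * sum(coeffs[k]
              * math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
              for k in range(n_half))
        for n in range(2 * n_half)
    ]

def code_and_reconstruct(signal, n_half):
    """Analyze 50%-overlapped frames and overlap-add the inverse transforms.

    Assumes len(signal) is a multiple of n_half. Zero padding on both sides
    ensures every input sample is covered by exactly two frames.
    """
    window = sine_window(n_half)
    padded = [0.0] * n_half + list(signal) + [0.0] * n_half
    output = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * n_half + 1, n_half):
        frame = padded[start:start + 2 * n_half]
        rebuilt = imdct(mdct(frame, window), window)
        for n, value in enumerate(rebuilt):
            output[start + n] += value  # overlap-add cancels the time aliasing
    return output[n_half:n_half + len(signal)]

# Illustrative input: 32 samples, frames of 16 samples hopping by 8.
signal = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(32)]
reconstructed = code_and_reconstruct(signal, n_half=8)
max_error = max(abs(a - b) for a, b in zip(signal, reconstructed))
```

Without quantization, the overlap-add of adjacent frames restores the input up to floating-point rounding. In an actual codec the coefficients returned by mdct would be quantized before transmission, which is where coding noise enters.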
Pre-echoes generally occur when a signal with a sharp attack begins near the end of a transform block immediately following a region of low energy.
This situation occurs, for instance, when encoding the sound of percussion instruments such as castanets or a glockenspiel. In a block-based algorithm, the transform coefficients are quantized, and the inverse transform at the decoder side spreads the resulting quantization noise evenly in time over the whole block. This results in unmasked distortion in the low-energy region preceding the signal attack, as illustrated in FIGS. 2A and 2B, where FIG. 2A illustrates the original percussion sound, and FIG. 2B illustrates the transform-coded signal showing the time spreading of coding noise leading to pre-echo distortion.
Temporal pre-masking is a psychoacoustic property of human hearing that has the potential to mask this distortion; however, this is only possible when the transform block size is small enough for pre-masking to occur.
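The time spreading of quantization noise can be reproduced with a small numerical sketch. It assumes a single non-overlapped block, an orthonormal DCT-II in place of a lapped transform, and a uniform quantizer; the block length, attack position, attack samples and step size are all illustrative values chosen for this example.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a block."""
    n_samples = len(x)
    out = []
    for k in range(n_samples):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n_samples)
        out.append(scale * sum(x[n] * math.cos(math.pi * (n + 0.5) * k / n_samples)
                               for n in range(n_samples)))
    return out

def idct(coeffs):
    """Inverse of the orthonormal DCT-II above (a DCT-III)."""
    n_samples = len(coeffs)
    return [
        sum(math.sqrt((1.0 if k == 0 else 2.0) / n_samples) * coeffs[k]
            * math.cos(math.pi * (n + 0.5) * k / n_samples)
            for k in range(n_samples))
        for n in range(n_samples)
    ]

BLOCK = 64        # block length in samples (illustrative)
ATTACK_AT = 56    # the attack starts near the end of the block
STEP = 0.1        # uniform quantizer step size (illustrative)

# Silence followed by a short, sharp attack.
attack = [1.0, -0.8, 0.6, -0.4, 0.3, -0.2, 0.1, -0.05]
block = [0.0] * ATTACK_AT + attack

# Encoder: transform and quantize. Decoder: dequantize and inverse transform.
quantized = [STEP * round(c / STEP) for c in dct(block)]
decoded = idct(quantized)

# Coding noise before and after the attack.
noise = [d - x for d, x in zip(decoded, block)]
original_pre_energy = sum(x * x for x in block[:ATTACK_AT])
pre_echo_energy = sum(e * e for e in noise[:ATTACK_AT])
post_attack_noise_energy = sum(e * e for e in noise[ATTACK_AT:])
```

The original block is exactly silent before the attack, so any energy measured in pre_echo_energy is coding noise that the inverse transform has placed ahead of the attack, where no signal is available to mask it.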
Pre-echo Artifact Mitigation (Prior Art)
In order to avoid this undesirable artifact, several methodologies have been proposed and successfully applied. Some of these technologies have been standardized and are widespread in commercial applications.
Bit Reservoir Techniques
The idea behind the bit reservoir technique is to save some bits from frames that are “easy” to encode in the frequency domain. The saved bits are thereafter used to accommodate highly demanding frames, such as transient frames. This results in a variable instantaneous bit rate which, with some tuning, can be made to average out to a constant bit rate. The major drawback, however, is that very large reservoirs are needed in order to deal with certain transients, and this leads to a very large delay, making the technique of little interest for conversational applications. In addition, this methodology only slightly mitigates the pre-echo artifact.
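A bit reservoir can be sketched as a simple running budget. The function below is a hypothetical illustration, not a standardized procedure: it assumes each frame states a bit demand, the channel supplies a fixed nominal number of bits per frame, and unused bits beyond the reservoir capacity are simply discarded.

```python
def allocate_with_reservoir(demands, nominal_bits, reservoir_capacity):
    """Grant bits per frame, banking unused bits up to reservoir_capacity."""
    reservoir = 0
    grants = []
    for demand in demands:
        available = nominal_bits + reservoir   # this frame's budget
        grant = min(demand, available)         # cannot spend more than is banked
        # Unused bits are saved for later frames, capped by the reservoir size.
        reservoir = min(reservoir_capacity, available - grant)
        grants.append(grant)
    return grants

# Two "easy" frames, one demanding transient frame, one easy frame.
demands = [50, 50, 300, 50]
grants = allocate_with_reservoir(demands, nominal_bits=100, reservoir_capacity=150)
```

With these illustrative numbers the transient frame asks for 300 bits but receives only 200, because the two preceding frames banked 100 bits in total. Serving the full demand would need more easy frames beforehand and a larger reservoir, which is the source of the delay drawback noted above.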
Gain Modification and Temporal Noise Shaping
The gain modification approach smooths transient peaks in the time domain prior to spectral analysis and coding. The gain modification envelope is sent as side information, and its inverse is applied to the inverse-transformed signal, thus shaping the temporal coding noise. A major drawback of the gain modification technique is that it modifies the filter bank (e.g. MDCT) analysis window, thereby broadening the frequency response of the filter bank. This may lead to problems at low frequencies, especially if the bandwidth exceeds that of the critical band.
Temporal Noise Shaping (TNS) is inspired by the gain modification technique. The gain modification is applied in the frequency domain and operates on the spectral coefficients. TNS is applied only during input attacks susceptible to pre-echoes. The idea is to apply linear prediction (LP) across frequency rather than time. This is motivated by the fact that, for transients and impulsive signals in general, frequency-domain coding gain is maximized by the use of LP techniques. TNS was standardized in AAC and has been shown to provide good mitigation of pre-echo artifacts. However, the use of TNS involves LP analysis and filtering, which significantly increases the complexity of the encoder and decoder. Additionally, the LP coefficients have to be quantized and sent as side information, which involves further complexity and bit-rate overhead.
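The motivation for predicting across frequency can be illustrated with an idealized case. The sketch below is not the AAC TNS tool: it assumes an unnormalized DCT-II instead of an MDCT, a single unit impulse as the transient, and a second-order predictor fitted by least squares. All function names are hypothetical. An impulse in time becomes a cosine along the frequency axis, which a second-order predictor can follow almost exactly.

```python
import math

def dct_unnormalized(x):
    """Unnormalized DCT-II, so an impulse maps to a pure cosine over k."""
    n_samples = len(x)
    return [sum(x[n] * math.cos(math.pi * (n + 0.5) * k / n_samples)
                for n in range(n_samples))
            for k in range(n_samples)]

def fit_order2_predictor(spectrum):
    """Least-squares a1, a2 such that X[k] ~ a1*X[k-1] + a2*X[k-2].

    Assumes the 2x2 normal equations are non-singular (true for this example).
    """
    r11 = r12 = r22 = p1 = p2 = 0.0
    for k in range(2, len(spectrum)):
        x0, x1, x2 = spectrum[k], spectrum[k - 1], spectrum[k - 2]
        r11 += x1 * x1
        r12 += x1 * x2
        r22 += x2 * x2
        p1 += x0 * x1
        p2 += x0 * x2
    det = r11 * r22 - r12 * r12
    a1 = (p1 * r22 - p2 * r12) / det
    a2 = (p2 * r11 - p1 * r12) / det
    return a1, a2

def prediction_gain(spectrum, a1, a2):
    """Ratio of coefficient energy to prediction-residual energy."""
    signal_energy = sum(spectrum[k] ** 2 for k in range(2, len(spectrum)))
    residual_energy = sum(
        (spectrum[k] - a1 * spectrum[k - 1] - a2 * spectrum[k - 2]) ** 2
        for k in range(2, len(spectrum)))
    if residual_energy == 0.0:
        return float("inf")
    return signal_energy / residual_energy

BLOCK = 64        # block length (illustrative)
IMPULSE_AT = 56   # position of the idealized transient (illustrative)

block = [0.0] * BLOCK
block[IMPULSE_AT] = 1.0

spectrum = dct_unnormalized(block)          # X[k] = cos(omega * k)
a1, a2 = fit_order2_predictor(spectrum)
gain = prediction_gain(spectrum, a1, a2)
omega = math.pi * (IMPULSE_AT + 0.5) / BLOCK
```

For this idealized impulse the fitted coefficients approach the cosine recurrence values a1 = 2*cos(omega) and a2 = -1, and the residual is limited only by rounding. Real transients are less ideal, so practical prediction gains are lower, and the predictor coefficients still have to be transmitted as side information.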
Window Switching
FIG. 3 illustrates window switching (MPEG-1, layer III “mp3”), where transition windows “start” and “stop” are required between the long and short windows to preserve the PR (Perfect Reconstruction) properties. This technique was first introduced by Edler [1] and is popular for pre-echo suppression particularly in the case of MDCT-based transform coding algorithms. Window switching is based on the idea of changing the time resolution of the transform upon detection of a transient. Typically this involves changing the analysis block length from a long duration during stationary signals to a short duration when transients are detected. The idea is based on two considerations:
A short window applied to the short frame containing the transient will minimize the temporal spread of coding noise and allow temporal pre-masking to take effect and render the distortion inaudible.
Higher bit rates can be allocated to the short temporal regions containing the transient.
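The switching logic described above can be sketched in two steps: detect a transient in each frame, then choose a window type per frame so that long and short windows are always separated by transition windows. This is a hypothetical sketch: the energy-ratio detector, its thresholds, and the one-frame lookahead are assumptions made for illustration, and a transient in the very first frame is not handled.

```python
def is_transient(frame, sub_blocks=8, ratio=10.0, floor=1e-9):
    """Flag a frame whose energy jumps sharply between adjacent sub-blocks."""
    size = len(frame) // sub_blocks
    energies = [sum(s * s for s in frame[i * size:(i + 1) * size])
                for i in range(sub_blocks)]
    return any(energies[i] > ratio * (energies[i - 1] + floor)
               for i in range(1, sub_blocks))

def choose_windows(transient_flags):
    """Map per-frame transient flags to long/start/short/stop window types.

    Uses one frame of lookahead so that a "start" window can be placed
    before the first short-window frame.
    """
    windows = []
    previous = "long"
    for i, flag in enumerate(transient_flags):
        next_flag = i + 1 < len(transient_flags) and transient_flags[i + 1]
        if flag:
            window = "short"
        elif previous == "short":
            # Stay short if another transient follows, otherwise transition back.
            window = "short" if next_flag else "stop"
        elif next_flag:
            window = "start"
        else:
            window = "long"
        windows.append(window)
        previous = window
    return windows

# Illustrative frames: stationary, stationary, attack after silence, stationary, stationary.
steady = [0.5] * 64
attack = [0.0] * 56 + [1.0] * 8
frames = [steady, steady, attack, steady, steady]

flags = [is_transient(f) for f in frames]
windows = choose_windows(flags)
# windows == ["long", "start", "short", "stop", "long"]
```

The "start" decision for the second frame depends on knowing that the third frame contains a transient, which shows why a lookahead of at least one frame is needed before a window type can be committed.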
Although window switching has been very successful, it presents significant drawbacks. For instance, the perceptual model and lossless coding modules of the codec have to support different time resolutions, which usually translates into increased complexity. In addition, when using lapped transforms such as the MDCT, window switching needs to insert transition windows between short and long blocks in order to satisfy the perfect reconstruction constraints, as illustrated in FIG. 3. The need for transition windows generates further drawbacks, namely an increased delay, because windows cannot be switched instantaneously, and the poor frequency localization of the transition windows, which leads to a dramatic reduction in coding gain.