The present invention is related to audio processing and, particularly, to audio processing in the context of audio pre-processing and audio post-processing.
Pre-Echoes: The Temporal Masking Problem
Classic filterbank-based perceptual coders like MP3 or AAC are primarily designed to exploit the perceptual effect of simultaneous masking, but also have to deal with the temporal aspect of the masking phenomenon: Noise is masked a short time prior to and after the presentation of a masking signal (pre-masking and post-masking phenomena). Post-masking is observed for a much longer period of time than pre-masking (on the order of 10.0-50.0 ms instead of 0.5-2.0 ms, depending on the level and duration of the masker).
Thus, the temporal aspect of masking leads to an additional requirement for a perceptual coding scheme: In order to achieve perceptually transparent coding quality the quantization noise also must not exceed the time-dependent masked threshold.
In practice, this requirement is not easy to meet for perceptual coders because using a spectral signal decomposition for quantization and coding implies that a quantization error introduced in this domain will be spread out in time after reconstruction by the synthesis filterbank (time/frequency uncertainty principle). For commonly used filterbank designs (e.g. a 1024-line MDCT) this means that the quantization noise may be spread out over a period of more than 40 milliseconds at CD sampling rate. This will lead to problems when the signal to be coded contains strong signal components only in parts of the analysis filterbank window, i.e. for transient signals. In particular, quantization noise is spread out before the onsets of the signal and in extreme cases may even exceed the original signal components in level during certain time intervals. A well-known example of a critical percussive signal is a castanets recording where, after decoding, quantization noise components are spread out a certain time before the “attack” of the original signal. Such a constellation is traditionally known as a “pre-echo phenomenon” [Joh92b].
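The figure of more than 40 milliseconds can be checked with a short calculation. The sketch below assumes that an MDCT with N spectral lines uses a window spanning 2N time samples and takes the CD sampling rate as 44.1 kHz; the function name is illustrative only.

```python
def window_duration_ms(num_spectral_lines: int, sample_rate_hz: float) -> float:
    """Duration of one MDCT window, assumed to span 2 * N samples for N spectral lines."""
    window_length = 2 * num_spectral_lines
    return 1000.0 * window_length / sample_rate_hz


# 1024-line MDCT at the CD sampling rate of 44.1 kHz
spread_ms = window_duration_ms(1024, 44100.0)
print(f"{spread_ms:.1f} ms")  # prints "46.4 ms"
```

The result is more than an order of magnitude longer than the pre-masking interval of roughly 2 ms, which is why noise spread across the whole window becomes audible ahead of a transient.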
Due to the properties of the human auditory system, such “pre-echoes” are masked only if no significant amount of coding noise is present longer than ca. 2.0 ms before the onset of the signal. Otherwise, the coding noise will be perceived as a pre-echo artifact, i.e. a short noise-like event preceding the signal onset. In order to avoid such artifacts, care has to be taken to maintain appropriate temporal characteristics of the quantization noise such that it will still satisfy the conditions for temporal masking. This temporal noise shaping problem has traditionally made it difficult to achieve a good perceptual signal quality at low bit-rates for transient signals like castanets, glockenspiel, triangle etc.
Applause-Like Signals: An Extremely Critical Class of Signals
While the previously mentioned transient signals may trigger pre-echoes in perceptual audio codecs, they exhibit single isolated attacks, i.e. there is a certain minimum time until the next attack appears. Thus, a perceptual coder has some time to recover from processing the last attack and can, e.g., collect again spare bits to cope with the next attack (see ‘bit reservoir’ as described below). In contrast to this, the sound of an applauding audience consists of a steady stream of densely spaced claps, each of which is a transient event of its own. FIG. 11 shows an illustration of the high frequency temporal envelope of a stereo applause signal. As can be seen, the average time between subsequent clap events is significantly below 10 ms.
For this reason, applause and applause-like signals (like rain drops or crackling fireworks) constitute a class of signals that are extremely difficult to code while being common to many live recordings. This is also true when employing parametric methods for joint coding of two or more channels [Hot08].
Traditional Approaches to Coding of Transient Signals
A set of techniques has been proposed in order to avoid pre-echo artifacts in the encoded/decoded signal:
Pre-Echo Control and Bit Reservoir
One way is to increase the coding precision for the spectral coefficients of the filterbank window that first covers the transient signal portion (so-called “pre-echo control”, [MPEG1]). Since this considerably increases the number of bits needed for the coding of such frames, this method cannot be applied in a constant bit rate coder. To a certain degree, local variations in bit rate demand can be accounted for by using a bit reservoir ([Bra87], [MPEG1]). This technique makes it possible to handle peak demands in bit rate using bits that have been set aside during the coding of earlier frames while the average bit rate still remains constant.
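The bit reservoir idea can be illustrated with a simplified sketch. It is not the procedure defined in [MPEG1]; the function name, the per-frame demands and the reservoir size are hypothetical values chosen for illustration.

```python
def allocate_with_reservoir(demands, mean_bits, reservoir_max):
    """Return the bits granted to each frame under a constant mean bit rate.

    Every frame receives mean_bits of new budget. Budget that a frame does
    not use is saved (up to reservoir_max) and may be spent by later frames.
    """
    reservoir = 0
    granted = []
    for demand in demands:
        available = mean_bits + reservoir
        bits = min(demand, available)
        reservoir = min(available - bits, reservoir_max)
        granted.append(bits)
    return granted


# Three easy frames followed by one transient frame with a peak demand.
demands = [600, 600, 600, 2000, 600]
print(allocate_with_reservoir(demands, mean_bits=1000, reservoir_max=2000))
# prints [600, 600, 600, 2000, 600]
print(allocate_with_reservoir(demands, mean_bits=1000, reservoir_max=0))
# prints [600, 600, 600, 1000, 600]
```

With a reservoir, the transient frame is served in full from bits saved during the three preceding frames. Without one, it is capped at the mean budget. A second transient arriving shortly afterwards would find the reservoir largely drained.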
Adaptive Window Switching
A different strategy used in many perceptual audio coders is adaptive window switching as introduced by Edler [Edl89]. This technique adapts the size of the filterbank windows to the characteristics of the input signal. While stationary signal parts will be coded using a long window length, short windows are used to code the transient parts of the signal. In this way, the peak bit demand can be reduced considerably because the region for which a high coding precision is needed is constrained in time. Pre-echoes are limited in duration implicitly by the shorter transform size.
Temporal Noise Shaping (TNS)
Temporal Noise Shaping (TNS) was introduced in [Her96] and achieves a temporal shaping of the quantization noise by applying open-loop predictive coding along frequency direction on time blocks in the spectral domain.
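Open-loop prediction along the frequency direction can be sketched in a deliberately reduced form. The example below uses a single first-order predictor over the spectral coefficients of one block; it is an illustration of the principle only, not the filter design of [Her96], and all function names are hypothetical.

```python
def first_order_coefficient(spectrum):
    """Predictor coefficient from the lag-0 and lag-1 correlation over frequency."""
    r0 = sum(x * x for x in spectrum)
    r1 = sum(a * b for a, b in zip(spectrum[1:], spectrum[:-1]))
    return r1 / r0 if r0 > 0 else 0.0


def analysis_filter(spectrum, a):
    """Encoder side: residual e[k] = X[k] - a * X[k-1], with k the frequency index."""
    return [x - a * (spectrum[k - 1] if k > 0 else 0.0)
            for k, x in enumerate(spectrum)]


def synthesis_filter(residual, a):
    """Decoder side: X[k] = e[k] + a * X[k-1], the inverse of analysis_filter."""
    out = []
    prev = 0.0
    for e in residual:
        prev = e + a * prev
        out.append(prev)
    return out


spectrum = [1.0, 0.9, 0.8, 0.7, 0.6]           # spectral coefficients of one block
a = first_order_coefficient(spectrum)
residual = analysis_filter(spectrum, a)         # this is what would be quantized
restored = synthesis_filter(residual, a)        # equals spectrum without quantization
```

In an open-loop arrangement the quantizer acts on the residual, so the quantization error passes through the synthesis filter in the decoder. Because the filter runs over frequency, this shapes the error in time.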
Gain Modification (Gain Control)
Another way to avoid the temporal spread of quantization noise is to apply a dynamic gain modification (gain control process) to the signal prior to calculating its spectral decomposition and coding.
The principle of this approach is illustrated in FIG. 12. The dynamics of the input signal is reduced by a gain modification (multiplicative pre-processing) prior to its encoding. In this way, “peaks” in the signal are attenuated prior to encoding. The parameters of the gain modification are transmitted in the bitstream. Using this information the process is reversed on the decoder side, i.e. after decoding another gain modification restores the original signal dynamics.
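The encoder-side attenuation and its decoder-side reversal can be sketched as follows. This is a minimal broadband illustration that assumes one gain value per fixed-size block and a hypothetical target peak level; the codec that would sit between the two steps is omitted, and practical schemes would additionally smooth and quantize the gains.

```python
def block_gains(signal, block_size, target_peak):
    """One gain per block: blocks whose peak exceeds target_peak are scaled down to it."""
    gains = []
    for start in range(0, len(signal), block_size):
        peak = max(abs(s) for s in signal[start:start + block_size])
        gains.append(min(1.0, target_peak / peak) if peak > 0 else 1.0)
    return gains


def apply_gains(signal, gains, block_size, invert=False):
    """Multiply each sample by its block gain, or divide by it when invert is True."""
    out = []
    for i, s in enumerate(signal):
        g = gains[i // block_size]
        out.append(s / g if invert else s * g)
    return out


signal = [0.01, -0.02, 0.01, 0.02, 0.9, -0.7, 0.4, -0.2]   # quiet block, then an attack
gains = block_gains(signal, block_size=4, target_peak=0.25)  # side information for the bitstream
flattened = apply_gains(signal, gains, block_size=4)         # encoder pre-processing
restored = apply_gains(flattened, gains, block_size=4, invert=True)  # decoder post-processing
```

Here the first block is left unchanged and the second block is attenuated so that its peak equals the target. Dividing by the same gains restores the original dynamics.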
[Lin93] proposed a gain control as an addition to a perceptual audio coder where the gain modification is performed on the time domain signal (and thus to the entire signal spectrum).
Frequency dependent gain modification/control has been used before in a number of instances:
Filter-based Gain Control: In his dissertation [Vau91], Vaupel notes that full-band gain control does not work well. In order to achieve a frequency-dependent gain control, he proposes a compressor and expander filter pair whose gain characteristics can be controlled dynamically. This scheme is shown in FIGS. 13a and 13b.
The variation of the filter's frequency response is shown in FIG. 13b. 
Gain Control With Hybrid Filterbank (illustrated in FIG. 14): In the SSR profile of the MPEG-2 Advanced Audio Coding [Bos96] scheme, gain control is used within a hybrid filterbank structure. A first filterbank stage (PQF) splits the input signal into four bands of equal width. Then, a gain detector and a gain modifier perform the gain control encoder processing. Finally, as a second stage, four separate MDCT filterbanks with a reduced size (256 instead of 1024) split the resulting signal further and produce the spectral components that are used for subsequent coding.
Guided envelope shaping (GES) is a tool contained in MPEG Surround that transmits channel-individual temporal envelope parameters and restores temporal envelopes on the decoder side. Note that, contrary to HREP processing, there is no envelope flattening on the encoder side in order to maintain backward compatibility on the downmix. Another tool in MPEG Surround that functions to perform envelope shaping is Subband Temporal Processing (STP). Here, low order LPC filters are applied within a QMF filterbank representation of the audio signals.
Related conventional technology is documented in Patent publications WO 2006/045373 A1, WO 2006/045371 A1, WO2007/042108 A1, WO 2006/108543 A1, or WO 2007/110101 A1.
A bit reservoir can help to handle peak demands on bitrate in a perceptual coder and thereby improve perceptual quality of transient signals. In practice, however, the size of the bit reservoir has to be unrealistically large in order to avoid artifacts when coding input signals of a very transient nature without further precautions.
Adaptive window switching limits the bit demand of transient parts of the signal and reduces pre-echoes by confining transients to short transform blocks. A limitation of adaptive window switching is given by its latency and repetition time: The fastest possible turn-around cycle between two short block sequences involves at least three blocks (“short”→“stop”→“start”→“short”, ca. 30.0-60.0 ms for typical block sizes of 512-1024 samples), which is much too long for certain types of input signals including applause. Consequently, temporal spread of quantization noise for applause-like signals could only be avoided by permanently selecting the short window size, which usually leads to a decrease in the coder's source-coding efficiency.
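The turn-around time quoted above follows from the block count and block size. The sketch below assumes a sampling rate of 48 kHz, which the text does not state at this point; the function name is illustrative.

```python
def turnaround_ms(block_size, sample_rate_hz, num_blocks=3):
    """Duration of num_blocks consecutive blocks of block_size samples, in milliseconds."""
    return 1000.0 * num_blocks * block_size / sample_rate_hz


for block_size in (512, 1024):
    print(block_size, round(turnaround_ms(block_size, 48000.0), 1))
# prints "512 32.0" and "1024 64.0"
```

Both values are several times longer than the average spacing of well below 10 ms between clap events in an applause signal.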
TNS performs temporal flattening in the encoder and temporal shaping in the decoder. In principle, arbitrarily fine temporal resolution is possible. In practice, however, the performance is limited by the temporal aliasing of the coder filterbank (typically an MDCT, i.e. an overlapping block transform with 50% overlap). Thus, the shaped coding noise appears also in a mirrored fashion at the output of the synthesis filterbank.
Broadband gain control techniques suffer from a lack of spectral resolution. In order to perform well for many signals, however, it is important that the gain modification processing can be applied independently in different parts of the audio spectrum because transient events are often dominant only in parts of the spectrum (in practice the events that are difficult to code are present mostly in the high frequency part of the spectrum). Effectively, applying a dynamic multiplicative modification of the input signal prior to its spectral decomposition in an encoder is equivalent to a dynamic modification of the filterbank's analysis window. Depending on the shape of the gain modification function the frequency response of the analysis filters is altered according to the composite window function. However, it is undesirable to widen the frequency response of the filterbank's low frequency filter channels because this increases the mismatch to the critical bandwidth scale.
Gain control using a hybrid filterbank has the drawback of increased computational complexity since the filterbank of the first stage has to achieve a considerable selectivity in order to avoid aliasing distortions after the subsequent split by the second filterbank stage. Also, the cross-over frequencies between the gain control bands are fixed at multiples of one quarter of the Nyquist frequency, i.e. 6, 12 and 18 kHz for a sampling rate of 48 kHz. For most signals, a first cross-over at 6 kHz is too high for good performance.
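The fixed cross-over frequencies follow directly from splitting the range up to the Nyquist frequency into four bands of equal width. A short calculation, with an illustrative function name:

```python
def pqf_crossovers_hz(sample_rate_hz, num_bands=4):
    """Band edges between num_bands equal-width bands spanning 0 Hz to the Nyquist frequency."""
    nyquist = sample_rate_hz / 2.0
    return [k * nyquist / num_bands for k in range(1, num_bands)]


print(pqf_crossovers_hz(48000.0))  # prints [6000.0, 12000.0, 18000.0]
```

Since the band edges depend only on the sampling rate, the lowest gain control band always covers the full range from 0 Hz to one quarter of the Nyquist frequency.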
Envelope shaping techniques contained in semi-parametric multi-channel coding solutions like MPEG Surround (STP, GES) are known to improve perceptual quality of transients through a temporal re-shaping of the output signal or parts thereof in the decoder. However, these techniques do not perform temporal flattening prior to the encoder. Hence, the transient signal still enters the encoder with its original short-time dynamics and imposes a high bitrate demand on the encoder's bit budget.