FIG. 1 shows, by way of illustration, a basic diagram of the transform-based coding and decoding of a digital audio signal, including an analysis-synthesis by overlap-add, according to the prior art.
Certain musical sequences, such as percussion, and certain speech segments, such as plosives (/k/, /t/, ...), are characterized by extremely abrupt attacks, which show up as very fast transitions and a very strong variation in the dynamics of the signal within the space of a few samples. An exemplary transition is shown in FIG. 1 from sample 410 onward.
For the coding/decoding processing, the input signal is split up into blocks of samples of length L, represented in FIG. 1 by dotted vertical lines. The input signal is denoted x(n), where n is the index of the sample. The slicing into successive blocks leads to the blocks being defined by X_N(n) = [x(N·L) ... x(N·L+L−1)] = [x_N(0) ... x_N(L−1)], where N is the index of the frame and L is the length of the frame. In FIG. 1, L=160 samples. In the case of the modified discrete cosine transform (MDCT), two blocks X_N(n) and X_{N+1}(n) are analyzed jointly to give a block of transformed coefficients associated with the frame of index N.
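The slicing described above can be sketched as follows. This is a minimal illustration only: the function names and the placeholder signal are assumptions, and the MDCT itself (windowing and transform) is not implemented, only the pairing of blocks X_N and X_{N+1} into a 2L-sample analysis segment.

```python
def split_into_frames(x, frame_len):
    """Return the complete frames X_N = [x(N*L) ... x(N*L + L - 1)]."""
    n_frames = len(x) // frame_len
    return [x[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def mdct_analysis_blocks(frames):
    """Pair frame N with frame N+1 to form the 2L-sample block analyzed for frame N."""
    return [frames[i] + frames[i + 1] for i in range(len(frames) - 1)]


L = 160                      # frame length used in FIG. 1
x = list(range(4 * L))       # placeholder signal: each sample equals its index
frames = split_into_frames(x, L)
blocks = mdct_analysis_blocks(frames)
```

With this placeholder signal, the second analysis block covers samples 160 to 479, so a transition at sample 410 falls inside a block that starts 250 samples before it.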
The division into blocks, also called frames, performed by the transform-based coding is totally independent of the sound signal, and the transitions can therefore appear at any point of the analysis window. After transform-based decoding, however, the reconstructed signal is marred by "noise" (or distortion) caused by the quantization (Q) and inverse quantization (Q⁻¹) operation. This coding noise is distributed temporally in a relatively uniform manner over the whole of the temporal support of the transformed block, that is to say over the whole length of the window of 2L samples (with overlap of L samples). The energy of the coding noise is in general proportional to the energy of the block and depends on the coding/decoding bitrate.
For a block comprising an attack (such as the block 320-480 of FIG. 1) the energy of the signal is high, the noise is therefore also of high level.
In transform-based coding, the level of the coding noise is typically below that of the signal for the high-energy segments which immediately follow the transition, but the level is above that of the signal for the segments of lower energy, especially over the part preceding the transition (samples 160-410 of FIG. 1). For the aforementioned part, the signal-to-noise ratio is negative and the resulting degradation can appear very annoying during listening. The coding noise prior to the transition is called pre-echo and the noise posterior to the transition is called post-echo.
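The sign change of the signal-to-noise ratio around the transition can be illustrated numerically. The sketch below assumes hypothetical amplitudes (0.01 before the attack, 1.0 after it, and a coding noise of constant amplitude 0.1 spread over the whole 2L window); these values are chosen for illustration and are not taken from FIG. 1.

```python
import math


def energy(samples):
    return sum(s * s for s in samples)


def snr_db(signal, noise):
    """Signal-to-noise ratio in dB over one segment."""
    return 10.0 * math.log10(energy(signal) / energy(noise))


window_len = 320   # 2L with L = 160
attack = 250       # hypothetical position of the transition inside the window

# Low-level signal before the attack, high-level signal after it.
signal = [(0.01 if n < attack else 1.0) * (-1) ** n for n in range(window_len)]
# Coding noise of uniform level over the whole window (assumed amplitude).
noise = [0.1 * (-1) ** (n // 2) for n in range(window_len)]

snr_before = snr_db(signal[:attack], noise[:attack])   # about -20 dB
snr_after = snr_db(signal[attack:], noise[attack:])    # about +20 dB
```

The same noise level that is masked by the attack itself dominates the low-energy segment preceding it, which is the pre-echo.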
It may be observed in FIG. 1 that the pre-echo affects the frame preceding the transition as well as the frame where the transition occurs.
Psychoacoustic experiments have shown that the human ear performs only a fairly limited temporal pre-masking of sounds, of the order of a few milliseconds. The noise preceding the attack, or pre-echo, is audible when the duration of the pre-echo is greater than the duration of the pre-masking.
The human ear also performs a post-masking of longer duration, from 5 to 60 milliseconds, when passing from high-energy sequences to low-energy sequences. The rate or level of annoyance which is acceptable for the post-echoes is therefore higher than for the pre-echoes.
The phenomenon of pre-echoes, which is the more critical of the two, becomes more annoying as the length of the blocks in number of samples increases. However, in transform-based coding, it is well known that for stationary signals the coding gain increases with the length of the transform. At fixed sampling frequency and fixed bitrate, if the number of points of the window (and therefore the length of the transform) is increased, more bits per frame are available to code the frequency spectral lines deemed useful by the psychoacoustic model, hence the advantage of using long blocks. MPEG AAC (Advanced Audio Coding), for example, uses a long window containing a fixed number of samples, 2048, i.e. a duration of 64 ms at a sampling frequency of 32 kHz. The problem of pre-echoes is managed therein by switching from these long windows to 8 short windows by way of intermediate (transition) windows, which requires a certain delay at the coder to detect the presence of a transition and adapt the windows. The length of each short window is therefore 8 ms. At low bitrate an audible pre-echo of a few ms can still remain: switching the windows attenuates the pre-echo but does not remove it. The transform-based coders used for conversational applications, such as ITU-T G.722.1, G.722.1C or G.719, often use a window of duration 40 ms at 16, 32 or 48 kHz (respectively) and a frame length of 20 ms. It may be noted that the ITU-T G.719 coder integrates a mechanism for switching windows with transient detection; however, the pre-echo is not completely removed at low bitrate (typically 32 kbit/s).
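The durations quoted above follow from simple arithmetic, which can be checked with the short sketch below. The helper names are hypothetical; the sample counts and sampling frequencies are those given in the text.

```python
def window_duration_ms(num_samples, sampling_rate_hz):
    """Duration in milliseconds of a window of num_samples samples."""
    return 1000.0 * num_samples / sampling_rate_hz


def samples_for(duration_ms, sampling_rate_hz):
    """Number of samples covered by duration_ms at the given sampling rate."""
    return sampling_rate_hz * duration_ms // 1000


# Long window of 2048 samples at 32 kHz, split into 8 short windows.
aac_long_ms = window_duration_ms(2048, 32000)        # 64.0
aac_short_ms = window_duration_ms(2048 // 8, 32000)  # 8.0

# 40 ms window and 20 ms frame at 16, 32 and 48 kHz: (window, frame) in samples.
conversational = {
    fs: (samples_for(40, fs), samples_for(20, fs))
    for fs in (16000, 32000, 48000)
}
```

Even the 8 ms short window exceeds a pre-masking duration of a few milliseconds, which is consistent with the remark that window switching attenuates the pre-echo without removing it.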
With the aim of reducing the aforementioned annoying effect of the phenomenon of pre-echoes, various solutions have been proposed at the coder and/or decoder level.
The switching of windows was cited above. Another solution consists in applying an adaptive filtering. In the zone preceding the attack, the reconstructed signal is viewed as the sum of the original signal and of the quantization noise.
A corresponding filtering technique is described in the article "High Quality Audio Transform Coding at 64 kbits," by Y. Mahieux and J. P. Petit, IEEE Trans. on Communications, Vol. 42, No. 11, November 1994.
The implementation of such filtering requires the knowledge of parameters, some of which, like the prediction coefficients and the variance of the signal corrupted by the pre-echo, are estimated at the decoder on the basis of the noisy samples. On the other hand, information such as the energy of the original signal can be known only at the coder and must consequently be transmitted. This makes it necessary to transmit additional information, which at constrained bitrate decreases the relative budget allocated to the transform-based coding. The filtering processing is applied to a received block when that block contains an abrupt variation in dynamics.
The aforementioned filtering process does not make it possible to retrieve the original signal, but affords a large reduction in the pre-echoes. However, it requires that the additional parameters be transmitted to the decoder.
Various pre-echo reduction techniques without specific transmission of information have been proposed. For example, a review of the reduction of pre-echoes in the context of hierarchical coding is presented in the article B. Kövesi, S. Ragot, M. Gartner, H. Taddei, “Pre-echo reduction in the ITU-T G.729.1 embedded coder,” EUSIPCO, Lausanne, Switzerland, August 2008.
A typical example of a method of attenuating pre-echoes is described in French patent application FR 08 56248. In this example, attenuation factors are determined per sub-block, in the low-energy sub-blocks preceding a sub-block in which a transition or attack has been detected.
The attenuation factor per sub-block g(k) is calculated, for example, as a function of the ratio R(k) of the energy of the sub-block of highest energy to the energy of the k-th sub-block in question:
g(k) = ƒ(R(k))
where ƒ is a decreasing function with values between 0 and 1 and k is the sub-block number. Other definitions of the factor g(k) are possible, for example as a function of the energy En(k) in the current sub-block and of the energy En(k−1) in the previous sub-block.
If the variation of the energy with respect to the maximum energy is low, no attenuation is then necessary. The factor g(k) is then fixed at an attenuation value which inhibits attenuation, that is to say 1. Otherwise, the attenuation factor lies between 0 and 1.
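The calculation of g(k) can be sketched as follows. The text does not specify the function ƒ, so the one used here (gain 1 up to a ratio threshold of 16, then a square-root decay) is purely hypothetical, as are the function names; only the general shape, a decreasing function of R(k) with values between 0 and 1, comes from the description.

```python
import math


def subblock_energies(frame, sub_len):
    """Energy En(k) of each sub-block of sub_len samples."""
    return [sum(s * s for s in frame[i:i + sub_len])
            for i in range(0, len(frame), sub_len)]


def attenuation_factor(ratio, threshold=16.0):
    """Hypothetical decreasing f: 1 for small ratios, sqrt(threshold / ratio) above."""
    if ratio <= threshold:
        return 1.0   # low energy variation: attenuation is inhibited
    return math.sqrt(threshold / ratio)


def gains_before_attack(energies):
    """g(k) = f(R(k)) for the sub-blocks preceding the highest-energy sub-block."""
    e_max = max(energies)
    attack_idx = energies.index(e_max)
    gains = [1.0] * len(energies)
    for k in range(attack_idx):
        ratio = e_max / max(energies[k], 1e-12)   # guard against silent sub-blocks
        gains[k] = attenuation_factor(ratio)
    return gains


energies = [1.0, 1.0, 4.0, 400.0, 100.0]   # illustrative sub-block energies
gains = gains_before_attack(energies)      # close to [0.2, 0.2, 0.4, 1.0, 1.0]
```

Only the sub-blocks preceding the attack are attenuated; the sub-block containing the attack and those following it keep a factor of 1.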
In most cases, especially when the pre-echo is annoying, the frame which precedes the pre-echo frame has a homogeneous energy which corresponds to the energy of a segment of low energy (typically, background noise). According to experiment it is not useful nor even desirable that after the pre-echo attenuation processing the energy of the signal should be below the average energy per sub-block of the signal preceding the processing zone (typically that of the previous frame En or that of the second half of the previous frame En′).
For the sub-block k to be processed it is possible to calculate the limit value of the factor limg(k) so as to obtain exactly the same energy as the average energy per sub-block of the segment preceding the sub-block to be processed. This value is of course limited to a maximum of 1 since we are concerned here with the attenuation values. More precisely:
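$$\mathrm{lim}_g(k) = \min\left( \sqrt{\frac{\max\left(\overline{En},\ \overline{En}'\right)}{En(k)}},\ 1 \right)$$
where the average energy of the previous segment is approximated by $\max(\overline{En}, \overline{En}')$, $\overline{En}$ and $\overline{En}'$ being the average energies per sub-block of the previous frame and of the second half of the previous frame, respectively.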
The value limg(k) thus obtained serves as a lower limit in the final calculation of the sub-block attenuation factor:
g(k) = max(g(k), limg(k))
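The limitation of the attenuation can be sketched as follows. The function and parameter names are hypothetical. The limit is taken here as the square root of the energy ratio, on the assumption that the factor multiplies sample amplitudes, so that the attenuated sub-block ends up with the reference energy.

```python
import math


def limit_gain(en_k, en_prev_frame, en_prev_half):
    """Limit value of the factor for a sub-block of energy en_k.

    en_prev_frame and en_prev_half are the average energies per sub-block of
    the previous frame and of its second half; the larger one is the reference.
    """
    reference = max(en_prev_frame, en_prev_half)
    if en_k <= 0.0:
        return 1.0   # nothing to attenuate in a silent sub-block
    return min(math.sqrt(reference / en_k), 1.0)


def limited_attenuation(g_k, en_k, en_prev_frame, en_prev_half):
    """Final factor: g(k) = max(g(k), limg(k))."""
    return max(g_k, limit_gain(en_k, en_prev_frame, en_prev_half))


g_final = limited_attenuation(0.2, en_k=4.0, en_prev_frame=1.0, en_prev_half=0.8)
energy_after = 4.0 * g_final ** 2   # energy of the sub-block after attenuation
```

In this illustrative case the initial factor 0.2 is raised to 0.5, so the sub-block energy is brought down from 4.0 to the reference energy 1.0 and not below it.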
The attenuation factors (or gains) g(k) determined per sub-block are thereafter smoothed by a smoothing function applied sample by sample to avoid abrupt variations of the attenuation factor at the boundaries of the blocks.
For example, it is firstly possible to define the gain per sample as a piecewise constant function:
gpre(n) = g(k), n = kL′, ..., (k+1)L′−1
where L′ represents the length of a sub-block. The function is thereafter smoothed according to the following equation:
gpre(n) := α gpre(n−1) + (1−α) gpre(n), n = 0, ..., L−1
with the convention that gpre(−1) is the last attenuation factor obtained for the last sample of the previous sub-block, and α is the smoothing coefficient, typically α=0.85.
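The two steps above (piecewise constant gain, then recursive smoothing) can be sketched as follows. The function names are hypothetical, and the default initial value of 1 for gpre(−1) is an assumption made for the example.

```python
def expand_gains(sub_gains, sub_len):
    """Piecewise constant gain: gpre(n) = g(k) for n in sub-block k."""
    return [g for g in sub_gains for _ in range(sub_len)]


def smooth_gains(g_pre, alpha=0.85, previous=1.0):
    """Recursive smoothing gpre(n) := alpha * gpre(n-1) + (1 - alpha) * gpre(n).

    `previous` plays the role of gpre(-1), the last smoothed factor of the
    preceding sub-block.
    """
    smoothed = []
    last = previous
    for g in g_pre:
        last = alpha * last + (1.0 - alpha) * g
        smoothed.append(last)
    return smoothed


sub_gains = [1.0, 0.2]                           # illustrative factors g(0), g(1)
g_pre = expand_gains(sub_gains, sub_len=80)      # 160 samples, step at n = 80
g_smooth = smooth_gains(g_pre)                   # step replaced by a gradual decay
```

With α=0.85 the smoothed factor is about 0.88 on the first sample of the second sub-block and then converges towards 0.2, which avoids an abrupt gain change at the sub-block boundary.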
Other smoothing functions are also possible. Once the factors gpre(n) have been calculated thus, the pre-echo attenuation is carried out on the reconstructed signal of the current frame, xrec(n), by multiplying each sample by the corresponding factor:
xrec,g(n) = gpre(n) xrec(n), n = 0, ..., L−1
where xrec,g(n) is the signal decoded and post-processed by the pre-echo reduction.
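The sample-by-sample multiplication can be sketched as below; the function name and the four-sample signal are hypothetical and serve only to illustrate the operation.

```python
def apply_pre_echo_gains(x_rec, g_pre):
    """Return xrec,g(n) = gpre(n) * xrec(n) for every sample of the frame."""
    if len(x_rec) != len(g_pre):
        raise ValueError("one gain per sample is required")
    return [g * x for g, x in zip(g_pre, x_rec)]


x_rec = [0.5, -0.5, 0.5, 2.0]   # illustrative decoded samples, attack on the last one
g_pre = [0.5, 0.5, 1.0, 1.0]    # illustrative smoothed factors
x_rec_g = apply_pre_echo_gains(x_rec, g_pre)
```

Samples with a factor of 1 are left untouched, so only the zone preceding the attack is modified.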
FIGS. 2 and 3 illustrate the implementation of the attenuation method as described in the aforementioned patent application of the prior art and as summarized above.
In these examples the signal is sampled at 32 kHz, the length of the frame is L=640 samples and each frame is divided into 8 sub-blocks of K=80 samples.
In part a) of FIG. 2, a frame of an original signal sampled at 32 kHz, is represented. An attack (or transition) in the signal is situated in the sub-block beginning at the index 320. This signal has been coded by a transform-based coder of low-bitrate (24 kbit/s) MDCT type.
In part b) of FIG. 2, the result of the decoding without pre-echo processing is illustrated. The pre-echo can be observed from sample 160 onward, in the sub-blocks preceding the one containing the attack.
Part c) shows the evolution of the pre-echo attenuation factor (continuous line) obtained by the method described in the aforementioned patent application of the prior art. The dashed line represents the factor before smoothing. It is noted here that the position of the attack is estimated around sample 380 (in the block delimited by samples 320 and 400).
Part d) illustrates the result of the decoding after application of the pre-echo processing (multiplication of the signal b) with the signal c)). It is seen that the pre-echo has indeed been attenuated. FIG. 2 also shows that the smoothed factor does not go back to 1 at the moment of the attack, thus implying a decrease in the amplitude of the attack. The perceptible impact of this decrease is very small but can nonetheless be avoided. FIG. 3 illustrates the same example as FIG. 2, in which, before smoothing, the attenuation factor value is forced to 1 for the few samples of the sub-block preceding the sub-block where the attack is situated. Part c) of FIG. 3 gives an example of such a correction.
In this example the factor value 1 has been assigned to the last 16 samples of the sub-block preceding the attack, from index 364 onward. The smoothing function thus progressively increases the factor so that it has a value close to 1 at the moment of the attack. The amplitude of the attack is then preserved, as illustrated in part d) of FIG. 3; on the other hand, a few pre-echo samples are not attenuated.
In the example of FIG. 3, because of the smoothing of the gain, the pre-echo reduction by attenuation cannot reduce the pre-echo right up to the attack.
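The effect of this correction on the smoothed factor can be sketched as follows. The setting is simplified and hypothetical: a constant factor of 0.2 is assumed over the whole zone preceding the attack, the attack is placed at index 380, and α=0.85 is used for the smoothing; only the 16 forced samples come from the example above.

```python
def smooth(gains, alpha=0.85, previous=1.0):
    """Recursive smoothing of the per-sample factors."""
    out, last = [], previous
    for g in gains:
        last = alpha * last + (1.0 - alpha) * g
        out.append(last)
    return out


def smoothed_gain_at_attack(attack_idx, low_gain, forced_samples, alpha=0.85):
    """Smoothed factor on the attack sample, with or without the correction."""
    # Factor before smoothing: low_gain before the attack, 1 on the attack sample.
    gains = [low_gain] * attack_idx + [1.0]
    # Correction: force the factor to 1 on the last samples before the attack.
    for n in range(attack_idx - forced_samples, attack_idx):
        gains[n] = 1.0
    return smooth(gains, alpha, previous=low_gain)[attack_idx]


uncorrected = smoothed_gain_at_attack(380, 0.2, forced_samples=0)    # about 0.32
corrected = smoothed_gain_at_attack(380, 0.2, forced_samples=16)     # about 0.95
```

Under these assumptions the attack sample is scaled by roughly 0.32 without the correction and by roughly 0.95 with it, at the cost of leaving the 16 forced samples only partially attenuated.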
Another example with the same setting as that of FIG. 3 is illustrated in FIG. 4. This figure represents 2 frames so as to better show the nature of the signal before the attack. Here, the energy of the original signal before the attack is higher (part a)) than in the case illustrated by FIG. 3, and the signal before the attack is audible (samples 0-850). In part b) it is possible to observe the pre-echo on the decoded signal without pre-echo processing in the zone 700-850. According to the procedure for limiting the attenuation explained previously, the energy of the signal of the pre-echo zone is attenuated as far as the average energy of the signal preceding the processing zone. It is observed in part c) that the attenuation factor calculated by taking account of the energy limitation is close to 1 and that the pre-echo is still present in part d) after application of the pre-echo processing (multiplication of the signal b) with the signal c)), despite the fact that the signal has been set to the right level in the pre-echo zone. It is indeed possible to clearly distinguish this pre-echo on the waveform where it is noted that a high-frequency component is superimposed on the signal in this zone.
This high-frequency component is clearly audible and annoying, and the attack is not as sharp (part d) of FIG. 4).
The explanation for this phenomenon is the following: in the case of a very abrupt, impulsive attack (as illustrated in FIG. 4) the spectrum of the signal (in the frame containing the attack) is rather white and therefore also contains many high frequencies. Thus the quantization noise is also white and composed of high frequencies, this not being the case for the signal preceding the pre-echo zone. There is therefore an abrupt change in the spectrum from one frame to the other, which results in an audible pre-echo despite the fact that the energy has been set to the right level.
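The spectral argument can be checked on a toy example. The sketch below uses a plain DFT on two hypothetical 64-sample signals, an isolated impulse and a low-frequency sinusoid, and measures the share of spectral energy located in the upper half of the band; the signals and the function names are assumptions made for illustration.

```python
import cmath
import math


def dft_magnitudes(x):
    """Magnitudes of the DFT bins 0 .. N/2 of a real signal."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]


def high_band_energy_fraction(x):
    """Share of the spectral energy located in the upper half of the bins."""
    energies = [m * m for m in dft_magnitudes(x)]
    split = len(energies) // 2
    return sum(energies[split:]) / sum(energies)


n = 64
impulse = [1.0 if t == 40 else 0.0 for t in range(n)]            # abrupt attack
tone = [math.sin(2.0 * math.pi * 2 * t / n) for t in range(n)]   # low-frequency signal

impulse_hf = high_band_energy_fraction(impulse)   # flat spectrum: about half the energy
tone_hf = high_band_energy_fraction(tone)         # practically zero
```

The impulse has the same magnitude in every bin, so quantization noise shaped by such a frame brings high frequencies into a zone where the preceding signal contains almost none.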
This phenomenon is again represented in FIGS. 5a and 5b, which show respectively the spectrogram of the original signal, corresponding to the signal represented in part a) of FIG. 4, and the spectrogram of the signal with attenuation of pre-echoes according to the prior art, corresponding to the signal represented in part d) of FIG. 4.
A pre-echo that is still audible can clearly be seen in the part outlined in FIG. 5b.
There therefore exists a need for a technique for improved attenuation of pre-echoes on decoding, which makes it possible to also attenuate the undesirable high frequencies or spurious pre-echoes, doing so without any auxiliary information being transmitted by the coder.