FIG. 1 represents, by way of illustration, a schematic diagram of the coding and decoding of a digital audio signal by transform, including overlap-add analysis-synthesis, according to the prior art.
Certain musical sequences, such as percussion, and certain speech segments, such as plosive consonants (/k/, /t/, etc.), are characterized by extremely abrupt onsets, which are reflected in very rapid transitions and a very strong variation of the dynamic range of the signal within a few samples. An exemplary transition is shown in FIG. 1, starting at sample 410.
For the coding/decoding processing, the input signal is subdivided into blocks of samples of length L, the boundaries of which are represented in FIG. 1 by vertical dotted lines. The input signal is denoted x(n), where n is the index of the sample. The breakdown into successive blocks (or frames) results in the definition of the blocks $X_N(n) = [x(N \cdot L) \ldots x(N \cdot L + L - 1)] = [x_N(0) \ldots x_N(L-1)]$, where N is the index of the block (or of the frame) and L is the length of the frame. In FIG. 1, L=160 samples. In the case of the modified discrete cosine transform MDCT, two blocks $X_N(n)$ and $X_{N+1}(n)$ are analyzed jointly to give a block of transformed coefficients associated with the frame of index N, and the analysis window is sinusoidal.
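The block subdivision and the joint analysis of two consecutive blocks can be sketched as follows. This is a minimal illustration, not the coder's implementation: the function names are hypothetical, and the exact window formula sin(π(n+0.5)/(2L)) is an assumption, since the text only states that the window is sinusoidal.

```python
import math

def split_into_blocks(x, L):
    """Return the blocks X_N = [x(N*L), ..., x(N*L + L - 1)] for each complete block."""
    return [x[N * L:(N + 1) * L] for N in range(len(x) // L)]

def sine_window(L):
    """Sinusoidal window over 2L samples (assumed formula, see lead-in)."""
    return [math.sin(math.pi * (n + 0.5) / (2 * L)) for n in range(2 * L)]

def windowed_pair(blocks, N, window):
    """Concatenate blocks N and N+1 and weight them by the analysis window.

    The result is the 2L-sample input that an MDCT would transform
    into the coefficients associated with frame N.
    """
    pair = blocks[N] + blocks[N + 1]
    return [w * s for w, s in zip(window, pair)]

L = 160                                   # frame length used in FIG. 1
x = [float(n) for n in range(4 * L)]      # placeholder signal, 4 frames
blocks = split_into_blocks(x, L)
window = sine_window(L)
frame_input = windowed_pair(blocks, 1, window)
```

With an overlap of L samples, each input sample is covered by two successive windows, which is why noise introduced in one transformed block spreads over 2L samples.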
The division into blocks, also called frames, applied by the transform coding is entirely independent of the sound signal, so transitions can appear at any point of the analysis window. After transform decoding, the reconstructed signal is affected by “noise” (or distortion) caused by the quantization (Q) and inverse quantization (Q⁻¹) operations. This coding noise is distributed in time in a relatively uniform manner over the entire temporal support of the transformed block, that is to say over the entire window of 2L samples (with an overlap of L samples). The energy of the coding noise is generally proportional to the energy of the block and depends on the coding/decoding bit rate.
For a block containing an onset (such as the block 320-480 of FIG. 1), the energy of the signal is high, so the noise level is also high.
In transform coding, the level of the coding noise is typically lower than that of the signal for the segments of high energy which immediately follow the transition, but the level is higher than that of the signal for the segments of lower energy, notably over the part preceding the transition (samples 160-410 of FIG. 1). For the abovementioned part, the signal-to-noise ratio is negative and the resulting degradation can appear very annoying when listening. Pre-echo is the name given to the coding noise prior to the transition and post-echo is the name given to the noise following the transition.
It can be seen in FIG. 1 that the pre-echo affects the frame preceding the transition as well as the frame where the transition occurs.
Psycho-acoustic experiments have shown that the human ear performs a temporal pre-masking of the sounds that is fairly limited, of the order of a few milliseconds. The noise preceding the onset, or pre-echo, is audible when the duration of the pre-echo is greater than the pre-masking duration.
The human ear also performs a post-masking of a longer duration, from 5 to 60 milliseconds, in the transition from sequences of high energy to sequences of low energy. The rate or level of discomfort that is acceptable for the post-echoes is therefore higher than for the pre-echoes.
The phenomenon of pre-echoes, which is the more critical of the two, becomes more annoying as the length of the blocks, in number of samples, increases. However, in transform coding, it is well known that for stationary signals the coding gain increases with the length of the transform. With a fixed sampling frequency and a fixed bit rate, if the number of points of the window (and therefore the length of the transform) is increased, there are more bits per frame to code the spectral lines deemed useful by the psycho-acoustic model, hence the benefit of using long blocks. MPEG AAC (Advanced Audio Coding), for example, uses a long window containing a fixed number of samples, 2048, that is a duration of 64 ms at a sampling frequency of 32 kHz; the problem of pre-echoes is managed there by making it possible to switch from these long windows to 8 short windows by way of intermediate windows (called transition windows), which requires a certain coding delay to detect the presence of a transition and adapt the windows. The length of these short windows is therefore 256 samples (8 ms at 32 kHz). At low bit rate, an audible pre-echo of a few ms may still remain: window switching attenuates the pre-echo but does not eliminate it. The transform coders used for conversational applications, such as ITU-T G.722.1, G.722.1C or G.719, often use a frame length of 20 ms and a window of 40 ms duration at 16, 32 or 48 kHz (respectively). It can be noted that the ITU-T G.719 coder incorporates a window switching mechanism with transient detection, but the pre-echo is not completely eliminated at low bit rate (typically at 32 kbit/s).
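The window durations quoted above follow directly from the number of samples and the sampling frequency. The short sketch below simply reproduces that arithmetic; the helper names are hypothetical.

```python
def duration_ms(num_samples, sampling_rate_hz):
    """Duration in milliseconds of num_samples at the given sampling rate."""
    return 1000.0 * num_samples / sampling_rate_hz

def samples_for(duration_in_ms, sampling_rate_hz):
    """Number of samples covered by a duration in milliseconds."""
    return round(duration_in_ms * sampling_rate_hz / 1000.0)

long_window_ms = duration_ms(2048, 32000)      # 64.0 ms
short_window_ms = duration_ms(2048 // 8, 32000)  # 8.0 ms for 256 samples

# 40 ms windows of the conversational coders at 16, 32 and 48 kHz
window_samples = {fs: samples_for(40, fs) for fs in (16000, 32000, 48000)}
```

A 40 ms window therefore spans 640, 1280 or 1920 samples depending on the sampling frequency, which is far longer than the few milliseconds of temporal pre-masking.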
In order to reduce the abovementioned annoying effect of the pre-echo phenomenon, different solutions have been proposed at the coder and/or decoder level.
The switching of windows has already been cited; it entails transmitting auxiliary information to identify the type of windows used in the current frame. Another solution consists in applying an adaptive filtering. In the zone preceding the onset, the reconstructed signal is seen as the sum of the original signal and of the quantization noise.
A corresponding filtering technique has been described in the article by Y. Mahieux and J. P. Petit entitled “High Quality Audio Transform Coding at 64 kbits,” IEEE Trans. on Communications, Vol. 42, No. 11, November 1994.
The implementation of such filtering entails the knowledge of parameters, some of which, like the prediction coefficients and the variance of the signal corrupted by the pre-echo, are estimated on the decoder from noisy samples. By contrast, the information such as the energy of the original signal can be known only to the coder and must consequently be transmitted. This entails transmitting additional information, which, with constrained bit rate, reduces the relative budget allocated to the transform coding. When the received block contains an abrupt variation of dynamic range, the filtering processing is applied to it.
The abovementioned filtering process does not make it possible to retrieve the original signal, but provides a strong reduction of the pre-echoes. It does however entail transmitting the additional parameters to the decoder.
Unlike the preceding solutions, different pre-echo reduction techniques without specific transmission of the information have been proposed. For example, a review of the reduction of pre-echoes in the context of hierarchical coding is presented in the article by B. Kovesi, S. Ragot, M. Gartner, H. Taddei, “Pre-echo reduction in the ITU-T G.729.1 embedded coder,” EUSIPCO, Lausanne, Switzerland, August 2008.
A typical example of pre-echo attenuation method without auxiliary information is described in the French patent application FR 08 56248. In this example, attenuation factors are determined per sub-block, in the sub-blocks of low energy preceding a sub-block in which a transition or onset has been detected.
The attenuation factor g(k) in the kth sub-block is computed, for example, as a function of the ratio R(k) between the energy of the sub-block of strongest energy and the energy of the kth sub-block concerned:
$$g(k) = f(R(k))$$
where f is a decreasing function with values between 0 and 1 and k is the number of the sub-block. Other definitions of the factor g(k) are possible, for example as a function of the energy En(k) in the current sub-block and of the energy En(k−1) in the preceding sub-block.
If the energy of the sub-blocks varies little relative to the maximum energy in the sub-blocks considered in the current frame, no attenuation is then necessary; the factor g(k) is set at an attenuation factor inhibiting the attenuation, that is to say 1. Otherwise, the attenuation factor lies between 0 and 1.
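The computation of per-sub-block attenuation factors from the energy ratio can be sketched as below. The function f is not specified in the text, so the step function and its thresholds (`r_low`, `r_high`, `g_mid`, `g_min`) are purely hypothetical choices for illustration, as is the assumption that the onset lies in the sub-block of maximum energy.

```python
def subblock_energies(frame, sub_len):
    """Energy (sum of squares) of each sub-block of sub_len samples."""
    return [sum(s * s for s in frame[i:i + sub_len])
            for i in range(0, len(frame), sub_len)]

def attenuation_from_ratio(R, r_low=16.0, r_high=32.0, g_mid=0.1, g_min=0.01):
    """Hypothetical non-increasing f(R) with values in (0, 1].

    A small ratio means the sub-block energy is close to the maximum,
    so the attenuation is inhibited (factor 1).
    """
    if R <= r_low:
        return 1.0
    if R <= r_high:
        return g_mid
    return g_min

def attenuation_factors(energies, eps=1e-12):
    """g(k) = f(R(k)) for sub-blocks preceding the strongest one, 1 elsewhere."""
    k_max = max(range(len(energies)), key=lambda k: energies[k])
    e_max = energies[k_max]
    factors = []
    for k, e in enumerate(energies):
        if k >= k_max:
            factors.append(1.0)  # onset sub-block and later: no attenuation
        else:
            factors.append(attenuation_from_ratio(e_max / max(e, eps)))
    return factors

# Low-energy sub-blocks followed by an onset in the fifth sub-block
g = attenuation_factors([1.0, 1.0, 1.0, 1.0, 100.0, 50.0, 20.0, 10.0])
```

When the energies vary little relative to the maximum, every ratio stays below the first threshold and all factors remain at 1.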
In most cases, above all when the pre-echo is annoying, the frame which precedes the pre-echo frame has a uniform energy which corresponds to the energy of a segment of low energy (typically a background noise). From experience, it is neither useful nor even desirable for, after pre-echo attenuation processing, the energy of the signal to become lower than the average energy (per sub-block) of the signal preceding the processing zone, typically that of the preceding frame, denoted $\overline{En}$, or that of the second half of the preceding frame, denoted $\overline{En}'$.
For the sub-block of index k to be processed, it is possible to compute the limit value of the attenuation factor, denoted $\mathrm{lim}_g(k)$, that yields exactly the same energy as the average energy per sub-block of the segment preceding the sub-block to be processed. This value is of course limited to a maximum of 1, since only attenuation values are of interest here. More specifically, the following is defined here:
$$\mathrm{lim}_g(k) = \min\left(\frac{\max(\overline{En}, \overline{En}')}{En(k)},\ 1\right)$$
in which the average energy of the preceding segment is approximated by the value $\max(\overline{En}, \overline{En}')$.
The value $\mathrm{lim}_g(k)$ thus obtained serves as the lower limit in the final computation of the attenuation factor of the sub-block, and is therefore used as follows:
$$g(k) = \max(g(k), \mathrm{lim}_g(k))$$
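The lower limit on the attenuation factor can be sketched as follows. The function and parameter names are hypothetical. The ratio is used exactly as written in the formula; whether a square root applies when the factor scales sample amplitudes is not specified here, so none is assumed.

```python
def gain_floor(en_k, avg_prev_frame, avg_prev_half, eps=1e-12):
    """Limit value lim_g(k): ratio of the reference energy to En(k), capped at 1.

    avg_prev_frame and avg_prev_half stand for the average sub-block energy
    of the preceding frame and of its second half, respectively.
    """
    reference = max(avg_prev_frame, avg_prev_half)
    return min(reference / max(en_k, eps), 1.0)

def limit_attenuation(g, energies, avg_prev_frame, avg_prev_half):
    """Apply g(k) = max(g(k), lim_g(k)) to every sub-block."""
    return [max(gk, gain_floor(ek, avg_prev_frame, avg_prev_half))
            for gk, ek in zip(g, energies)]

# A strong attenuation (0.01) is raised to the floor; a mild one (0.9) is kept
limited = limit_attenuation([0.01, 0.9], [4.0, 4.0],
                            avg_prev_frame=1.0, avg_prev_half=2.0)
```

The floor only ever raises a factor, so a sub-block whose energy is already at or below the reference level is left unattenuated.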
The attenuation factors (or gains) g(k) determined per sub-blocks can then be smoothed by a smoothing function applied sample by sample to avoid abrupt variations of the attenuation factor at the boundaries of the blocks.
For example, it is possible to first define the gain per sample as a piecewise constant function:
$$g_{pre}(n) = g(k), \quad n = kL', \ldots, (k+1)L' - 1$$
in which L′ represents the length of a sub-block.
The function is then smoothed according to the following equation:
$$g_{pre}(n) := \alpha\, g_{pre}(n-1) + (1-\alpha)\, g_{pre}(n), \quad n = 0, \ldots, L-1$$
with the convention that $g_{pre}(-1)$ is the last attenuation factor obtained for the last sample of the preceding sub-block, and where α is the smoothing coefficient, typically α=0.85.
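The two steps above (piecewise constant gain, then recursive smoothing) can be sketched as follows. The function names are hypothetical, and the default `g_prev=1.0` for the factor carried over from the previous samples is an assumption made only so that the example runs on its own.

```python
def expand_gains(g, sub_len):
    """Piecewise constant gain: g(k) repeated over the sub_len samples of sub-block k."""
    return [gk for gk in g for _ in range(sub_len)]

def smooth_recursive(g_pre, alpha=0.85, g_prev=1.0):
    """First-order recursive smoothing, sample by sample.

    Each output is alpha times the previous smoothed value plus
    (1 - alpha) times the current unsmoothed value; g_prev plays the
    role of g_pre(-1).
    """
    smoothed = []
    last = g_prev
    for value in g_pre:
        last = alpha * last + (1.0 - alpha) * value
        smoothed.append(last)
    return smoothed

# 8 sub-blocks of 80 samples: attenuation in sub-blocks 2 and 3 only
g = [1.0, 1.0, 0.1, 0.1, 1.0, 1.0, 1.0, 1.0]
g_smooth = smooth_recursive(expand_gains(g, 80))
```

The smoothed gain approaches each new sub-block value gradually, which avoids abrupt steps at the sub-block boundaries but also means it does not return to 1 immediately when the unsmoothed factor does.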
Other smoothing functions are also possible such as, for example, linear cross-fading over u samples:
$$g_{pre}(n) = \frac{1}{u} \sum_{i=0}^{u-1} g'_{pre}(n-i), \quad n = 0, \ldots, L-1$$
in which $g'_{pre}(n)$ is the non-smoothed attenuation and $g_{pre}(n)$ is the smoothed attenuation; the values $g'_{pre}(n)$ for n=−(u−1), …, −1 are the last u−1 attenuation factors obtained for the last samples of the preceding sub-block. It is for example possible to take u=5.
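The alternative smoothing, an average of the current and the u−1 preceding unsmoothed factors, can be sketched as follows. The function name and the `history` argument (holding the last u−1 unsmoothed factors of the preceding samples) are hypothetical.

```python
def smooth_moving_average(g_raw, history, u=5):
    """Average each unsmoothed factor with the u - 1 factors that precede it.

    history must hold the unsmoothed factors for n = -(u-1), ..., -1.
    """
    if len(history) != u - 1:
        raise ValueError("history must contain exactly u - 1 values")
    padded = list(history) + list(g_raw)
    # The unsmoothed factor at index n sits at padded[n + u - 1], so the
    # window covering n - u + 1 .. n is padded[n : n + u].
    return [sum(padded[n:n + u]) / u for n in range(len(g_raw))]

# A step from 0 to 1 becomes a linear ramp over u = 5 samples
ramp = smooth_moving_average([1.0] * 6, history=[0.0] * 4, u=5)
```

Unlike the recursive smoothing, this average reaches the new value exactly after u samples.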
Once the $g_{pre}(n)$ factors are thus computed, the attenuation of pre-echoes is done on the signal reconstructed in the current frame, $x_{rec}(n)$, by multiplying each sample by the corresponding factor:
$$x_{rec,g}(n) = g_{pre}(n)\, x_{rec}(n), \quad n = 0, \ldots, L-1$$
where $x_{rec,g}(n)$ is the signal decoded and post-processed by pre-echo reduction.
FIGS. 2 and 3 illustrate the implementation of the attenuation method as described in the abovementioned, and previously summarized, prior art patent application.
In these examples, the signal is sampled at 32 kHz, the length of the frame is L=640 samples and each frame is divided into 8 sub-blocks of K=80 samples.
In the part a) of FIG. 2, a frame of an original signal sampled at 32 kHz is represented. An onset (or transition) in the signal is located in the sub-block beginning at the index 320. This signal has been coded by a transform coder of MDCT type at low bit rate (24 kbit/s).
In the part b) of FIG. 2, the result of the decoding without pre-echo processing is illustrated. The pre-echo can be observed from the sample 160, in the sub-blocks preceding the one containing the onset.
The part c) shows the trend of the pre-echo attenuation factor (continuous line) obtained by the method described in the abovementioned prior art patent application. The dotted line represents the factor before smoothing. It should be noted here that the position of the onset is estimated around the sample 380 (in the block delimited by the samples 320 and 400).
The part d) illustrates the result of the decoding after application of the pre-echo processing (multiplication of the signal b) with the signal c)). It can be seen that the pre-echo has indeed been attenuated. FIG. 2 also shows that the smoothed factor does not go back to 1 at the time of the onset, which implies a decrease in the amplitude of the onset. The perceptible impact of this decrease is very small but can nevertheless be avoided. FIG. 3 illustrates the same example as FIG. 2, in which, before smoothing, the attenuation factor value is forced to 1 for the few samples of the sub-block preceding the sub-block where the onset is located. The part c) of FIG. 3 gives an example of such a correction.
In this example, the factor value 1 has been assigned to the last 16 samples of the sub-block preceding the onset, from the index 364. Thus, the smoothing function progressively increases the factor to have a value close to 1 at the time of the onset. The amplitude of the onset is then preserved, as illustrated in the part d) of FIG. 3, but a few pre-echo samples are not attenuated.
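The effect of forcing the factor to 1 shortly before the onset can be illustrated with the figures given above (onset near sample 380, last 16 samples forced from index 364, α=0.85). This is a sketch only: the unsmoothed gain profile and its value 0.1 in the pre-echo zone are hypothetical, as are the function names.

```python
def force_unity_before_onset(g_raw, onset, num_forced=16):
    """Set the unsmoothed factor to 1 for the num_forced samples preceding the onset."""
    start = max(onset - num_forced, 0)
    return g_raw[:start] + [1.0] * (onset - start) + g_raw[onset:]

def smooth(g_raw, alpha=0.85, g_prev=1.0):
    """First-order recursive smoothing of the per-sample factor."""
    out, last = [], g_prev
    for value in g_raw:
        last = alpha * last + (1.0 - alpha) * value
        out.append(last)
    return out

onset = 380
# Hypothetical unsmoothed profile: attenuation of 0.1 from sample 160 up to the onset
g_raw = [1.0] * 160 + [0.1] * (onset - 160) + [1.0] * (640 - onset)

plain = smooth(g_raw)                                    # no correction
forced = smooth(force_unity_before_onset(g_raw, onset))  # factor forced to 1 from 364
```

Without the correction the smoothed factor is still far below 1 at the onset; with it, the factor has had 16 samples to rise and is close to 1 when the onset arrives, at the cost of leaving those 16 samples largely unattenuated.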
In the example of FIG. 3, the pre-echo reduction by attenuation does not make it possible to reduce the pre-echo right up to the onset, because of the smoothing of the gain.
Another example with the same setting as that of FIG. 3 is illustrated in FIG. 4. This figure represents 2 frames to better show the nature of the signal before the onset. Here, the energy of the original signal before the onset (part a)) is stronger than in the case illustrated by FIG. 3, and the signal before the onset is audible (samples 0-850). In the part b), the pre-echo on the signal decoded without pre-echo processing can be observed in the 700-850 zone. According to the attenuation limiting procedure explained previously, the energy of the signal in the pre-echo zone is attenuated to the average energy of the signal preceding the processing zone. In the part c), it can be seen that the attenuation factor computed by taking account of the energy limitation is close to 1, and in the part d) that the pre-echo is still present after application of the pre-echo processing (multiplication of the signal b) with the signal c)), despite the correct leveling of the signal in the pre-echo zone. This pre-echo can in fact be clearly distinguished on the waveform, where it can be seen that a high-frequency component is superposed on the signal in this zone.
This high-frequency component is clearly audible and annoying, and the onset is less clear (part d) FIG. 4).
The explanation of this phenomenon is as follows: in the case of a very abrupt, impulsive onset (as illustrated in FIG. 4), the spectrum of the signal (in the frame containing the onset) is whiter and therefore also contains a lot of high frequencies. The quantization noise is thus also spread out, relatively flat in frequency (white) and contains high frequencies, which is not the case of the signal preceding the pre-echo zone. There is therefore an abrupt change in the spectrum from one frame to the next, which results in an audible pre-echo despite the fact that the energy has been set to the correct level.
This phenomenon is again represented in FIGS. 5a and 5b. FIG. 5a shows the spectrogram of the original signal, corresponding to the signal represented in part a) of FIG. 4, and FIG. 5b shows the spectrogram of the signal with pre-echo attenuation according to the prior art, corresponding to the signal represented in part d) of FIG. 4.
A still audible pre-echo can clearly be seen in the framed part in FIG. 5b. 
There is therefore a need for an improved technique for attenuating pre-echoes in decoding, which makes it possible to attenuate the undesirable high frequencies and, more generally, the spurious pre-echoes precisely and universally and without any auxiliary information being transmitted by the coder.