For the transmission of digital audio signals over telecommunication networks, whether fixed or mobile networks for example, or for the storage of the signals, compression (or source coding) processes are used that implement coding systems which are generally of the linear prediction time coding or transform frequency coding type.
The field of application of the method and the device that are the subjects of the invention is therefore the compression of sound signals, in particular digital audio signals coded by frequency transform.
FIG. 1 represents, by way of illustration, a theoretical block diagram of the coding and the decoding of a digital audio signal by transform including an overlap/addition analysis-synthesis according to the prior art.
Some musical sequences, such as percussion, and certain speech segments, such as plosives (/k/, /t/, . . . ), are characterized by extremely abrupt onsets which are reflected by very rapid transitions and a very strong variation of the dynamic range of the signal in the space of a few samples. One example of a transition is given in FIG. 1, starting at sample 410.
For the coding/decoding processing, the input signal is decomposed into blocks of samples of length L whose boundaries are represented in FIG. 1 by vertical dotted lines. The input signal is denoted x(n), in which n is the index of the sample. The breakdown into successive blocks (or frames) defines the blocks
$$X_N(n) = [x(N \cdot L) \,\ldots\, x(N \cdot L + L - 1)] = [x_N(0) \,\ldots\, x_N(L-1)]$$
where N is the index of the block (or of the frame) and L is the length of the frame. In FIG. 1, L=160 samples. In the case of the modified discrete cosine transform (MDCT), two blocks $X_N(n)$ and $X_{N+1}(n)$ are analyzed jointly to give a block of transformed coefficients associated with the frame of index N, and the analysis window is sinusoidal.
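The block decomposition and the joint analysis of two consecutive blocks can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, the placeholder signal is arbitrary, and the sine window formula is the usual MDCT choice, assumed here because the text only states that the window is sinusoidal.

```python
import math

def split_into_frames(x, frame_len):
    """Return the complete frames X_N = [x(N*L), ..., x(N*L + L - 1)]."""
    n_frames = len(x) // frame_len
    return [x[N * frame_len:(N + 1) * frame_len] for N in range(n_frames)]

def sine_window(frame_len):
    """Sine window over 2L samples (a common MDCT window, assumed here)."""
    two_l = 2 * frame_len
    return [math.sin(math.pi * (n + 0.5) / two_l) for n in range(two_l)]

def windowed_segments(frames, window):
    """Join frames N and N+1 and weight them; each result would feed one MDCT."""
    segments = []
    for N in range(len(frames) - 1):
        joint = frames[N] + frames[N + 1]
        segments.append([w * s for w, s in zip(window, joint)])
    return segments

L = 160                                   # frame length used in FIG. 1
x = [float(n) for n in range(4 * L)]      # placeholder signal, 4 frames
frames = split_into_frames(x, L)
segments = windowed_segments(frames, sine_window(L))
```

Each segment spans 2L samples and overlaps its neighbor by L samples, which is why a transition located anywhere in those 2L samples influences the whole transformed block.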
The division into blocks, also called frames, applied by the transform coding is totally independent of the sound signal and the transitions can therefore appear at any point of the analysis window. Now, after transform decoding, the reconstructed signal is affected by "noise" (or distortion) generated by the quantization (Q) and inverse quantization (Q⁻¹) operation. This coding noise is temporally distributed relatively uniformly over the entire temporal support of the transformed block, that is to say over the entire length of the window of 2L samples (with overlap of L samples). The energy of the coding noise is generally proportional to the energy of the block and is a function of the coding/decoding bit rate.
For a block including an onset (like the block 320-480 of FIG. 1), the energy of the signal is high and the level of the coding noise is therefore also high.
In transform coding, the level of the coding noise is typically lower than that of the signal for the high-energy segments which immediately follow the transition, but it is higher than that of the signal for the lower-energy segments, in particular over the part preceding the transition (samples 160-410 of FIG. 1). For the abovementioned part, the signal-to-noise ratio is negative and the resulting degradation can be very disturbing when listening. The coding noise prior to the transition is called pre-echo and the noise following the transition is called post-echo.
It can be seen in FIG. 1 that the pre-echo affects the frame preceding the transition and the frame where the transition occurs.
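The sign change of the signal-to-noise ratio around the transition can be illustrated with a toy model. In this sketch the coding noise is simply taken as uniform over the window and proportional to the window energy, as described above. The signal amplitudes and the 20 dB block-level ratio are hypothetical values chosen for illustration; only the sample positions (160, 410, 480) come from FIG. 1.

```python
import math

def local_snr_db(signal, noise_power, start, stop):
    """Signal-to-noise ratio in dB over signal[start:stop] for a flat noise power."""
    segment = signal[start:stop]
    signal_power = sum(s * s for s in segment) / len(segment)
    return 10.0 * math.log10(signal_power / noise_power)

WINDOW_START, ONSET, WINDOW_STOP = 160, 410, 480   # positions taken from FIG. 1

# Hypothetical amplitudes: quiet background before the onset, loud after it.
signal = [0.01 if n < ONSET else 1.0 for n in range(640)]

# Noise power uniform over the window and proportional to the window energy.
window = signal[WINDOW_START:WINDOW_STOP]
window_power = sum(s * s for s in window) / len(window)
BLOCK_SNR_DB = 20.0                                 # assumed block-level ratio
noise_power = window_power / 10.0 ** (BLOCK_SNR_DB / 10.0)

snr_before = local_snr_db(signal, noise_power, WINDOW_START, ONSET)
snr_after = local_snr_db(signal, noise_power, ONSET, WINDOW_STOP)
print(f"before onset: {snr_before:.1f} dB, after onset: {snr_after:.1f} dB")
```

Although the noise is 20 dB below the signal on average over the window, it exceeds the quiet part of the signal before the onset, which corresponds to the pre-echo.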
Psycho-acoustic experiments have demonstrated that the human ear performs a temporal pre-masking of the sounds that is fairly limited, of the order of a few milliseconds. The noise preceding the onset, or pre-echo, is audible when the duration of the pre-echo is greater than the pre-masking duration.
The human ear also performs a post-masking of a longer duration, from 5 to 60 milliseconds, upon the transition from high-energy sequences to low-energy sequences. The rate or level of disturbance that is acceptable for the post-echos is therefore greater than for the pre-echos.
The pre-echo phenomenon, which is the more critical of the two, is all the more disturbing when the length of the blocks in terms of number of samples is great. Now, in transform coding, it is well known that, for stationary signals, the more the length of the transform increases, the greater the coding gain. At a fixed sampling frequency and at a fixed bit rate, if the number of points of the window (and therefore the length of the transform) is increased, there will be more bits per frame to code the spectral lines deemed useful by the psychoacoustic model, hence the advantage of using blocks of great length. The MPEG AAC (Advanced Audio Coding) coding, for example, uses a long window which contains a fixed number of samples, 2048, i.e. a duration of 64 ms if the sampling frequency is 32 kHz; the problem of the pre-echos is managed therein by making it possible to switch from these long windows to 8 short windows through intermediate windows (called transition windows), which necessitates a certain delay in the coding to detect the presence of a transition and adapt the windows. The length of these short windows is therefore 256 samples (8 ms at 32 kHz). At low bit rate, it is still possible to have an audible pre-echo of a few ms. The switching of the windows makes it possible to attenuate the pre-echo, but not to eliminate it. The transform coders used for conversational applications, such as ITU-T G.722.1, G.722.1C or G.719, often use a frame length of 20 ms and a window of 40 ms duration at 16, 32 or 48 kHz (respectively). It can be noted that the ITU-T G.719 coder incorporates a window switching mechanism with transient detection, but the pre-echo is not completely reduced at low bit rate (typically at 32 kbit/s).
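The window durations quoted above follow directly from the number of samples and the sampling frequency. A short worked check is given below; the helper names are hypothetical.

```python
def duration_ms(num_samples, sampling_rate_hz):
    """Duration in milliseconds of num_samples at the given sampling rate."""
    return 1000.0 * num_samples / sampling_rate_hz

def samples_in(duration_in_ms, sampling_rate_hz):
    """Number of samples spanned by duration_in_ms at the given sampling rate."""
    return round(duration_in_ms * sampling_rate_hz / 1000.0)

# AAC at 32 kHz: one long window versus each of the 8 short windows.
long_window_ms = duration_ms(2048, 32000)
short_window_ms = duration_ms(2048 // 8, 32000)

# 20 ms frames at the three sampling rates cited for the conversational coders.
frame_lengths = {fs: samples_in(20, fs) for fs in (16000, 32000, 48000)}
```

With a 20 ms frame and a 40 ms window, a transition near the end of the window can therefore be preceded by several tens of milliseconds of coding noise, well beyond the few milliseconds of pre-masking mentioned above.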
In order to reduce the abovementioned disturbing effect of the pre-echo phenomenon, various solutions have been proposed in the coder and/or the decoder.
The window switching has already been cited; it necessitates transmitting an auxiliary information item to identify the type of windows used in the current frame. Another solution consists in applying an adaptive filtering. In the zone preceding the onset, the reconstructed signal is seen as the sum of the original signal and of the quantization noise.
A corresponding filtering technique has been described in the article entitled High Quality Audio Transform Coding at 64 Kbit/s, IEEE Trans. on Communications Vol 42, No. 11, November 1994, published by Y. Mahieux and J. P. Petit.
The implementation of such a filtering requires knowledge of parameters of which some, like the prediction coefficients and the variance of the signal corrupted by the pre-echo, are estimated in the decoder from noisy samples. However, information such as the energy of the original signal can be known only to the coder and must consequently be transmitted. This entails transmitting additional information, which, at constrained bit rate, reduces the relative budget allocated to the transform coding. When the received block contains an abrupt variation of the dynamic range, the filtering processing is applied to it.
The abovementioned filter process does not make it possible to restore the original signal, but provides a strong reduction of the pre-echos. It does however entail transmitting the additional parameters to the decoder.
Unlike the above solutions, various pre-echo reduction techniques without specific transmission of the information have been proposed. For example, a review of the reduction of pre-echos in the context of hierarchical coding is presented in the article by B. Kövesi, S. Ragot, M. Gartner, H. Taddei, entitled “Pre-echo reduction in the ITU-T G.729.1 embedded coder,” EUSIPCO, Lausanne, Switzerland, August 2008.
A typical example of a pre-echo attenuation processing method without auxiliary information is described in the French patent application FR 08 56248. In this example, attenuation factors are determined for each sub-block, in the low-energy sub-blocks preceding a sub-block in which a transition or onset has been detected.
The attenuation factor g(k) in the kth sub-block is calculated for example as a function of the ratio R(k) between the energy of the highest energy sub-block and the energy of the kth sub-block concerned:
$$g(k) = f(R(k))$$
in which f is a decreasing function with values between 0 and 1 and k is the number of the sub-block. Other definitions of the factor g(k) are possible, for example as a function of the energy En(k) in the current sub-block and of the energy En(k−1) in the preceding sub-block.
If the energy of the sub-blocks varies little relative to the maximum energy in the sub-blocks considered in the current frame, no attenuation is then necessary; the factor g(k) is set at an attenuation value inhibiting the attenuation, that is to say 1. Otherwise, the attenuation factor lies between 0 and 1.
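The calculation of g(k) from the energy ratio R(k) can be sketched as follows. The text does not specify the function f, so the one used here (equal to 1 up to a threshold, then decreasing as the square root of threshold/R) is purely hypothetical, as are the threshold value and the function names. The sketch also assumes that the highest-energy sub-block is the one containing the onset.

```python
import math

def subblock_energies(frame, sub_len):
    """Energy En(k) of each consecutive sub-block of sub_len samples."""
    return [sum(s * s for s in frame[i:i + sub_len])
            for i in range(0, len(frame), sub_len)]

def attenuation_factors(energies, threshold=16.0):
    """g(k) = f(R(k)) for the sub-blocks preceding the highest-energy one.

    f is a hypothetical decreasing function with values between 0 and 1:
    f(R) = 1 when R <= threshold (no attenuation), else sqrt(threshold / R).
    """
    e_max = max(energies)
    k_max = energies.index(e_max)        # assumed position of the onset
    gains = []
    for k, energy in enumerate(energies):
        if k >= k_max:
            gains.append(1.0)            # onset and later sub-blocks: untouched
            continue
        ratio = e_max / energy if energy > 0.0 else math.inf
        gains.append(1.0 if ratio <= threshold else math.sqrt(threshold / ratio))
    return gains

# Hypothetical energies: four quiet sub-blocks followed by an onset.
gains = attenuation_factors([1.0, 1.0, 1.0, 1.0, 400.0, 300.0, 200.0, 100.0])
flat = attenuation_factors([5.0, 5.0, 5.0, 5.0])
```

When the sub-block energies vary little, every ratio stays below the threshold and all factors equal 1, which corresponds to the inhibited attenuation described above.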
In most cases, above all when the pre-echo is disturbing, the frame which precedes the pre-echo frame has a uniform energy which corresponds to the energy of a low-energy segment (typically a background noise). Experiments show that it is neither useful nor even desirable for the energy of the signal, after pre-echo attenuation processing, to become lower than the average energy (per sub-block) of the signal preceding the processing zone, typically that of the preceding frame, denoted $\overline{En}$, or that of the second half of the preceding frame, denoted $\overline{En'}$.
For the sub-block of index k to be processed, the limit value of the attenuation factor, denoted $\mathrm{lim}_g(k)$, can be calculated in order to obtain exactly the same energy as the average energy per sub-block of the segment preceding the sub-block to be processed. This value is of course limited to a maximum of 1 since it is the attenuation values that are of interest here. More specifically, the following is defined here:
$$\mathrm{lim}_g(k) = \min\left(\sqrt{\frac{\max\left(\overline{En}, \overline{En'}\right)}{En(k)}},\; 1\right)$$
in which the average energy of the preceding segment is approximated by the value $\max(\overline{En}, \overline{En'})$.
The $\mathrm{lim}_g(k)$ value thus obtained serves as a lower limit in the final calculation of the attenuation factor of the sub-block. It is therefore used as follows:
$$g(k) = \max\left(g(k), \mathrm{lim}_g(k)\right)$$
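The lower limit on the attenuation factor can be sketched as follows. The function and parameter names are hypothetical. The square root is an assumption of this sketch: it reflects the fact that the factor multiplies sample amplitudes, whereas the limit is stated in terms of energy.

```python
import math

def limit_gain(en_k, en_prev_frame, en_prev_half):
    """Limit value of the factor for a sub-block of energy en_k.

    en_prev_frame and en_prev_half are the average energies per sub-block of
    the preceding frame and of its second half; the larger of the two
    approximates the average energy of the preceding segment.
    """
    reference = max(en_prev_frame, en_prev_half)
    if en_k <= 0.0:
        return 1.0                        # nothing to attenuate
    return min(math.sqrt(reference / en_k), 1.0)

def bounded_gain(g_k, lim_k):
    """Final factor: never attenuate below the limit value."""
    return max(g_k, lim_k)

# Hypothetical values: a strong pre-echo sub-block after a quiet frame.
lim = limit_gain(100.0, 2.0, 4.0)
g = bounded_gain(0.05, lim)
```

In this example the unbounded factor 0.05 would push the sub-block energy below that of the preceding background noise; the limit raises it so that the attenuated energy matches the reference energy.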
The attenuation factors (or gains) g(k) determined for the sub-blocks can then be smoothed by a smoothing function applied sample-by-sample to avoid abrupt variations of the attenuation factor at the boundaries of the blocks.
For example, the gain per sample can first of all be defined as a piecewise constant function:
$$g_{pre}(n) = g(k), \quad n = kL', \ldots, (k+1)L' - 1$$
in which L′ represents the length of a sub-block.
The function is then smoothed according to the following equation:
$$g_{pre}(n) := \alpha\, g_{pre}(n-1) + (1-\alpha)\, g_{pre}(n), \quad n = 0, \ldots, L-1$$
with the convention that $g_{pre}(-1)$ is the last attenuation factor obtained for the last sample of the preceding sub-block, and α is the smoothing coefficient, typically α=0.85.
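The piecewise constant gain and its recursive smoothing can be sketched as follows, with α=0.85 as in the text. The function names and the example factors are hypothetical.

```python
def piecewise_gain(sub_gains, sub_len):
    """Per-sample gain: each factor g(k) repeated over its L' samples."""
    return [g for g in sub_gains for _ in range(sub_len)]

def smooth_recursive(g_raw, alpha=0.85, g_prev=1.0):
    """First-order recursive smoothing of the per-sample gain.

    g_prev plays the role of g_pre(-1), the last smoothed factor of the
    preceding sub-block.
    """
    smoothed = []
    for g in g_raw:
        g_prev = alpha * g_prev + (1.0 - alpha) * g
        smoothed.append(g_prev)
    return smoothed

# Hypothetical factors: one attenuated sub-block followed by an untouched one.
raw = piecewise_gain([0.2, 1.0], 4)
smoothed = smooth_recursive(raw, alpha=0.85, g_prev=0.2)
```

After the step from 0.2 to 1, the smoothed gain rises progressively instead of jumping, which is the behavior sought at the sub-block boundaries.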
Other smoothing functions are also possible such as, for example, the linear cross-fade over u samples:
$$g_{pre}(n) = \frac{1}{u} \sum_{i=0}^{u-1} g'_{pre}(n-i), \quad n = 0, \ldots, L-1$$
in which $g'_{pre}(n)$ is the non-smooth attenuation and $g_{pre}(n)$ is the smoothed attenuation; $g'_{pre}(n)$ with n=−(u−1), . . . , −1 are the last u−1 attenuation factors obtained for the last samples of the preceding sub-block. u=5 can for example be taken.
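The averaging over u samples can be sketched as follows. The function name and the example values are hypothetical; `history` holds the last u−1 non-smooth factors of the preceding sub-block.

```python
def smooth_moving_average(g_raw, history, u=5):
    """Average of the current and the u-1 previous non-smooth factors."""
    if len(history) != u - 1:
        raise ValueError("history must hold exactly u-1 factors")
    extended = list(history) + list(g_raw)
    # extended[n + u - 1] is g'_pre(n), so the slice covers g'_pre(n-u+1..n).
    return [sum(extended[n:n + u]) / u for n in range(len(g_raw))]

# Hypothetical step from full attenuation to no attenuation.
raw = [0.0] * 5 + [1.0] * 5
smoothed = smooth_moving_average(raw, history=[0.0] * 4, u=5)
```

With u=5, the gain moves from 0 to 1 in five equal steps after the discontinuity, so the transition is linear rather than exponential as in the recursive form.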
Once the factors $g_{pre}(n)$ have thus been calculated, the attenuation of pre-echos is done on the reconstructed signal in the current frame, $x_{rec}(n)$, by multiplying each sample by the corresponding factor:
$$x_{rec,g}(n) = g_{pre}(n)\, x_{rec}(n), \quad n = 0, \ldots, L-1$$
in which $x_{rec,g}(n)$ is the signal decoded and post-processed by the pre-echo reduction.
FIGS. 2 and 3 illustrate the implementation of the attenuation method as described in the prior art patent application, mentioned above and summarized previously.
In these examples, the signal is sampled at 32 kHz, the length of the frame is L=640 samples and each frame is divided into 8 sub-blocks of K=80 samples.
In the part a) of FIG. 2, a frame of an original signal sampled at 32 kHz is represented. An onset (or transition) in the signal is situated in the sub-block commencing with the index 320. This signal has been coded by a transform coder of MDCT type at low bit rate (24 Kbit/s).
In the part b) of FIG. 2, the result of the decoding without pre-echo processing is illustrated. The pre-echo from the sample 160 can be observed, in the sub-blocks preceding the one containing the onset.
The part c) shows the trend of the pre-echo attenuation factor (continuous line) obtained by the method described in the abovementioned prior art patent application. The dotted line represents the factor before smoothing. Note here that the position of the onset is estimated around the sample 380 (in the block delimited by the samples 320 and 400).
The part d) illustrates the result of the decoding after application of the pre-echo processing (multiplication of the signal b) by the signal c)). It can be seen that the pre-echo has indeed been attenuated. FIG. 2 also shows that the smoothed factor does not go back to 1 at the moment of the onset, which implies a reduction of the amplitude of the onset. The perceptible impact of this reduction is very low but can nevertheless be avoided. FIG. 3 illustrates the same example as FIG. 2, in which, before smoothing, the attenuation factor value is forced to 1 for the last few samples of the sub-block preceding the sub-block where the onset is situated. The part c) of FIG. 3 gives an example of such a correction.
In this example, the factor value 1 has been assigned to the last 16 samples of the sub-block preceding the onset, from the index 364. Thus, the smoothing function progressively increases the factor to have a value close to 1 at the moment of the onset. The amplitude of the onset is then preserved, as illustrated in the part d) of FIG. 3, but a few pre-echo samples are not attenuated.
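The effect of forcing the factor to 1 before smoothing can be sketched as follows, using the values of the example (16 samples from index 364, α=0.85, frame of 640 samples). The raw factor of 0.1 before the onset and the function names are hypothetical, and the onset is assumed to start at index 380.

```python
def smooth_recursive(g_raw, alpha=0.85, g_prev=1.0):
    """First-order recursive smoothing of the per-sample gain."""
    smoothed = []
    for g in g_raw:
        g_prev = alpha * g_prev + (1.0 - alpha) * g
        smoothed.append(g_prev)
    return smoothed

def force_unity_before_onset(g_raw, onset_index, n_forced=16):
    """Set the last n_forced factors before onset_index to 1 (before smoothing)."""
    forced = list(g_raw)
    for n in range(max(0, onset_index - n_forced), onset_index):
        forced[n] = 1.0
    return forced

FRAME_LEN, ONSET = 640, 380
raw = [0.1 if n < ONSET else 1.0 for n in range(FRAME_LEN)]   # hypothetical factor

plain = smooth_recursive(raw)
corrected = smooth_recursive(force_unity_before_onset(raw, ONSET))

def apply_gain(x_rec, g_pre):
    """Sample-wise multiplication of the decoded signal by the smoothed gain."""
    return [g * x for g, x in zip(g_pre, x_rec)]
```

Without the correction the smoothed gain is still far from 1 at index 380, so the onset itself is attenuated; with the correction the gain is already close to 1 there, at the cost of leaving samples 364 to 379 essentially unattenuated.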
In the example of FIG. 3, because of the smoothing of the gain, the attenuation does not make it possible to reduce the pre-echo right up to the position of the onset.
This pre-echo reduction technique can however be improved for some types of signals, such as modern music signals for example. Indeed, in some cases, a false pre-echo detection can take place. FIG. 4 illustrates an example of such an original signal, uncoded and therefore without pre-echo. It is a beat of an electronic/synthetic percussion instrument. It can be seen here that, before the clear onset around index 1600, there is a synthetic noise which starts around index 1250. This synthetic noise, which forms part of the signal, would be detected as a pre-echo by the pre-echo detection algorithm described above, even assuming a perfect coding/decoding of the signal. The pre-echo attenuation processing would therefore eliminate this component of the signal. This would distort the decoded signal (even when the coding/decoding is perfect), which is not desirable.
There is therefore a need for an enhanced technique for discriminating and attenuating pre-echos in decoding, which makes it possible to make the detection of the pre-echos reliable and avoid the false detections without any auxiliary information being transmitted by the coder.