For the transport of digital audio signals over transmission networks, be they for example fixed or mobile networks, or for the storage of signals, use is made of compression processes (or source coding) implementing coding systems of the transform-based frequency coding or temporal coding type.
The method and the device, which are the subject of the invention, thus have as field of application the compression of sound signals, in particular, digital audio signals coded by frequency transform.
FIG. 1 represents, by way of illustration, a basic diagram of the transform coding and decoding of a digital audio signal, including overlap-add analysis-synthesis according to the prior art.
Certain musical sequences, such as percussion, and certain speech segments, such as plosives (/k/, /t/, . . . ), are characterized by extremely abrupt attacks which result in very fast transitions and a very strong variation in the dynamic range of the signal in the space of a few samples. An exemplary transition is given in FIG. 1, starting at sample 410.
For the coding/decoding processing, the input signal, denoted x(n), is sliced into blocks of L samples (delimited here by vertical dashed lines). The slicing into successive blocks defines the blocks x_N = [x(N·L) . . . x(N·L+L−1)] = [x_N(0) . . . x_N(L−1)], where N is the index of the frame and L is the length of the frame. In FIG. 1, L=160 samples. In the case of the modified discrete cosine transform (MDCT), two blocks x_N(n) and x_{N+1}(n) are analyzed jointly to give a block of transformed coefficients associated with the frame of index N.
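By way of illustration only, the slicing into blocks and the joint analysis of two consecutive blocks may be sketched in Python as follows. The function names slice_blocks and mdct, the sine analysis window and the absence of any scaling factor are assumptions made for this sketch; they are not taken from the prior art described here.

```python
import math

def slice_blocks(x, L):
    """Split x into blocks x_N = [x(N*L) ... x(N*L + L - 1)].

    Trailing samples that do not fill a whole block are dropped.
    """
    return [x[N * L:(N + 1) * L] for N in range(len(x) // L)]

def mdct(block_n, block_n1):
    """MDCT of two consecutive blocks (2L samples) -> L coefficients.

    A sine window is assumed; other windows are possible.
    """
    L = len(block_n)
    frame = list(block_n) + list(block_n1)
    window = [math.sin(math.pi * (n + 0.5) / (2 * L)) for n in range(2 * L)]
    n0 = 0.5 + L / 2  # phase offset of the MDCT kernel
    return [
        sum(window[n] * frame[n] * math.cos(math.pi / L * (n + n0) * (k + 0.5))
            for n in range(2 * L))
        for k in range(L)
    ]

# Usage: the coefficients of frame N are computed from blocks N and N+1.
signal = [math.sin(0.3 * n) for n in range(4 * 160)]
blocks = slice_blocks(signal, 160)
coeffs_frame_0 = mdct(blocks[0], blocks[1])
```

Each call consumes 2L samples but yields only L coefficients, which is why successive frames must overlap by L samples for the signal to be recoverable at the decoder.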
The division into blocks, also called frames, carried out by the transform coding is totally independent of the sound signal, and the transitions can therefore appear at any point of the analysis window. Now, after transform decoding, the reconstructed signal is marred by “noise” (or distortion) produced by the quantization (Q)-inverse quantization (Q−1) operation. This coding noise is distributed temporally in a relatively uniform manner over the whole of the temporal support of the transformed block, that is to say over the whole length of the window of 2L samples (with an overlap of L samples). The energy of the coding noise is in general proportional to the energy of the block and depends on the coding bit rate.
For a block comprising an attack (such as the block 320-480 of FIG. 1), the energy of the signal is high and the level of the noise is therefore also high.
In transform coding, the level of the coding noise is below that of the signal for the samples of high energy which immediately follow the transition, but it is above that of the signal for the samples of lower energy, especially over the part preceding the transition (samples 160-410 of FIG. 1). For that part, the signal-to-noise ratio (in dB) is negative and the resulting degradation can be very annoying on listening. The coding noise before the transition is called pre-echo and the noise after the transition is called post-echo.
It may be observed in FIG. 1 that the pre-echo affects the frame preceding the transition as well as the frame where the transition occurs.
Psycho-acoustic experiments have shown that the human ear performs fairly limited temporal pre-masking of sounds, of the order of a few milliseconds. The noise preceding the attack, or pre-echo, is audible when the duration of the pre-echo is greater than the duration of the pre-masking.
The human ear also performs post-masking of a longer duration, from 5 to 60 milliseconds, when switching from high-energy sequences to low-energy sequences. The acceptable degree or level of annoyance for the post-echoes is therefore greater than for the pre-echoes.
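The audibility criterion stated above, namely a pre-echo lasting longer than the pre-masking, may be written as a simple comparison. In the following Python sketch, the sampling frequency of 16 kHz and the pre-masking duration of 5 ms are hypothetical values chosen only for illustration (the text gives "a few milliseconds" and does not state the sampling frequency of FIG. 1); the function names are likewise assumptions.

```python
def pre_echo_duration_ms(first_sample, attack_sample, fs_hz):
    """Duration of the noisy stretch preceding the attack, in milliseconds."""
    return 1000.0 * (attack_sample - first_sample) / fs_hz

def pre_echo_audible(duration_ms, pre_masking_ms):
    """Pre-echo is audible when it outlasts the temporal pre-masking."""
    return duration_ms > pre_masking_ms

FS_HZ = 16000          # hypothetical sampling frequency
PRE_MASKING_MS = 5.0   # hypothetical value for "a few milliseconds"

# Samples 160-410 of FIG. 1: window start up to the transition.
duration = pre_echo_duration_ms(160, 410, FS_HZ)
audible = pre_echo_audible(duration, PRE_MASKING_MS)
```

Under these assumed values, the 250 samples preceding the transition last considerably longer than the pre-masking, so the pre-echo would not be hidden by the attack.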
The more critical phenomenon of pre-echoes is all the more annoying the greater the length of the blocks in terms of number of samples. Now, in transform coding, it is necessary to have a faithful resolution of the most significant frequency zones. At fixed sampling frequency and at fixed bit rate, if the number of points of the window is increased, more bits will be available for coding the spectral lines deemed useful by the psychoacoustic model, hence the advantage of using blocks of large length. MPEG AAC coding (Advanced Audio Coding), for example, uses a long window which contains a fixed number of samples, 2048, i.e. a duration of 64 ms at a sampling frequency of 32 kHz. The transform coders used for conversational applications often use a window of duration 40 ms at 16 kHz and a frame renewal duration of 20 ms.
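The figures quoted above follow from simple arithmetic, which the short Python sketch below reproduces. The function names are assumptions made for illustration. The spacing between spectral lines is computed on the basis that an MDCT over a window of 2L samples yields L coefficients covering the band from 0 to half the sampling frequency.

```python
def window_duration_ms(num_samples, fs_hz):
    """Duration of a window of num_samples samples, in milliseconds."""
    return 1000.0 * num_samples / fs_hz

def window_samples(duration_ms, fs_hz):
    """Number of samples in a window of the given duration."""
    return round(duration_ms * fs_hz / 1000.0)

def line_spacing_hz(window_len, fs_hz):
    """Spacing of MDCT spectral lines: (fs/2) / (window_len/2)."""
    return fs_hz / window_len

# Long AAC window: 2048 samples at 32 kHz.
aac_ms = window_duration_ms(2048, 32000)
aac_spacing = line_spacing_hz(2048, 32000)

# Conversational coder: 40 ms window and 20 ms frame renewal at 16 kHz.
conv_window = window_samples(40, 16000)
conv_frame = window_samples(20, 16000)
conv_spacing = line_spacing_hz(conv_window, 16000)
```

The trade-off is visible in these numbers: a longer window gives more closely spaced spectral lines, but it also lengthens the time span over which the coding noise of a frame containing an attack is spread.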