1. Perceptual Encoding with Transmission of a Masking Curve
1.1 Audio Compression and Quantization
Audio compression often exploits certain properties of human hearing, and the encoding and quantization of an audio signal are frequently designed to take these properties into account. This is referred to as “perceptual encoding”, or encoding according to a psycho-acoustic model of the human ear.
The human ear cannot separate two components of a signal that are emitted at neighboring frequencies and within a limited time slot. This property is known as auditory masking. Furthermore, the ear has a threshold of hearing in quiet surroundings, below which no sound is perceived. The level of this threshold varies with the frequency of the sound wave.
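By way of illustration only, the frequency dependence of the threshold of hearing in quiet can be sketched with Terhardt's widely used analytic approximation. This formula is not part of the present description; it is an assumption introduced here for the example, and the function name is hypothetical.

```python
import math

def threshold_in_quiet_db(freq_hz: float) -> float:
    """Approximate threshold of hearing in quiet, in dB SPL.

    Uses Terhardt's analytic approximation (an assumption of this sketch,
    not a formula taken from the description).
    """
    f_khz = freq_hz / 1000.0
    return (3.64 * f_khz ** -0.8
            - 6.5 * math.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 1e-3 * f_khz ** 4)

# The threshold is high at low frequencies, lowest around 3 to 4 kHz,
# and rises again toward high frequencies.
for f in (100, 1000, 3300, 10000):
    print(f"{f:>6} Hz: {threshold_in_quiet_db(f):6.1f} dB SPL")
```

With this approximation the ear is most sensitive in the region of 3 to 4 kHz, which is consistent with the statement that the threshold depends on frequency.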
In the compression and/or transmission of digital audio signals, the aim is to determine the number of bits used to quantize the spectral components that form the signal without introducing excessive quantization noise, which would impair the quality of the encoded signal. The goal is generally to reduce the number of quantization bits so as to compress the signal efficiently. A compromise must therefore be found between sound quality and the level of compression of the signal.
In classic prior-art techniques, quantization therefore relies on the masking threshold of the human ear, together with the masking property, to determine the maximum amount of quantization noise that can be injected into the signal without being perceived by the ear when the audio signal is rendered, i.e. without introducing excessive distortion.
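The link between an acceptable noise level and a quantization interval can be sketched as follows. The example assumes a uniform scalar quantizer, for which the noise power is approximately Δ²/12 at high resolution. The description does not specify the quantizer, so this model and the function names are assumptions made for the example.

```python
import math
import random

def step_for_noise_power(noise_power: float) -> float:
    """Quantization interval whose noise power is about `noise_power`.

    Assumes a uniform quantizer with noise power step**2 / 12
    (high-resolution approximation, an assumption of this sketch).
    """
    return math.sqrt(12.0 * noise_power)

def quantize(x: float, step: float) -> float:
    """Round x to the nearest multiple of step."""
    return step * round(x / step)

random.seed(0)
allowed_noise = 1e-4                      # hypothetical masked noise power
step = step_for_noise_power(allowed_noise)

# Measure the noise actually injected on a test signal.
samples = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
measured = sum((x - quantize(x, step)) ** 2 for x in samples) / len(samples)
print(f"step = {step:.4f}, measured noise power = {measured:.2e}")
```

A larger permitted noise power gives a larger interval and therefore fewer bits per component, which is the trade-off between quality and compression described above.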
1.2 Perceptual Audio Transform Encoding
For an exhaustive description of audio transform encoding, cf. Jayant, Johnston and Safranek, “Signal Compression Based on Models of Human Perception,” Proc. of the IEEE, Vol. 81, No. 10, pp. 1385-1422, October 1993.
This technique makes use of the frequency masking model of the ear illustrated in FIG. 1, which shows an example of the frequency representation of an audio signal together with the masking threshold of the ear. The x-axis 10 represents the frequencies f in Hz and the y-axis 11 represents the sound intensity I in dB. The ear breaks down the spectrum of a signal x(t) into critical bands 120, 121, 122, 123 in the frequency domain on the Bark scale. The critical band 120, indexed n, of the signal x(t), having energy En, then generates a mask 13 within the band indexed n and in the neighboring critical bands 122 and 123. The associated masking threshold 13 is proportional to the energy En of the “masking” component 120 and decreases for the critical bands with indices below and above n.
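For reference, the mapping from frequency in Hz to the Bark scale can be sketched with the Zwicker and Terhardt approximation. This particular formula is an assumption of the example and is not given in the description; the function name is hypothetical.

```python
import math

def hz_to_bark(freq_hz: float) -> float:
    """Critical-band rate in Bark (Zwicker and Terhardt approximation)."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))

# Critical bands are narrow at low frequencies and wide at high frequencies:
# equal steps in Bark correspond to increasingly large steps in Hz.
for f in (100, 500, 1000, 4000, 16000):
    print(f"{f:>6} Hz -> {hz_to_bark(f):5.2f} Bark")
```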
The components 122 and 123 are masked in the example of FIG. 1. Furthermore, the component 121 is also masked, since it lies below the absolute threshold of hearing 14. A total masking curve is then obtained by combining the absolute threshold of hearing 14 with the masking thresholds associated with each of the components of the audio signal x(t) analyzed in critical bands. This masking curve represents the maximum spectral density of quantization noise that can be superimposed on the signal, when it is encoded, without being perceptible to the human ear. A quantization interval profile, also loosely called an injected noise profile, is then shaped during the quantization of the spectral coefficients resulting from the frequency transform of the source audio signal.
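The construction of a total masking curve can be sketched as follows. This is a simplified illustration under stated assumptions, not the model of FIG. 1: each band is treated as one Bark wide, the mask of a band sits a fixed offset below its energy and decays linearly in dB on either side, and the thresholds are combined by taking the maximum. The offset, the slopes and all names are hypothetical values chosen for the example.

```python
def masking_curve_db(band_energy_db, quiet_threshold_db,
                     offset_db=14.0, lower_slope_db=25.0, upper_slope_db=10.0):
    """Combine the threshold in quiet with the mask generated by each band.

    All numeric defaults are illustrative assumptions.
    """
    n_bands = len(band_energy_db)
    curve = list(quiet_threshold_db)          # start from the absolute threshold
    for n, energy in enumerate(band_energy_db):
        for k in range(n_bands):
            distance = k - n                  # in bands (assumed 1 Bark each)
            slope = upper_slope_db if distance > 0 else lower_slope_db
            mask = energy - offset_db - slope * abs(distance)
            curve[k] = max(curve[k], mask)    # combination by maximum (assumption)
    return curve

band_energy_db = [20.0, 70.0, 45.0, 5.0]      # hypothetical band energies
quiet_threshold_db = [10.0, 10.0, 10.0, 10.0]

curve = masking_curve_db(band_energy_db, quiet_threshold_db)
audible = [e > c for e, c in zip(band_energy_db, curve)]
print(curve)    # [31.0, 56.0, 46.0, 36.0]
print(audible)  # [False, True, False, False]
```

In this example only the strong band remains audible: its neighbors fall under the mask it generates, and the weakest band also lies below the threshold in quiet, which mirrors the situation described for FIG. 1.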
FIG. 2 is a flow chart illustrating the principle of a classic perceptual encoder. A temporal source audio signal x(t) is transformed into the frequency domain by a time-frequency transform block 20. A spectrum of the source signal, formed by spectral coefficients Xn, is then obtained. It is analyzed by a psycho-acoustic model 21, whose role is to determine the total masking curve C of the signal as a function of the absolute threshold of hearing and of the masking thresholds of each spectral component of the signal. The masking curve obtained indicates the quantity of quantization noise that can be injected, and therefore determines the number of bits to be used to quantize the spectral coefficients or samples. This step of determining the number of bits is performed by a binary allocation block 22, which delivers a quantization interval profile Δn for each coefficient Xn. The binary allocation block seeks to attain the target bit rate by adjusting the quantization intervals under the shaping constraint given by the masking curve C. The quantization intervals Δn are encoded in the form of scale factors F, in particular by this binary allocation block 22, and are then transmitted as ancillary information in the bit stream T.
A quantization block 23 receives the spectral coefficients Xn as well as the determined quantization intervals Δn, and then delivers quantized coefficients X̂n.
Finally, an encoding and bit stream forming block 24 collects the quantized spectral coefficients X̂n and the scale factors F, encodes them, and thus forms a bit stream containing the payload data of the encoded source audio signal as well as the data representative of the scale factors.
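The interaction between the binary allocation block 22 and the quantization block 23 can be sketched as a simple rate loop. This is a minimal sketch under stated assumptions, not the encoder of FIG. 2: the intervals start at the values that place the noise on the masking curve (uniform quantizer, noise power Δ²/12), a common gain then enlarges them until a crude bit estimate meets the budget, and the bit estimate itself is a hypothetical stand-in for the real entropy coder. All names and numeric values are illustrative.

```python
import math

def bits_needed(coeffs, steps):
    """Crude estimate: one sign bit plus the magnitude bits of each index."""
    return sum(1 + abs(round(x / s)).bit_length() for x, s in zip(coeffs, steps))

def allocate_steps(coeffs, mask_power, bit_budget, growth=1.1, max_iter=200):
    """Scale the mask-shaped intervals by a common gain until the budget is met."""
    shaped = [math.sqrt(12.0 * c) for c in mask_power]   # noise on the curve C
    gain = 1.0
    for _ in range(max_iter):
        steps = [gain * s for s in shaped]
        if bits_needed(coeffs, steps) <= bit_budget:
            return steps
        gain *= growth                                   # coarser, same shape
    raise ValueError("bit budget cannot be met")

def quantize(coeffs, steps):
    return [round(x / s) for x, s in zip(coeffs, steps)]

def dequantize(indices, steps):
    return [i * s for i, s in zip(indices, steps)]

coeffs = [10.0, -3.2, 0.7, 0.05]          # hypothetical spectral coefficients Xn
mask_power = [0.01, 0.01, 0.04, 0.04]     # hypothetical masking curve C
steps = allocate_steps(coeffs, mask_power, bit_budget=12)
indices = quantize(coeffs, steps)
decoded = dequantize(indices, steps)
print(steps, indices, decoded)
```

In this sketch the decoder needs `steps` to reconstruct the coefficients, which is why the classic scheme transmits the scale factors F as ancillary information alongside the quantized coefficients.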
2. Hierarchical Building of the Masking Curves
The drawbacks of the prior art are described below in the context of hierarchical encoding of digital audio data. However, an embodiment of the invention can be applied to all types of encoders of digital audio signals that implement a quantization based on the psycho-acoustic model of the ear. These encoders are not necessarily hierarchical.
Hierarchical coding entails the cascading of several encoder stages. The first stage generates the encoded version at the lowest bit rate, to which the following stages provide successive improvements at gradually increasing bit rates. In the particular case of the encoding of audio signals, the improvement stages are conventionally based on perceptual transform encoding as described in the above section.
However, one drawback of perceptual transform encoding in a hierarchical approach of this kind lies in the fact that the scale factors obtained have to be transmitted from the very first level or basic level. They then represent a major part of the bit rate allocated to the low bit rate level, as compared with the payload data.
To overcome this drawback and therefore save on the transmission of the injected quantization noise profile, i.e. the scale factors, a masking technique known as an “implicit” technique has been proposed by J. Li in “Embedded Audio Coding (EAC) With Implicit Auditory Masking”, ACM Multimedia 2002. A technique of this kind relies on the hierarchical structure of the encoding/decoding system to estimate the masking curve recursively at each refinement level, by exploiting an approximation of this curve that is refined from level to level.
The updating of the masking curve is thus reiterated at each hierarchical level, using coefficients of the transform quantized at the previous level.
Since the estimation of the masking curve is based on the quantized values of the coefficients of the time-frequency transform, it can be done identically at the encoder and at the decoder. This has the advantage of avoiding the transmission of the quantization interval profile, or quantization noise profile, to the decoder.
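The principle of implicit masking can be sketched as follows. This is not the algorithm of the cited publication: the rule used to estimate the intervals (a floor and a fraction of the largest neighboring magnitude, both halved at each level) is a hypothetical choice made only to show that the encoder and the decoder can derive identical intervals from the coefficients already quantized, without any transmitted scale factors.

```python
def estimate_steps(reconstruction, level, base_floor=8.0, base_ratio=0.5):
    """Hypothetical interval estimate from already-decoded coefficients."""
    floor = base_floor / 2 ** level
    ratio = base_ratio / 2 ** level
    steps = []
    for n in range(len(reconstruction)):
        lo, hi = max(0, n - 1), min(len(reconstruction), n + 2)
        peak = max(abs(v) for v in reconstruction[lo:hi])   # coefficient and neighbors
        steps.append(max(floor, ratio * peak))
    return steps

def encode(coeffs, levels):
    """Quantize the residual at each level; only the indices are transmitted."""
    recon = [0.0] * len(coeffs)
    layers = []
    for level in range(levels):
        steps = estimate_steps(recon, level)
        indices = [round((x - r) / s) for x, r, s in zip(coeffs, recon, steps)]
        recon = [r + i * s for r, i, s in zip(recon, indices, steps)]
        layers.append(indices)
    return layers, recon

def decode(layers, size):
    """Rebuild the same intervals from the decoder's own reconstruction."""
    recon = [0.0] * size
    for level, indices in enumerate(layers):
        steps = estimate_steps(recon, level)
        recon = [r + i * s for r, i, s in zip(recon, indices, steps)]
    return recon

coeffs = [40.0, -3.0, 1.5, 0.2]           # hypothetical spectral coefficients
layers, encoder_recon = encode(coeffs, levels=3)
decoder_recon = decode(layers, len(coeffs))
print(layers)
print(decoder_recon == encoder_recon)     # the decoder tracks the encoder exactly
```

Note that at the first level nothing has been decoded yet, so the estimated intervals are flat. The estimate only takes on a signal-dependent shape once earlier levels have delivered some coefficients.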
3. Drawbacks of the Prior Art
Although the implicit masking technique, based on hierarchical encoding, avoids the transmission of the masking curve and thus provides a gain in bit rate relative to classic perceptual encoding, in which the quantization interval profile is transmitted, the inventors have noted that it nevertheless has several drawbacks.
Indeed, the masking model implemented in both the encoder and the decoder is necessarily fixed, and therefore cannot be adapted precisely to the nature of the signal. For example, a single masking factor is used regardless of the tonal or atonal character of the spectral components to be encoded.
Furthermore, the masking curves are computed on the assumption that the signal is stationary, and cannot be properly applied to transient portions or to sound attacks.
Furthermore, since the masking curves are obtained at each level from coefficients or residues of coefficients quantized at the previous levels, the masking curve for the first level is incomplete because certain portions of the spectrum have not yet been encoded. This incomplete curve does not necessarily represent an optimum shape of the profile of the quantization interval for the hierarchical level considered.