The present invention relates to audio signal encoding devices, intended in particular for applications involving the storage or transmission of digitized and compressed audio signals.
The invention relates more precisely to hierarchical audio encoding systems, which are capable of providing a range of bit rates by distributing the information relating to an audio signal to be encoded into hierarchically-arranged subsets, such that this information can be used in order of importance with respect to audio quality. The criterion taken into account for determining this order is a criterion of optimization (or rather of least degradation) of the quality of the encoded audio signal. Hierarchical encoding is particularly suited to transmission over networks that are heterogeneous or whose available rates vary over time, as well as to transmission to terminals having different or variable characteristics.
The invention relates more particularly to the hierarchical encoding of 3D sound scenes. A 3D sound scene comprises a plurality of audio channels corresponding to monophonic audio signals and is also known as spatialized sound.
An encoded sound scene is intended to be reproduced on a sound rendering system, which can comprise a simple headset, the two speakers of a computer, or a Home Cinema 5.1-type system with five speakers (one speaker at the level of the screen; in front of the theoretical listener, one speaker to the left and one to the right; behind the theoretical listener, one speaker to the left and one to the right), etc.
For example, consider an original sound scene comprising three distinct sound sources located at different positions in space. The signals describing this sound scene are encoded. The data resulting from this encoding are transmitted to the decoder, and are then decoded. The decoded data are used to generate five signals intended for the five speakers of the sound rendering system. Each of the five speakers broadcasts one of the signals, the set of signals broadcast by the speakers synthesizing the 3D sound scene and thus positioning three virtual sound sources in space.
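The generation of the five speaker feeds from the source signals can be pictured with a simple amplitude-panning mix. The following sketch is purely illustrative and is not taken from the described system: the speaker layout, the panning rule, and all function names are assumptions.

```python
import numpy as np

# Hypothetical illustration: render three monophonic sources to a
# 5.1-style layout (LFE omitted) by splitting each source's energy
# between its two nearest speakers.
# Speaker azimuths in degrees: center, front L/R, rear L/R.
SPEAKER_AZIMUTHS = np.array([0.0, 30.0, -30.0, 110.0, -110.0])

def pan_gains(source_azimuth):
    """Gains for one source, energy split between the two nearest speakers."""
    # Angular distance to each speaker, wrapped to [-180, 180].
    diffs = np.abs(((SPEAKER_AZIMUTHS - source_azimuth + 180.0) % 360.0) - 180.0)
    nearest = np.argsort(diffs)[:2]
    w = 1.0 / np.maximum(diffs[nearest], 1e-6)   # closer speaker gets more weight
    g = np.zeros(len(SPEAKER_AZIMUTHS))
    g[nearest] = np.sqrt(w / w.sum())            # energy-preserving split
    return g

def render(sources, azimuths):
    """Mix (n_sources, n_samples) signals into five speaker feeds."""
    gains = np.stack([pan_gains(a) for a in azimuths])  # (n_sources, 5)
    return gains.T @ sources                            # (5, n_samples)
```

A call such as `render(sources, [0.0, 45.0, -120.0])` then yields one signal per speaker, the superposition of which places the three virtual sources around the listener.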
Different techniques exist for encoding sound scenes.
For example, one technique comprises determining description elements of the sound scene, then compressing each of the monophonic signals. The data resulting from these compressions, together with the description elements, are then supplied to the decoder.
Rate adaptability (also called scalability) according to this first technique can therefore be achieved by adapting the rate during the compression operations, but only according to criteria that optimize the quality of each signal considered individually.
Another encoding technique, which is used in the “MPEG Audio Surround” encoder (cf. “Text of ISO/IEC FDIS 23003-1, MPEG Surround”, ISO/IEC JTC1/SC29/WG11 N8324, July 2006, Klagenfurt, Austria), comprises the extraction and the encoding of spatial parameters from all of the monophonic audio signals on the different channels. These signals are then mixed in order to obtain a monophonic or stereophonic signal which is then compressed by a standard mono or stereo encoder (for example of MPEG-4 AAC, HE-AAC, etc. type). At the level of the decoder, the synthesis of the 3D sound scene is carried out based on the spatial parameters and the decoded mono or stereo signal.
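The downmix-plus-parameters principle underlying this technique can be sketched in a few lines. The sketch below is a strong simplification and not the actual MPEG Surround processing: real spatial parameters are computed per time-frequency tile and include inter-channel level, phase, and correlation cues, whereas here a single broadband gain per channel is assumed for illustration.

```python
import numpy as np

# Simplified illustration of the downmix/parameter principle:
# extract one level parameter per channel, downmix to mono, and
# re-spatialize at the decoder from the mono signal and the parameters.
def spatial_encode(channels):
    """channels: (n_channels, n_samples). Returns (mono, level_params)."""
    energies = np.sum(channels ** 2, axis=1)
    mono = channels.mean(axis=0)                     # passive mono downmix
    mono_energy = np.sum(mono ** 2) + 1e-12
    level_params = np.sqrt(energies / mono_energy)   # per-channel gain cues
    return mono, level_params

def spatial_decode(mono, level_params):
    """Rebuild one signal per channel from the mono downmix and the gains."""
    return np.outer(level_params, mono)
```

The mono signal would then be compressed by a standard mono encoder, while the (small) set of parameters travels as side information; the decoder restores each channel's energy, though not the fine inter-channel structure discarded by the downmix.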
The rate adaptability with this other technique can thus be achieved using a hierarchical mono or stereo encoder, but it is achieved according to a criterion of optimization of the quality of the monophonic or stereophonic signal.
Moreover, the PSMAC (Progressive Syntax-rich Multichannel Audio Codec) method makes it possible to encode the signals of the different channels by using the KLT (Karhunen-Loève Transform), which serves mainly to decorrelate the signals and which corresponds to a principal-component decomposition in a space representing the statistics of the signals. It makes it possible to distinguish the highest-energy components from the lowest-energy components.
Rate adaptability is then obtained by cancelling the lowest-energy components. However, these components can sometimes be of great significance to the overall audio quality.
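The KLT-based scheme described above can be sketched as follows: the covariance matrix of the channel signals is diagonalized, the signals are projected onto its eigenvectors (the decorrelated components), and rate reduction discards the lowest-energy components. This is a minimal sketch of the principle, not the PSMAC codec itself; the function names and the simple truncation rule are assumptions.

```python
import numpy as np

# Sketch of KLT-based encoding: decorrelate the channels via an
# eigendecomposition of their covariance matrix, then achieve rate
# adaptability by keeping only the highest-energy components
# (the step whose perceptual cost the text questions).
def klt_encode(channels, keep):
    """channels: (n_channels, n_samples); keep: number of components kept."""
    cov = np.cov(channels)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # highest-energy first
    basis = eigvecs[:, order[:keep]]          # (n_channels, keep)
    components = basis.T @ channels           # decorrelated component signals
    return components, basis

def klt_decode(components, basis):
    """Approximate the channels from the retained components."""
    return basis @ components
```

With `keep` equal to the number of channels the reconstruction is exact; each decrement of `keep` removes the component of lowest energy, which, as noted above, is not necessarily the component of lowest perceptual importance.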
Thus, although the known techniques give good results with respect to rate adaptability, none proposes a completely satisfactory rate adaptability method based on a criterion of optimization of the overall audio quality, that is, one aimed at defining compressed data that optimize the overall audio quality perceived during the restitution of the decoded 3D sound scene.
Moreover, none of the known 3D sound scene encoding techniques allows rate adaptability based on a criterion of optimization of the spatial resolution during the restitution of the 3D sound scene. Such adaptability would make it possible to guarantee that each rate reduction degrades as little as possible both the precision with which the sound sources are located in space and the size of the restitution zone, which must be as wide as possible around the listener's head.
Moreover, none of the known 3D sound scene encoding techniques allows rate adaptability which would directly guarantee optimum quality whatever sound rendering system is used for the restitution of the 3D sound scene. Current encoding algorithms are defined so as to optimize quality for one particular configuration of the sound rendering system. For example, in the case of the “MPEG Audio Surround” encoder described above used with hierarchical encoding, direct listening with a headset or two speakers, or monophonic listening, is possible. If it is desired to use the compressed bitstream with a sound rendering system of type 5.1 or 7.1, additional processing is required at the level of the decoder, for example using OTT (“One-To-Two”) boxes to generate the five signals from the two decoded signals. These boxes make it possible to obtain the desired number of signals for a sound rendering system of type 5.1 or 7.1, but they do not reproduce the true spatial aspect. Moreover, they do not guarantee adaptability to sound rendering systems other than those of types 5.1 and 7.1.