The present invention relates to audio signal encoding devices intended, in particular, for use in applications that store or transmit digitized and compressed audio signals.
The invention relates more precisely to hierarchical audio encoding systems capable of providing variable rates by dividing the information relating to an audio signal to be encoded into hierarchized subsets, so that the subsets can be used in order of their importance to the restitution quality of the audio signal. The criterion taken into account for determining the order is an optimization criterion (or rather a least degradation criterion) of the quality of the encoded audio signal. Hierarchical encoding is particularly suited to transmission over heterogeneous networks or networks whose available rates vary over time, and to transmission to terminals having different or variable characteristics.
The invention relates more particularly to the hierarchical encoding of 3D sound scenes. A 3D sound scene includes a plurality of audio channels corresponding to monophonic audio signals and is also referred to as spatialized sound.
An encoded sound scene is intended to be reproduced on a sound rendering system, which can include an ordinary headset, the two speakers of a computer, or a Home Cinema 5.1 type of system with five speakers (one speaker near the screen; in front of the theoretical listener, one speaker to the left and one speaker to the right; behind the theoretical listener, one speaker to the left and one speaker to the right), or the like.
For example, consider an original sound scene comprising three distinct sound sources located at various locations in space. The signals describing this sound scene are encoded by an encoder. The data derived from this encoding is transmitted to the decoder, and then decoded. The decoded data is processed so as to generate five signals intended for the five speakers of the sound reproduction system in question. Each of the five speakers broadcasts one of the signals, the set of signals broadcast by the speakers synthesizing the 3D sound scene and therefore locating three virtual sound sources in space.
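By way of illustration only, the rendering step of the preceding example can be sketched as pairwise constant-power amplitude panning. This is a minimal sketch and not a method disclosed in this document: the speaker azimuths (0, ±30 and ±110 degrees, in the manner of a conventional five-speaker layout), the sign convention for azimuths, and the names `SPEAKER_AZIMUTHS`, `pan_gains` and `render` are all assumptions introduced for the example.

```python
import math

# Assumed speaker azimuths in degrees (0 = straight ahead), in ascending order.
# The layout and the sign convention are illustrative assumptions.
SPEAKER_AZIMUTHS = [-110.0, -30.0, 0.0, 30.0, 110.0]


def pan_gains(azimuth_deg):
    """Return one gain per speaker for a virtual source at azimuth_deg.

    The source is shared between the two speakers adjacent to it, with
    cosine/sine gains so that the sum of squared gains is always 1.
    """
    az = (azimuth_deg + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
    n = len(SPEAKER_AZIMUTHS)
    gains = [0.0] * n
    for i in range(n):
        j = (i + 1) % n  # next speaker around the circle
        start = SPEAKER_AZIMUTHS[i]
        span = (SPEAKER_AZIMUTHS[j] - start) % 360.0  # arc between the pair
        offset = (az - start) % 360.0  # source position along that arc
        if offset <= span:
            frac = offset / span
            gains[i] = math.cos(frac * math.pi / 2.0)
            gains[j] = math.sin(frac * math.pi / 2.0)
            break
    return gains


def render(sources):
    """Mix (samples, azimuth_deg) pairs into one feed per speaker."""
    length = len(sources[0][0])
    feeds = [[0.0] * length for _ in SPEAKER_AZIMUTHS]
    for samples, azimuth_deg in sources:
        for spk, gain in enumerate(pan_gains(azimuth_deg)):
            if gain:
                for k in range(length):
                    feeds[spk][k] += gain * samples[k]
    return feeds


if __name__ == "__main__":
    # Three virtual sources at different locations, rendered to five feeds.
    tone = [math.sin(0.05 * k) for k in range(8)]
    speaker_feeds = render([(tone, 0.0), (tone, 110.0), (tone, -70.0)])
    print(len(speaker_feeds), "speaker feeds of", len(speaker_feeds[0]), "samples")
```

In such a sketch the positioning accuracy of each virtual source depends on the gains applied to the decoded signals, which is why the fidelity of the transmitted data bears directly on spatial resolution.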
Spatial resolution or spatial accuracy measures the degree of fineness of the location of the sound sources in space. Increased spatial resolution enables finer positioning of the sound objects in the room and enables a broader restitution area around the listener's head.
Various techniques exist for encoding sound scenes.
For example, one technique used includes the determination of elements describing the sound scene, and then operations for compressing each of the monophonic signals. The data derived from these compressions and the description elements are then supplied to the decoder.
Rate adaptability (also called scalability) according to this first technique can thus be accomplished by adapting the rate during the compression operations, but it is carried out according to criteria for optimizing the quality of each signal considered individually. During the encoding operation, no account is taken of the spatial accuracy of the 3D scene resulting from the restitution of the various signals.
Another encoding technique, which is used in the “MPEG Audio Surround” encoder (cf. “Text of ISO/IEC FDIS 23003-1, MPEG Surround”, ISO/IEC JTC1/SC29/WG11 N8324, July 2006, Klagenfurt, Austria), includes the extraction and encoding of spatial parameters from all of the monophonic audio signals on the various channels. These signals are then mixed to obtain a monophonic or stereophonic signal, which is then compressed by a conventional mono or stereo encoder (e.g., of the MPEG-4 AAC or HE-AAC type, etc.). At the decoder level, synthesis of the 3D sound scene is carried out from the spatial parameters and the decoded mono or stereo signal.
With this other technique, rate adaptability can thus be achieved by using a hierarchical mono or stereo encoder, but it is carried out according to a criterion for optimizing the quality of the monophonic or stereophonic signal, and likewise takes no account of the quality of the spatial resolution.
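The principle of the parametric approach described above (a downmix accompanied by spatial parameters) can be sketched as follows. This is a deliberately simplified illustration and not the MPEG Surround algorithm: it uses a single broadband level parameter per channel, whereas an actual encoder works on time-frequency tiles with several kinds of parameters. The function names `encode_parametric` and `decode_parametric` are hypothetical.

```python
import math


def encode_parametric(channels):
    """Return (mono downmix, one level parameter per channel).

    The level parameter is the ratio of the channel's RMS level to that of
    the downmix; it stands in for the "spatial parameters" of the text.
    """
    length = len(channels[0])
    downmix = [sum(ch[k] for ch in channels) / len(channels)
               for k in range(length)]
    mix_energy = sum(v * v for v in downmix)
    levels = []
    for ch in channels:
        ch_energy = sum(v * v for v in ch)
        levels.append(math.sqrt(ch_energy / mix_energy) if mix_energy > 0 else 0.0)
    return downmix, levels


def decode_parametric(downmix, levels):
    """Resynthesize each channel by rescaling the decoded downmix."""
    return [[level * v for v in downmix] for level in levels]


if __name__ == "__main__":
    source = [math.sin(0.1 * k) for k in range(100)]
    # Three channels carrying the same source at different levels.
    channels = [[a * v for v in source] for a in (1.0, 0.5, 0.25)]
    downmix, levels = encode_parametric(channels)
    decoded = decode_parametric(downmix, levels)
    print("level parameters:", [round(level, 3) for level in levels])
```

In this sketch, any rate reduction applied by the mono encoder acts on `downmix` alone, so the quality criterion driving that reduction is the quality of the mixed signal rather than the accuracy with which the channels can be separated again.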
In addition, the PSMAC (Progressive Syntax-Rich Multichannel Audio Codec) method enables encoding of the signals from various channels by using the KLT (Karhunen-Loève Transform), which is primarily useful for decorrelating the signals and which corresponds to a principal components decomposition in a space representing the signal statistics. It makes it possible to distinguish the more energetic components from the less energetic components.
The rate adaptability relies on discarding the less energetic components and takes no account of spatial accuracy.
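The energy-based adaptability just described can be illustrated with a two-channel KLT, for which the principal axes have a closed form. This is a minimal sketch under stated assumptions, not the PSMAC implementation: it handles only two channels, treats the signals as zero-mean, and the names `klt_2ch`, `inverse_klt_2ch` and `energy` are hypothetical.

```python
import math


def energy(signal):
    """Sum of squared samples."""
    return sum(v * v for v in signal)


def klt_2ch(left, right):
    """Rotate two channels onto their principal axes.

    Returns (theta, primary, secondary), where primary carries the larger
    share of the energy and the two outputs are uncorrelated.
    """
    n = len(left)
    # Second-order statistics (signals assumed zero-mean).
    c_ll = energy(left) / n
    c_rr = energy(right) / n
    c_lr = sum(x * y for x, y in zip(left, right)) / n
    # Closed-form principal-axis angle of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * c_lr, c_ll - c_rr)
    c, s = math.cos(theta), math.sin(theta)
    primary = [c * x + s * y for x, y in zip(left, right)]
    secondary = [-s * x + c * y for x, y in zip(left, right)]
    return theta, primary, secondary


def inverse_klt_2ch(theta, primary, secondary):
    """Undo the rotation performed by klt_2ch."""
    c, s = math.cos(theta), math.sin(theta)
    left = [c * p - s * q for p, q in zip(primary, secondary)]
    right = [s * p + c * q for p, q in zip(primary, secondary)]
    return left, right


if __name__ == "__main__":
    left = [math.sin(0.1 * k) for k in range(200)]
    right = [0.5 * math.sin(0.1 * k) + 0.01 * math.cos(1.7 * k) for k in range(200)]
    theta, primary, secondary = klt_2ch(left, right)
    # Rate reduction: drop the less energetic component before reconstruction.
    approx_left, approx_right = inverse_klt_2ch(theta, primary, [0.0] * len(primary))
    print("energy kept:", energy(primary), "energy dropped:", energy(secondary))
```

The component that is dropped here is chosen solely by its energy; nothing in the selection reflects how much the missing component contributes to locating the sources in space.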
Thus, although the known techniques yield good results in terms of rate adaptability, none of the known 3D sound scene encoding techniques enables rate adaptability on the basis of a criterion for optimizing spatial resolution during restitution of the 3D sound scene. Such adaptability would make it possible to guarantee that each reduction in rate would harm the positioning accuracy of the sound sources in space as little as possible.
Furthermore, none of the known 3D sound scene encoding techniques enables a rate adaptability which makes it possible to directly guarantee optimal quality, irrespective of the sound rendering system used for restitution of the 3D sound scene. The current encoding algorithms are defined to optimize quality with respect to a particular configuration of the sound reproduction system. In the case of the above-described “MPEG Audio Surround” encoder, for example, direct listening with a headset or two speakers, or monophonic listening, is possible. If it is desired to process the compressed bit stream with a 5.1 or 7.1-type sound reproduction system, additional processing must be implemented at the decoder level, e.g., by means of OTT (One-To-Two) boxes, in order to generate the five or seven signals from the two decoded signals. These boxes make it possible to obtain the desired number of signals in the case of a 5.1 or 7.1-type sound reproduction system, but do not make it possible to reproduce the real spatial aspect. Furthermore, these boxes do not guarantee adaptability to sound reproduction systems other than those of the 5.1 or 7.1 type.