It relates in particular, but not exclusively, to a method for processing acoustic data originating from a three-dimensional sound scene, the method being capable of extracting an item of information relating to the spatial position of sound sources. It can be used equally well in applications for spatialized sound take within the context of conversational services and for the recording of 3D audio content (for example a concert, a soundscape, etc.).
Various methods of spatialized sound take are known. Some seek to capture the information utilized by the auditory system (binaural technology for example) while others break down the acoustic field so as to reproduce more or less rich spatial information which will be interpreted by the listener (ambisonic technology for example).
A first method consists of a stereophonic sound take. The differences of phase and/or time, and of amplitude, between signals originating from two microphones are utilized in order to recreate stimuli constituting a rough approximation to natural listening. These signals are restored via a pair of loudspeakers always placed facing the listener and aligned in the horizontal plane. In such a configuration, all information originating from behind the listener and all concept of elevation are lost. In order to enrich the rear of the sound scene, numerous solutions have been proposed. In particular, such solutions generally consist of an increase in the number of sensors targeting the sought directions. Provision can also be made for matrixing the stereophonic signals in order to enrich the rear of the sound scene. Such solutions gave rise to quadraphonic, 5.1 and 7.1 multichannel systems.
However, the stereophonic sound take is still limited to the frontal horizontal plane, or the horizontal plane in the case of the multichannel extensions of the 5.1 type. In other words, in the best case, with spherical coordinates, it is only capable of identifying the azimuth information of the sound sources (the coordinates of the sources in a horizontal plane x-y), without, however, being able to identify their elevation information.
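By way of illustration only, the loss of rear and elevation information in a two-microphone sound take can be sketched numerically. The sketch below assumes two omnidirectional microphones spaced along the y axis, a far-field source, and a speed of sound of 343 m/s; the function name `inter_mic_delay` and the 0.2 m spacing are hypothetical choices made for this example and do not come from the documents cited here.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def inter_mic_delay(azimuth, elevation, spacing=0.2):
    """Far-field arrival-time difference (in seconds) between two
    microphones placed `spacing` metres apart on the y axis.

    Angles are in radians; azimuth is measured from the x (front) axis.
    """
    # Only the projection of the source direction onto the microphone
    # axis (the y component of the unit vector) affects the delay.
    return spacing * math.sin(azimuth) * math.cos(elevation) / SPEED_OF_SOUND


# Three different source positions that share the same y projection (0.5):
front = inter_mic_delay(math.radians(30), 0.0)                  # front, horizontal
back = inter_mic_delay(math.radians(150), 0.0)                  # mirrored behind the pair
elevated = inter_mic_delay(math.radians(90), math.radians(60))  # raised above the plane
```

Under these assumptions the three positions yield the same time difference, so the delay alone cannot separate a frontal source from a rear or elevated one.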
A second method consists of a binaural sound take. Binaural technology allows capture and restoration imitating natural listening, allowing in particular the localization of a source within the entire space surrounding the listener, using only two microphones. The microphones are placed in the ears of a person or of a dummy in order to record the acoustic scene together with the acoustic cues used in natural localization.
However, direct sound take using binaural technology has various drawbacks. In fact, when the sound take is carried out on the head of a person, the person wearing the microphones must remain immobile, control his respiration and avoid swallowing in order not to degrade the quality of the recording. The use of an artificial head is difficult to envisage when an unobtrusive, portable use is sought. At the time of reproduction, the mismatch between the transfer functions relating to the head (“Head-Related Transfer Function” or HRTF) of the capture device and those of the final listener tends to distort the localization of the sources. Furthermore, when the final listener moves his head, the entire sound scene is displaced.
Thus, although binaural sound take is capable of encoding the spatial information of the sources in any three-dimensional space, such encoding is specific to the morphology of the person or the dummy which was used for the recording. To date, no satisfactory solution has been proposed to remedy these limitations. An additional drawback is that a binaural recording can only be listened to on specific dedicated equipment such as headphones or a system of loudspeakers, combined with pre-processing.
A third method consists of an ambisonic sound take by capture of the sound field. Such a technology was introduced in document U.S. Pat. No. 4,042,779 for first-order spherical harmonics, and its extension to higher orders (higher-order ambisonics or HOA) was described for example in the document J. Daniel, “Représentation de champs acoustiques, application à la transmission et à la reproduction de scènes sonores complexes dans un contexte multimédia” [Representation of acoustic fields, application to the transmission and reproduction of complex sound scenes in a multimedia context], Université Paris 6, Paris, 2001. The techniques described in these documents allow more or less accurate acquisition of the sound scene, depending on the order of the spherical harmonics used.
However, such technology has the drawback of using a large number of sensors, which is a function of the desired order. First-order ambisonic technology has been widely exploited due to the small number of sensors required for its implementation (four microphones, see U.S. Pat. No. 4,042,779). The signals originating from the four microphones are combined by matrixing (encoding) into four derived signals, which define the B-format of ambisonic technology. The signals derived by matrixing correspond to the signals which would have been recorded by an omnidirectional microphone and three pressure-gradient (velocity) microphones oriented along axes x, y and z. The four derived signals are recorded and can then be reproduced to a listener by using a system of arbitrarily distributed loudspeakers by means of a decoding matrix. The loudspeakers chosen in this way can also be rendered in the form of virtual sources for binaural reproduction, using the HRTF transfer functions relating to the position of each source.
Thus, the ambisonic sound take is also capable of encoding the spatial information of the sources throughout the 3D space, but it has the drawback of requiring a large number of sensors, namely a minimum of 4, and potentially an even greater number when satisfactory spatial accuracy is sought.
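As a non-limiting illustration of the matrixing of four microphone signals into B-format, the following sketch assumes an idealized tetrahedral array of four coincident cardioid capsules and a single far-field source. The capsule names (`LFU`, `RFD`, `LBD`, `RBU`), their orientations and the helper functions are assumptions made for this example. The frequency-dependent equalization and gain normalization required by a real array are omitted.

```python
import math

# Assumed capsule axes (x front, y left, z up) of a tetrahedral array:
# left-front-up, right-front-down, left-back-down, right-back-up.
CAPSULES = {
    "LFU": (1, 1, 1),
    "RFD": (1, -1, -1),
    "LBD": (-1, 1, -1),
    "RBU": (-1, -1, 1),
}


def unit_vector(azimuth, elevation):
    """Unit vector pointing toward the source (angles in radians)."""
    return (math.cos(azimuth) * math.cos(elevation),
            math.sin(azimuth) * math.cos(elevation),
            math.sin(elevation))


def capsule_gains(azimuth, elevation):
    """Gain of each idealized cardioid capsule for a far-field source."""
    u = unit_vector(azimuth, elevation)
    gains = {}
    for name, axis in CAPSULES.items():
        # Cosine of the angle between the capsule axis and the source.
        cos_angle = sum(a * b for a, b in zip(axis, u)) / math.sqrt(3)
        gains[name] = 0.5 * (1.0 + cos_angle)
    return gains


def matrix_to_b_format(a):
    """Sum-and-difference matrixing of the four capsule signals."""
    w = a["LFU"] + a["RFD"] + a["LBD"] + a["RBU"]  # omnidirectional
    x = a["LFU"] + a["RFD"] - a["LBD"] - a["RBU"]  # front minus back
    y = a["LFU"] - a["RFD"] + a["LBD"] - a["RBU"]  # left minus right
    z = a["LFU"] - a["RFD"] - a["LBD"] + a["RBU"]  # up minus down
    return w, x, y, z


gains = capsule_gains(math.radians(40), math.radians(25))
w, x, y, z = matrix_to_b_format(gains)

# The X, Y, Z components are proportional to the source direction.
azimuth_deg = math.degrees(math.atan2(y, x))
elevation_deg = math.degrees(math.atan2(z, math.hypot(x, y)))
```

In this idealized model the omnidirectional component is independent of the source direction, while the three difference signals behave like figure-of-eight microphones aligned with axes x, y and z.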
Post-processing combined with the spatialized sound take can also be envisaged, in order to overcome the drawbacks detailed above.
In particular, such processing methods are applied in order to improve the extraction of the spatial information. To date, post-processing has been applied to signals of the ambisonic type, because the latter give access to a physical representation of the acoustic waves.
The document by V. Pulkki, “Directional audio coding in spatial sound reproduction and stereo upmixing”, in Proc. of the AES 28th Int. Conf., Piteå, Sweden, 2006, proposes a method for extracting the localization information of the sources from B-format signals. The objective of such a method is to obtain a more compact representation of the three-dimensional sound scene (data compression), in which the four B-format signals are reduced to a single monophonic signal accompanied by a signal containing the localization information of the sound sources.
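Purely as an illustration of how localization information may be derived from B-format signals, the sketch below estimates the direction of a single plane wave from the time-averaged products of the omnidirectional signal W with the directional signals X, Y and Z. It is a simplified broadband example and does not reproduce the method of the cited document, which operates on a time-frequency representation. The function names, the W = s/√2 convention and the test signal are assumptions made for this example.

```python
import math


def encode_b_format(samples, azimuth, elevation):
    """Encode a mono plane-wave signal into B-format (W, X, Y, Z) frames.

    Assumes the traditional convention in which W carries the signal
    scaled by 1/sqrt(2); angles are in radians.
    """
    ux = math.cos(azimuth) * math.cos(elevation)
    uy = math.sin(azimuth) * math.cos(elevation)
    uz = math.sin(elevation)
    return [(s / math.sqrt(2), s * ux, s * uy, s * uz) for s in samples]


def estimate_direction(frames):
    """Estimate (azimuth, elevation) in radians from accumulated
    W*X, W*Y and W*Z products, which form an intensity-like vector."""
    ix = sum(w * x for w, x, _, _ in frames)
    iy = sum(w * y for w, _, y, _ in frames)
    iz = sum(w * z for w, _, _, z in frames)
    # With this encoding the accumulated vector points toward the source.
    return math.atan2(iy, ix), math.atan2(iz, math.hypot(ix, iy))


# Hypothetical test signal: 10 ms of a 440 Hz tone sampled at 48 kHz.
signal = [math.sin(2 * math.pi * 440 * n / 48000) for n in range(480)]
frames = encode_b_format(signal, math.radians(-120), math.radians(35))
az, el = estimate_direction(frames)
az_deg, el_deg = math.degrees(az), math.degrees(el)
```

With several simultaneous sources or with reverberation, a single broadband estimate of this kind would no longer point at any one source, which is one reason such analyses are generally carried out per frequency band.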
An improvement to this method was proposed in the document by N. Barrett and S. Berge, “A new method for B-format to binaural transcoding”, in 40th AES International Conference, Tokyo, Japan, 2010, pp. 8-10. This improvement provides for the use of the localization information in order to spatialize the virtual sound sources with a view to reproduction via loudspeakers or binaural transcoding. The virtual sound sources are thus re-spatialized afterwards in accordance with their identified position, in the spatialization format associated with the reproduction device.
However, regardless of the preceding method or its improved version, the position of the sources is determined with an ambiguity (typically an angular ambiguity of ±π/2 on the azimuth angle in the document by V. Pulkki), which is not resolved. The position of the sound source is then not known with certainty.