The present invention relates to audio processing and, more particularly, to the three-dimensional spatialization of synthetic sound sources.
Currently, the spatialization of a synthetic sound source is often performed without taking account of the sound production mode, that is, of the way in which the sound is synthesized. Many models, notably parametric ones, have been proposed for the synthesis. In parallel, numerous spatialization techniques have also been proposed, without, however, being linked to the technique chosen for the synthesis.
Known among the synthesis techniques are the so-called “non-parametric” methods. No particular parameter is used a priori to modify samples previously stored in memory. The best known representative of these methods is the conventional wave table synthesis.
Contrasting with this type of technique are the “parametric” synthesis methods which rely on the use of a model for manipulating a reduced number of parameters, compared to the number of signal samples produced in the non-parametric methods. The parametric synthesis techniques typically rely on additive, subtractive, source/filter or non-linear models.
Among these parametric methods, the term “mutual” can be used to qualify those that make it possible to jointly manipulate parameters corresponding to different sound sources, and then to use only a single synthesis process for all the sources. In the so-called “sinusoidal” methods, typically, a frequency spectrum is constructed from parameters such as the amplitude and the frequency of each partial component of the overall sound spectrum of the sources. An inverse Fourier transform, followed by an overlap-add, then provides an extremely effective synthesis of several sound sources simultaneously.
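By way of illustration only, the single synthesis process described above can be sketched as follows. This is a minimal sketch under simplifying assumptions that are not taken from the source: each partial falls exactly on a transform bin, a naive inverse discrete Fourier transform stands in for a fast one, and the frames are overlapped by half their length with a Hann window. The names build_spectrum, inverse_dft, overlap_add and synthesize are hypothetical.

```python
import cmath
import math


def build_spectrum(partials, n):
    """Accumulate the partials of ALL sources into one length-n spectrum.

    partials: iterable of (bin_index, amplitude, phase), 0 < bin_index < n/2.
    Assumption: every partial lies exactly on a transform bin.
    """
    spectrum = [0j] * n
    for k, amp, phase in partials:
        c = (amp * n / 2) * cmath.exp(1j * phase)
        spectrum[k] += c
        spectrum[n - k] += c.conjugate()  # Hermitian symmetry: real output
    return spectrum


def inverse_dft(spectrum):
    """Naive O(n^2) inverse DFT, used here only to keep the sketch short."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
            for k in range(n)).real / n
        for t in range(n)
    ]


def overlap_add(frames, hop):
    """Window each frame (periodic Hann) and add it at multiples of hop."""
    n = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + n)
    for f, frame in enumerate(frames):
        for t, value in enumerate(frame):
            w = 0.5 - 0.5 * math.cos(2 * math.pi * t / n)
            out[f * hop + t] += w * value
    return out


def synthesize(sources, n, hop, num_frames):
    """One spectrum and one inverse transform per frame, for all sources."""
    frames = []
    for f in range(num_frames):
        partials = [
            # Advance each phase so that successive frames stay continuous.
            (k, amp, phase + 2 * math.pi * k * hop * f / n)
            for source in sources
            for (k, amp, phase) in source
        ]
        frames.append(inverse_dft(build_spectrum(partials, n)))
    return overlap_add(frames, hop)
```

With hop equal to n/2, the Hann windows of two overlapping frames sum to one, so the fully overlapped part of the output is the plain sum of the sinusoids of all the sources, obtained with a single inverse transform per frame regardless of the number of sources.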
Regarding the spatialization of sound sources, different techniques are currently known. Some techniques (such as “transaural” or “binaural”) are based on taking into account HRTFs (“Head Related Transfer Functions”), which represent the disturbance of acoustic waves by the morphology of an individual and are specific to that individual. The sound playback is adapted to the HRTFs of the listener, typically on two remote loudspeakers (“transaural”) or from the two earpieces of a headset (“binaural”). Other techniques (for example “ambiophonic” or “multichannel”, 5.1 to 10.1 or above) are geared more towards playback on more than two loudspeakers.
More specifically, certain HRTF-based techniques use the separation of the “frequency” and “position” variables of the HRTFs, thus giving a set of p basic filters (corresponding to the first p eigenvalues of the covariance matrix of the HRTFs, the statistical variables of which are the frequencies), these filters being weighted by spatial functions (obtained by projection of the HRTFs onto the basic filters). The spatial functions can then be interpolated, as described in the document U.S. Pat. No. 5,500,900.
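The separation of the “frequency” and “position” variables can be illustrated by the following sketch, which is not the method of the cited document. It assumes real-valued HRTF magnitudes stored as one row per position, extracts the leading eigenvectors of the covariance matrix by power iteration with deflation (a swapped-in, simple numerical method), and assumes that p does not exceed the rank of the data. All function names are hypothetical.

```python
import math


def mean_vector(rows):
    """Mean over positions, for each frequency."""
    return [sum(r[i] for r in rows) / len(rows) for i in range(len(rows[0]))]


def covariance(rows, mean):
    """Covariance matrix whose statistical variables are the frequencies."""
    f = len(mean)
    return [
        [sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / len(rows)
         for b in range(f)]
        for a in range(f)
    ]


def dominant_eigenpair(matrix, iterations=200):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    size = len(matrix)
    v = [float(i + 1) for i in range(size)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(size))
              for i in range(size))
    return lam, v


def basis_filters(rows, p):
    """Return the mean and the p basic filters (leading eigenvectors)."""
    mean = mean_vector(rows)
    cov = covariance(rows, mean)
    filters = []
    for _ in range(p):
        lam, v = dominant_eigenpair(cov)
        filters.append(v)
        # Deflation: remove the component just found before the next pass.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(len(v))]
               for a in range(len(v))]
    return mean, filters


def spatial_weights(row, mean, filters):
    """Spatial functions at one position: projection onto the basic filters."""
    return [sum((row[i] - mean[i]) * b[i] for i in range(len(mean)))
            for b in filters]


def interpolate_weights(w_a, w_b, t):
    """Linear interpolation of the spatial functions between two positions."""
    return [(1.0 - t) * a + t * b for a, b in zip(w_a, w_b)]


def reconstruct(weights, mean, filters):
    """Rebuild an HRTF as the mean plus the weighted sum of basic filters."""
    return [mean[i] + sum(w * b[i] for w, b in zip(weights, filters))
            for i in range(len(mean))]
```

In this sketch, only the p weights depend on the position, so that an intermediate position is obtained by interpolating p numbers rather than a whole filter.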
The spatialization of numerous sound sources can be performed using a multichannel implementation applied to the signal of each of the sound sources. The gains of the spatialization channels are applied directly to the sound samples of the signal, often described in the time domain (but possibly also in the frequency domain). These sound samples are processed by a spatialization algorithm (with applications of gains that are a function of the desired position), independently of the origin of these samples. Thus, the proposed spatialization could be applied equally to natural sounds and to synthetic sounds.
On the one hand, each sound source must be synthesized independently (with a time or frequency signal obtained), in order to be able to then apply independent spatialization gains. For N sound sources, it is therefore necessary to perform N synthesis calculations.
On the other hand, the application of the gains to sound samples, whether deriving from the time or frequency domain, requires at least as many multiplications as there are samples. For a block of Q samples, it is therefore necessary to apply at least N·M·Q gains, M being the number of intermediate channels (ambiophonic channels for example) and N being the number of sources.
Thus, this technique entails a high calculation cost in the case of the spatialization of numerous sound sources.
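The calculation cost discussed above can be made concrete with the following sketch, which applies the spatialization gains sample by sample and counts the multiplications. The function name spatialize_per_sample and the data layout (one block of Q samples per source, one gain per source and per channel) are assumptions made for the illustration.

```python
def spatialize_per_sample(source_blocks, gains):
    """Apply per-source, per-channel gains directly to the sound samples.

    source_blocks[n][q]: sample q of source n (N sources, Q samples each).
    gains[n][m]: spatialization gain of source n on intermediate channel m.
    Returns (channels, multiplications), with channels[m][q].
    """
    n_sources = len(source_blocks)
    n_channels = len(gains[0])
    block_len = len(source_blocks[0])
    channels = [[0.0] * block_len for _ in range(n_channels)]
    multiplications = 0
    for n in range(n_sources):
        for m in range(n_channels):
            g = gains[n][m]
            for q in range(block_len):
                channels[m][q] += g * source_blocks[n][q]
                multiplications += 1
    return channels, multiplications
```

The counter returned by this sketch is equal to N·M·Q: for example, N = 3 sources, M = 4 channels and Q = 8 samples give 96 multiplications per block, and this is in addition to the N separate synthesis calculations.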
Among the ambiophonic techniques, the so-called “virtual loudspeaker” method makes it possible to encode the signals to be spatialized by applying gains to them, the decoding being performed by convolution of the encoded signals with pre-calculated filters (Jérôme Daniel, “Représentation de champs acoustiques, application à la transmission et à la reproduction de scènes sonores complexes dans un contexte multimédia”, [Representation of acoustic fields, application to the transmission and reproduction of complex sound scenes in a multimedia context], doctoral thesis, 2000).
A very promising technique, combining synthesis and spatialization, has been presented in the document WO-05/069272.
It consists in determining amplitudes to be assigned to signals representing sound sources, to define both the sound intensity (for example a “volume”) of a source to be synthesized and a spatialization gain of this source. This document notably discloses a binaural spatialization with delays and gains (or “spatial functions”) taken into account and, in particular, a mixing of the synthesized sources in the spatialization encoding part.
Even more particularly, an exemplary embodiment which is targeted in this document WO-05/069272 and in which the sources are synthesized by associating amplitudes with constitutive frequencies of a “tone” (for example a fundamental frequency and its harmonics) provides for synthesis signals to be grouped together by identical frequencies, with a view to subsequent spatialization applied to the frequencies.
This exemplary embodiment is illustrated in FIG. 1. In a synthesis block SYNTH (represented by broken lines), respective amplitudes a01, a11, . . . , ap1, . . . , aij, . . . , a0N, a1N, . . . , apN are assigned to the frequencies f0, f1, f2, . . . , fp of each source to be synthesized S1, . . . , SN, where, in the general notation aij, j is a source index between 1 and N and i is a frequency index between 0 and p. Certain amplitudes of a set a0j, a1j, . . . , apj assigned to one and the same source j can of course be zero if the corresponding frequencies are not represented in the tone of this source j.
The amplitudes ai1, . . . , aiN relating to each frequency fi are grouped together (“mixed”) to be applied, frequency by frequency, to the spatialization block SPAT for an encoding applied to the frequencies (binaurally, for example, by then providing an inter-aural delay to be applied to each source). The signals of the channels c1, . . . , ck, derived from the spatialization block SPAT, are then intended to be transmitted through one or more networks, or even stored, or otherwise dealt with, with a view to subsequent playback (preceded, where appropriate, by a suitable spatialization decoding).
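The grouping of amplitudes by identical frequencies can be pictured with the sketch below. It is an assumed, simplified reading of the scheme of FIG. 1 and not the implementation of the document WO-05/069272: each amplitude aij is weighted by a spatialization gain of source j for each channel and the results are summed over the sources, frequency by frequency; the inter-aural delays are ignored. The function name mix_by_frequency and the data layout are hypothetical.

```python
def mix_by_frequency(amplitudes, gains):
    """Mix the amplitudes of all sources, frequency by frequency.

    amplitudes[i][j]: amplitude aij of frequency fi for source j
                      (zero if fi is absent from the tone of source j).
    gains[j][m]: assumed spatialization gain of source j on channel m.
    Returns mixed[m][i], the amplitude of frequency fi on channel m.
    """
    n_freqs = len(amplitudes)
    n_channels = len(gains[0])
    return [
        [sum(a * g[m] for a, g in zip(amplitudes[i], gains))
         for i in range(n_freqs)]
        for m in range(n_channels)
    ]
```

Under these assumptions, the gains are applied to the p+1 amplitudes of each source rather than to its Q samples, and a single synthesis per channel can follow, whatever the number N of sources.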
This technique, although very promising, still warrants optimizations.
Generally, the current methods require significant computing power to spatialize numerous synthesized sound sources.
The present invention improves the situation.