The present invention is concerned with processing sound signals for their spatialization.
Spatialized sound reproduction allows a listener to perceive sound sources originating from any direction or position in space.
The spatialized sound reproduction techniques to which the present invention pertains are based on the acoustic transfer functions of the head between positions in space and the auditory canal. The term “HRTF” (for “Head Related Transfer Functions”) denotes the frequency-domain form of these transfer functions. Their time-domain form will be denoted hereinafter by “HRIR” (for “Head Related Impulse Response”).
Additionally, the term “binaural” is concerned with reproduction on a stereophonic headset, but with spatialization effects. The present invention is not limited to this technique and applies in particular also to techniques derived from binaural such as so-called “transaural” reproduction techniques, that is to say those on remote loudspeakers. Such techniques can then use what is called “crosstalk cancellation” which consists in canceling the acoustic cross-paths in such a way that a sound, thus processed then emitted by the loudspeakers, can be perceived only by one of a listener's two ears.
In processing for spatialized sound reproduction, a “multichannel” approach consists in producing a representation of the acoustic field in the form of N signals (termed spatial components). These signals contain the whole set of sounds which make up the sound field, but with weightings which depend on their direction (or “incidence”) and which are described by N associated spatial encoding functions. The reconstruction of the sound field, for reproduction at a chosen point, is then ensured by N′ spatial decoding functions (usually with N=N′).
In the particular case of binaural, this decomposition makes it possible to carry out so-called “multichannel binaural” encoding and decoding. The decoding functions (which in reality are filters), associated with a given set of spatial encoding functions (which in reality are encoding gains), when they are optimum in reproduction, ensure a feeling of perfect immersion of the listener within a sound scene, whereas in reality the listener has, for binaural reproduction, only two loudspeakers (earpieces of a headset or remote loudspeakers).
The advantages of a multichannel approach for binaural techniques are manifold since the encoding step is independent of the decoding step.
Thus, in the case of composition of a virtual sound scene on the basis of synthesized or recorded signals, the encoding is generally inexpensive in terms of memory and/or calculations since the spatial functions are gains which depend solely on the incidences of the sources to be encoded and not on the number of sources themselves. The cost of the decoding is also independent of the number of sources to be spatialized.
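The independence between the number of sources and the decoding cost can be illustrated with a minimal sketch. The encoding functions used here (first-order circular harmonics in azimuth, N = 3) and the names `encoding_gains` and `encode` are assumptions chosen for illustration only; they are not taken from the cited documents.

```python
import numpy as np

def encoding_gains(azimuth_rad):
    """Hypothetical encoding functions g_i: first-order circular harmonics (N = 3)."""
    return np.array([1.0, np.cos(azimuth_rad), np.sin(azimuth_rad)])

def encode(sources, azimuths_rad):
    """Mix S mono sources (array of shape S x T) into N spatial components (N x T)."""
    # One column of gains per source: gains[i, s] = g_i(azimuth of source s)
    gains = np.stack([encoding_gains(a) for a in azimuths_rad], axis=1)
    return gains @ sources

rng = np.random.default_rng(0)
two = encode(rng.standard_normal((2, 8)), [0.0, np.pi / 2])
five = encode(rng.standard_normal((5, 8)), np.linspace(0.0, np.pi, 5))

# Both scenes yield N = 3 channels, so the decoder always filters 3 signals per ear.
print(two.shape, five.shape)
```

Whatever the number of sources, the decoder receives the same N channels, which is why its cost does not grow with the size of the sound scene.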
In the case furthermore of a real sound field measured by an array of microphones and encoded according to known spatial functions, it is nowadays possible to find decoding functions which allow satisfactory binaural listening.
Finally, the decoding functions can be individualized for each of the listeners.
The present invention is concerned in particular with improving the way in which the decoding filters and/or the encoding gains are obtained in the multichannel binaural technique. The context is as follows: sources are spatialized by multichannel encoding and the reproduction of the spatially encoded content is performed by applying appropriate decoding filters.
The reference WO-00/19415 discloses a multichannel binaural processing which provides for the calculation of decoding filters. Denoting by:
- $g_i(\theta_p,\varphi_p)$ fixed spatial encoding functions, where $g$ is the gain corresponding to channel $i \in \{1,\dots,N\}$ and to position $p \in \{1,\dots,P\}$ defined by its angles of incidence $\theta$ (azimuth) and $\varphi$ (elevation),
- $L(\theta_p,\varphi_p,f)$ and $R(\theta_p,\varphi_p,f)$ bases of HRTF functions obtained by measuring the acoustic transfer functions of each ear L and R of an individual for a number P of positions in space ($p \in \{1,\dots,P\}$) and for a given frequency $f$,
this document WO-00/19415 essentially envisages two steps for obtaining filters on the basis of these spatial functions.
In the first step, the delays are extracted from each HRTF. Specifically, the shape of a head is customarily such that, for a given position, a sound reaches one ear a certain time before reaching the other ear (a sound situated to the left reaching the left ear before reaching the right ear, of course). The difference in delay t between the two ears is an interaural localization cue called the ITD (for “Interaural Time Difference”). New HRTF bases denoted $\underline{L}$ and $\underline{R}$ are then defined by:

$$\underline{L}(\theta_p,\varphi_p,f) = T_L(\theta_p,\varphi_p)\,L(\theta_p,\varphi_p,f) \quad \text{for } p = 1,2,\dots,P$$

$$\underline{R}(\theta_p,\varphi_p,f) = T_R(\theta_p,\varphi_p)\,R(\theta_p,\varphi_p,f) \quad \text{for } p = 1,2,\dots,P$$

where $T_{L,R} = e^{j 2\pi f t_{L,R}}$, with a delay $t_{L,R}$.
Decoding filters $L_i(f)$ and $R_i(f)$ for channel $i$ which satisfy the equations:

$$\underline{L}(\theta_p,\varphi_p,f) = \sum_{i=1}^{N} g_i(\theta_p,\varphi_p)\,L_i(f) \quad \text{for } p = 1,2,\dots,P$$

$$\underline{R}(\theta_p,\varphi_p,f) = \sum_{i=1}^{N} g_i(\theta_p,\varphi_p)\,R_i(f) \quad \text{for } p = 1,2,\dots,P$$

are obtained in the second step, and these may also be written, in matrix notation, $\underline{L} = G L$ and $\underline{R} = G R$, $G$ denoting a gain matrix.
To obtain these filters, this document proposes a procedure termed “calculation of the pseudo-inverse”, which satisfies the previous equations in the least-squares sense, i.e.:

$$\underline{L} = G L \;\rightarrow\; L = (G^T G)^{-1} G^T \underline{L}$$
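The least-squares calculation described above can be sketched numerically as follows. The dimensions, the random gain matrix and the random delay-free HRTF base are placeholder data chosen for illustration; only the pseudo-inverse formula itself comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
P, N, F = 12, 4, 6   # positions, channels, frequency bins (hypothetical sizes)

# G[p, i] = g_i(theta_p, phi_p): one row per measured position, one column per channel
G = rng.standard_normal((P, N))

# Placeholder delay-free HRTF base for one ear: one row per position, one column per frequency
L_bar = rng.standard_normal((P, F)) + 1j * rng.standard_normal((P, F))

# Normal equations (G^T G) L = G^T L_bar, solved for every frequency bin at once
L_filters = np.linalg.solve(G.T @ G, G.T @ L_bar)   # shape N x F

# The library pseudo-inverse gives the same least-squares solution
L_pinv = np.linalg.pinv(G) @ L_bar

residual = np.linalg.norm(L_bar - G @ L_filters)
print(L_filters.shape, np.allclose(L_filters, L_pinv), residual)
```

With more positions than channels (P > N) the system is overdetermined, so the residual is generally non-zero: the filters reproduce the HRTF base only approximately, which is why the number of channels affects rendition quality.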
The implementation of such a technique therefore requires the reintroduction of a delay corresponding to the ITD at the moment of encoding each sound source. Each source is therefore encoded twice (once for each ear). Document WO-00/19415 specifies that it is possible not to extract the delays but that the sound rendition quality would then be worse. In particular, the quality is better, even with fewer channels, if the delays are extracted.
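A minimal sketch of extracting a delay and reintroducing it later is given below. It works on a toy impulse response with an integer delay in samples and uses a circular shift through the discrete Fourier transform; the onset detector, the threshold and the function name `shift_by_phase` are assumptions made for illustration.

```python
import numpy as np

def shift_by_phase(h, delay_samples):
    """Circularly delay h by an integer number of samples with a frequency-domain phase ramp.
    A negative delay advances the response, i.e. removes a delay."""
    n = len(h)
    k = np.arange(n)
    phase = np.exp(-2j * np.pi * k * delay_samples / n)
    return np.real(np.fft.ifft(np.fft.fft(h) * phase))

# Toy HRIR: a short pulse that starts 5 samples late
hrir = np.zeros(16)
hrir[5:8] = [1.0, -0.5, 0.25]

# Crude onset detection (assumption): first sample whose magnitude exceeds a threshold
delay = int(np.argmax(np.abs(hrir) > 0.1))

hrir_aligned = shift_by_phase(hrir, -delay)            # delay extracted before filter design
hrir_restored = shift_by_phase(hrir_aligned, delay)    # delay reintroduced at encoding time
```

In such a scheme the reintroduced delay differs for the left and right ears, which is why each source has to be encoded once per ear.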
Additionally, a second approach, proposed in document U.S. Pat. No. 5,500,900, for jointly calculating the decoding filters and the spatial encoding functions, consists in decomposing the sets of HRIRs by performing a principal component analysis (PCA) and then selecting a reduced number of components (which corresponds to the number of channels).
An equivalent approach, proposed in U.S. Pat. No. 5,596,644, uses a singular value decomposition (SVD) instead. If the delays are extracted from the HRIRs before decomposition and then used at the moment of encoding, reconstruction of the HRIRs is very good with a reduced number of components.
When the delays are left in the original filters, the number of channels must be increased so as to obtain good quality reconstruction.
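The effect of the delays on the number of components can be illustrated with a toy experiment. The “HRIRs” below are synthetic (one prototype pulse with a position-dependent amplitude and delay), and the 99% energy criterion and the function name `components_needed` are assumptions; real HRIR sets behave less ideally, but the trend is the same.

```python
import numpy as np

def components_needed(responses, energy=0.99):
    """Smallest number of singular components capturing `energy` of the total squared norm."""
    s = np.linalg.svd(responses, compute_uv=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cumulative, energy) + 1)

prototype = np.array([1.0, -0.6, 0.3, -0.1])   # hypothetical pulse shape
P, T = 8, 32                                    # positions, samples per response
aligned = np.zeros((P, T))                      # delays extracted
delayed = np.zeros((P, T))                      # delays left in the responses
for p in range(P):
    amplitude = 1.0 - 0.05 * p
    aligned[p, :4] = amplitude * prototype
    delayed[p, 2 * p : 2 * p + 4] = amplitude * prototype

n_aligned = components_needed(aligned)
n_delayed = components_needed(delayed)
print(n_aligned, n_delayed)
```

In this toy case the aligned set is exactly rank one, whereas the delayed set spreads its energy over several components, so more channels are needed for the same reconstruction quality.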
Moreover, these prior art techniques do not make it possible to have universal spatial encoding functions. Specifically, the decomposition gives different spatial functions for each individual.
It should also be noted that multichannel binaural can be viewed as the simulation in binaural of a multichannel rendition on a plurality of loudspeakers (more than two). One then speaks of the so-called “virtual loudspeaker” procedure, even though, according to this approach, binaural reproduction is effected solely on two earpieces of a headset or on two remote loudspeakers. The principle of such reproduction consists in considering a configuration of loudspeakers distributed around the listener. During rendition on two real loudspeakers, intensity panning (or “pan pot”) laws are used to give the listener the sensation that sources are actually positioned in space, solely on the basis of two loudspeakers. One then speaks of “phantom sources”. Similar rules are used to define positions of virtual loudspeakers, which amounts to defining spatial encoding functions. The decoding filters correspond directly to the HRIR functions calculated at the positions of the virtual loudspeakers.
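As an illustration of how panning laws define spatial encoding functions, the sketch below computes the gains for a ring of virtual loudspeakers. The constant-power sine/cosine law between the two adjacent loudspeakers is only one common choice of pan-pot law and is an assumption here, as are the function name `pairwise_pan_gains` and the four-loudspeaker layout.

```python
import math

def pairwise_pan_gains(source_az_deg, speaker_az_deg):
    """Constant-power panning between the two virtual loudspeakers adjacent to the source.
    speaker_az_deg must hold at least two azimuths, sorted, in [0, 360)."""
    n = len(speaker_az_deg)
    gains = [0.0] * n
    az = source_az_deg % 360.0
    for i in range(n):
        a = speaker_az_deg[i]
        b = speaker_az_deg[(i + 1) % n]
        span = (b - a) % 360.0 or 360.0      # angular width of the sector from a to b
        offset = (az - a) % 360.0            # source position measured from speaker i
        if offset < span:
            x = offset / span                # 0 at speaker i, 1 at speaker i + 1
            gains[i] = math.cos(x * math.pi / 2)
            gains[(i + 1) % n] = math.sin(x * math.pi / 2)
            break
    return gains

speakers = [0.0, 90.0, 180.0, 270.0]         # hypothetical virtual loudspeaker ring
g = pairwise_pan_gains(45.0, speakers)
print(g)
```

Each gain plays the role of an encoding function for one channel; in the virtual loudspeaker procedure, the signal of channel i would then be filtered by the left and right HRIRs measured at the azimuth of loudspeaker i.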
For effective spatial rendition with a small number of channels, the prior art techniques require the extraction of the delays from the HRIRs. The techniques of sound pick-up or multichannel encoding at a point in space are widely used since it is then possible to subject the encoded signals to transformations (for example rotations). However, in the case where the signal to be decoded is a multichannel signal measured (or encoded) at a point, the delay information cannot be extracted from the signal alone. The decoding filters must then make it possible to reproduce the delays for optimal sound rendition. Moreover, in the case of recordings, the number of channels may be small and the prior art techniques do not allow good decoding with few channels without extracting the delays. For example, in the acquisition technique based on ambiophonic microphones, the acquired multichannel signal may typically consist of only four channels. The expression “ambiophonic microphones” is understood to mean microphones composed of coincident directional sensors. The interaural delays must then be reproduced on decoding.
More generally, the extraction of the delays exhibits at least two other major drawbacks:
- the delays must be taken into account (addition of a step) at the moment of encoding, thereby increasing the necessary computational resources,
- since the delays are taken into account at the moment of encoding, the signals must be encoded for each ear and the number of filterings necessary for the decoding is doubled.
The present invention aims to improve the situation.