The present invention concerns a method and device for localizing sound sources.
The invention may be applied in the field of Sound Source Localization (SSL) which aims at determining the directions of sound sources of interest such as speech, music, or environmental sounds.
Said directions are called Direction Of Arrival (DOA).
SSL methods operate on audio signals recorded over a given time duration by a set of microphones, or microphone array.
To determine the DOAs, SSL algorithms usually restrict the search to a given angle search window.
The window can be defined based on the framing of the visual field of view when the array is coupled to visual means, e.g. a camera.
In general, only direct sounds are used to localize sound sources through the estimation of differences in intensities and time delays between received signals at each microphone in a microphone array.
Direct sounds correspond to the acoustic waves emanating from the sources and impinging on the microphones through direct paths from sources to microphones.
When the sources are placed at a relatively large distance with respect to the dimensions of the array, the acoustic conditions are said to be far field.
In these conditions, only the time delay differences can be physically exploited.
These time delay differences, also known as Time Differences Of Arrival (TDOA), are usually expressed relative to a given reference microphone of the array.
The TDOA depend on the DOA of each source and on the geometry of the microphone array.
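To illustrate this dependency, the following sketch (not part of the invention; all names and the array geometry are hypothetical) computes far-field TDOAs for a candidate DOA (θ, φ), given the positions of the microphones. Under the far-field assumption the wavefront is planar, so each delay is the projection of a microphone position onto the propagation direction, divided by the speed of sound.

```python
# Illustrative sketch: far-field TDOAs for a hypothetical microphone array,
# computed from a candidate DOA (theta, phi). Assumed names throughout.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, at roughly 20 degrees C

def far_field_tdoas(mic_positions, theta, phi, reference=0):
    """TDOAs (seconds) relative to a reference microphone.

    mic_positions: (M, 3) array of microphone coordinates in metres.
    theta, phi: azimuth and elevation of the source (radians).
    """
    # Unit vector pointing from the array towards the source.
    u = np.array([np.cos(phi) * np.cos(theta),
                  np.cos(phi) * np.sin(theta),
                  np.sin(phi)])
    # Planar wavefront: delay at each microphone is the (negated)
    # projection of its position onto u, divided by the speed of sound.
    delays = -mic_positions @ u / SPEED_OF_SOUND
    return delays - delays[reference]

# Two microphones 20 cm apart on the x axis; at broadside (theta = 90 deg)
# the wavefront reaches both microphones simultaneously, so the TDOA is zero.
mics = np.array([[0.0, 0.0, 0.0], [0.2, 0.0, 0.0]])
broadside = far_field_tdoas(mics, np.pi / 2, 0.0)
```

At endfire (θ = 0) the same pair would instead exhibit the maximum possible TDOA of 0.2/343 s, which illustrates why small inter-microphone distances result in small TDOAs.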
The main issue for SSL methods is to cope with realistic acoustic conditions, including reverberation associated with multipath acoustic propagation, and background noise.
Most of the SSL methods in the art exploiting TDOA belong to the class of so-called angular spectrum methods.
An overview of said methods can be found in "Multi-source TDOA estimation in reverberant audio using angular spectra and clustering", C. Blandin, A. Ozerov, and E. Vincent, Signal Processing, Elsevier, 2012, 92, pp. 1950-1960.
In most SSL methods, the audio signal is captured by the microphone array, which is itself connected to a digital sound capture system including pre-amplification, analog to digital conversion and synchronization means.
The digital sound capture system thus provides a multichannel set of recorded digital audio signals sharing the same sampling clock.
The SSL methods operate by first transforming the recorded signals in the time domain into time-frequency representations.
Then, for each bin (i.e. each couple of a time interval and a frequency interval), a function of the candidate DOA is built from the observed signals, such that it is likely to exhibit large values at the true DOAs (θ,φ) of the sources and low values elsewhere.
Said function, which depends on both spatial direction dimensions and on time, is called the local angular spectrum, after which this class of angular spectrum methods is named.
Then, integrating or pooling the local angular spectrum across the time-frequency plane is performed, i.e. the angular spectrum function is reduced to a function of only spatial direction dimensions.
As far as frequencies are concerned, most methods sum the local angular spectrum values over frequencies.
As for the pooling over the time frames of the Discrete Fourier Transform processing, different pooling operations can be applied.
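The reduction described above can be sketched as follows (an illustrative sketch only; the function name, array shapes, and the choice of sum or max pooling over time are assumptions, not taken from any cited method):

```python
# Hedged sketch: pooling a local angular spectrum of shape
# (time frames, frequency bins, candidate directions) down to a global
# angular spectrum over directions only. Names and shapes are illustrative.
import numpy as np

def pool_angular_spectrum(local_spectrum, time_pooling="sum"):
    """Reduce a (T, F, D) local angular spectrum to a (D,) global one."""
    # Sum over frequencies, as most methods do.
    per_frame = local_spectrum.sum(axis=1)       # (T, D)
    # Pool over time frames; sum and max are two common choices.
    if time_pooling == "sum":
        return per_frame.sum(axis=0)             # (D,)
    if time_pooling == "max":
        return per_frame.max(axis=0)
    raise ValueError(f"unknown pooling: {time_pooling}")

# Example: 10 frames, 257 frequency bins, 72 candidate directions.
rng = np.random.default_rng(0)
spec = rng.random((10, 257, 72))
global_spectrum = pool_angular_spectrum(spec)
estimated_doa_index = int(np.argmax(global_spectrum))
```

The DOA estimates are then typically read off as the peaks of the resulting global angular spectrum.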
Calculating the local angular spectrum is the core step of SSL methods.
As described in the aforementioned paper by Blandin et al., the following main classes of local angular spectrum functions can be defined:
- Generalized Cross Correlation (GCC) functions, such as in the so-called SRP-PHAT method as described in "Robust localization in reverberant rooms", J. DiBiase, H. Silverman, and M. S. Brandstein, in Microphone Arrays: Signal Processing Techniques and Applications, pp. 131-154, Springer, 2001;
- variants of GCC-based functions defining a different frequency weighting at each frequency before integration over frequencies, as described in "Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering", J.-M. Valin, F. Michaud, and J. Rouat, Robotics and Autonomous Systems, 55(3), pp. 216-228, 2007;
- subspace functions, such as in the MUSIC method as described, for instance, in the review paper "Two decades of Array Signal Processing research: the parametric approach", H. Krim and M. Viberg, IEEE Signal Processing Magazine, pp. 67-94, July 1996; and
- beamforming functions, also described in the aforementioned review paper by Krim et al.
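By way of illustration, the first class can be sketched as below: a generic, textbook-style GCC-PHAT computation for one microphone pair (this is not the code of any cited paper; the function name and the regularization constant are assumptions). The cross-power spectrum is normalized by its magnitude (the PHAT weighting), which whitens the signals and improves robustness to reverberation; the peaks of the result indicate candidate TDOAs.

```python
# Illustrative sketch of a GCC-PHAT computation for one microphone pair,
# in the style used by SRP-PHAT-like methods. Generic textbook formulation.
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Return candidate delays (s) and the PHAT-weighted cross-correlation."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting (whitening)
    cc = np.fft.irfft(cross, n=n)
    # Re-centre so that lag 0 sits in the middle of the output.
    max_shift = n // 2 if max_tau is None else min(n // 2, int(max_tau * fs))
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    taus = np.arange(-max_shift, max_shift + 1) / fs
    return taus, cc

# A 3-sample delay between two noise signals appears as a peak in cc:
fs = 16000
rng = np.random.default_rng(1)
s = rng.standard_normal(1024)
taus, cc = gcc_phat(np.roll(s, 3), s, fs)
```

In an SRP-PHAT-style method, such pairwise correlations would be evaluated at the TDOAs predicted by each candidate direction and summed over all microphone pairs to form the local angular spectrum.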
As for beamforming functions, the traditional approach for SSL is to define the local angular spectrum function as the Steered Response Power (SRP) which estimates the power of the source in a given direction (θ,φ), θ and φ being the angular spherical coordinates of a sound source.
Blandin et al. propose not to consider the SRP but rather a measure of the Signal to Noise Ratio (SNR) of the audio source, defined by the ratio between the SRP of the source and the power of the noise, the power of the noise being defined as the difference between the total power and the SRP of the source.
Assuming the noise to be diffuse (isotropic), Blandin et al. further propose to define the local angular spectrum function as a weighted expression of the aforementioned SNR, i.e. the product of the SNR by a frequency-dependent function having a closed-form expression.
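The SRP-to-SNR conversion described above can be sketched as follows. This is a hedged, generic delay-and-sum formulation, not the code of Blandin et al.: the function name, array shapes, and regularization constant are assumptions, and the closed-form frequency weighting of the cited paper is deliberately omitted.

```python
# Hedged sketch: for each candidate direction, compare the steered response
# power (SRP) with the residual power (total power minus SRP), yielding an
# SNR-style local angular spectrum. Generic delay-and-sum formulation.
import numpy as np

def snr_spectrum(stft_frame, steering_delays, freqs):
    """SNR-style local angular spectrum for one time frame.

    stft_frame: (M, F) complex STFT bins of the M microphone signals.
    steering_delays: (D, M) candidate TDOAs (s) for D directions.
    freqs: (F,) frequency axis in Hz.
    Returns a (D, F) array of SNR values per direction and frequency.
    """
    M = stft_frame.shape[0]
    # Steering phases compensating the candidate delays: (D, M, F).
    phase = np.exp(2j * np.pi * freqs[None, None, :]
                   * steering_delays[:, :, None])
    # Delay-and-sum beamformer output power per direction and frequency.
    srp = np.abs((phase * stft_frame[None, :, :]).sum(axis=1)) ** 2 / M**2
    # Total power, averaged over microphones; noise = total - SRP.
    total = (np.abs(stft_frame) ** 2).mean(axis=0)      # (F,)
    noise = np.maximum(total[None, :] - srp, 1e-12)     # avoid division by 0
    return srp / noise
```

When the steering delays match the true TDOAs of a source, the microphone signals add coherently, the SRP approaches the total power, and the SNR value becomes large; for mismatched directions the residual power dominates and the SNR stays low.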
This is considered the best method published so far.
Although the aforementioned state-of-the-art methods perform reasonably well in some conditions, in particular in simulated conditions with uncorrelated noise, it turns out that in difficult real-world conditions including ambient noise, they can fail to provide correct and/or complete SSL results.
Examples of ambient noise include air conditioning, electric devices, traffic, wind, hubbub (sources of no specific interest), electromagnetic interferences, etc.
Such ambient noise is generally “structured” in the sense that its angular spectrum is neither flat (isotropic, diffuse case) nor random but features directional characteristics.
Such structured noise can mask the sources of interest in the angular spectrum and hence jeopardize their detection and localization.
Typically, speech sources recorded outdoors in environments including strong electronic noise created by electromagnetic interference are particularly difficult to localize using the aforementioned methods, since the electromagnetic noise masks the sources of interest, hence yielding inaccurate and/or false localization results.
More generally, the aforementioned SSL methods appear to be inaccurate and/or unreliable in any similar situation where sources of interest are placed within a sound environment comprising ambient noise sources that are close to the sources of interest.
The problem becomes even more difficult for compact microphone arrays embedded in portable devices, e.g. when the distance between microphones typically does not exceed 20 cm (resulting in small TDOAs), when sources of interest are distant from the array (resulting in low SNR), and when sources of interest are close to each other (requiring high angular resolution).
The main reason behind the aforementioned problems is that Blandin et al. only consider reverberation as noise, and further assume it to be isotropic, i.e. independent of direction.
Yet, in realistic environments, ambient noise can feature a very complex spatial covariance.
It is therefore preferable to account for the noise directly in the model rather than to rely on a theoretical model.
Generally, it is desirable to improve the performance of SSL methods in the aforementioned conditions.