Audio source localization (ASL) allows a system to locate a speaker using only the received sound signals. The locations of the speakers in a room can then be used, for example, in a speaker segmentation application. This information can also be used for signal enhancement: beamforming techniques, for instance, use the locations of the audio sources to enhance the signal of interest and attenuate interfering sounds. Several approaches have been proposed for ASL. However, robust estimation under high noise and reverberation remains a challenging problem.
Common approaches estimate the location of the sound directly from the time delay of arrival (TDOA) between pairs of microphones, or from the direction of arrival (DOA) of sound waves impinging on a microphone array, based on a propagation model of direct-path sound waves and the positions of the microphones. TDOA estimation is most often based on cross-correlation between pairs of microphones, the most popular technique being the Generalized Cross-Correlation with Phase Transform (GCC-PHAT), which estimates the TDOA from the phase differences between narrowband signals in the frequency domain. The GCC-PHAT method weights the phase differences in all frequency bins equally, which makes it sensitive to broadband noise. Non-uniform spectral weighting of the PHAT, based on the narrowband signal-to-noise ratio (SNR), lessens the contribution of frequencies with low narrowband SNR and provides robustness against noise. However, sub-optimal estimation of the narrowband SNR degrades the performance of the non-uniform PHAT weighting; for instance, in the presence of coherent broadband noise introduced by reverberation, it may produce a false TDOA. While several viable solutions exist for non-coherent noise reduction and SNR estimation, coherent noise reduction (de-reverberation) and coherent noise estimation are still challenging problems.
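
To make the GCC-PHAT idea concrete, the following is a minimal NumPy sketch of TDOA estimation for one microphone pair. It is an illustration under stated assumptions, not the implementation used in this work: the function name `gcc_phat`, the parameters `max_tau`, `weights`, and `eps`, and the sign convention (a positive delay means the second signal lags the first) are all hypothetical choices. The optional `weights` argument stands in for a non-uniform spectral weighting such as one derived from narrowband SNR; how those weights would be estimated is not shown.

```python
import numpy as np


def gcc_phat(x1, x2, fs, max_tau=None, weights=None, eps=1e-12):
    """Estimate the delay of x2 relative to x1, in seconds (illustrative sketch).

    A positive result means x2 lags x1. `weights`, if given, is a
    hypothetical per-bin spectral weighting of length n // 2 + 1.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)

    # Zero-pad so the circular correlation behaves like a linear one.
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)

    # Cross-power spectrum; its phase encodes the inter-microphone delay.
    cross = X2 * np.conj(X1)

    # PHAT: discard the magnitude so every bin contributes phase only.
    phat = cross / (np.abs(cross) + eps)

    # Optional non-uniform weighting (assumed to be supplied by the caller).
    if weights is not None:
        phat = phat * np.asarray(weights, dtype=float)

    # Back to the lag domain: the peak location is the TDOA estimate.
    cc = np.fft.irfft(phat, n=n)

    # Restrict the search to physically plausible lags if max_tau is given.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)

    # Negative indices wrap around, so cc[lags] reads the negative lags too.
    lags = np.arange(-max_shift, max_shift + 1)
    best_lag = lags[np.argmax(cc[lags])]
    return best_lag / fs
```

As a usage example, delaying a white-noise frame by five samples at a 16 kHz sampling rate and passing the original and delayed frames to `gcc_phat` should return a delay of about 5/16000 s. With `weights=None` every frequency bin is treated equally, which is the uniform behaviour described above; a weighting vector with small values in low-SNR bins would reduce their influence on the correlation peak.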