Electronic communication is becoming increasingly prevalent. For instance, automatic speech recognition and control, including speaker identification and verification, is commonly used in a variety of applications. Communication between different communication partners can be performed by means of microphones and loudspeakers in the context of communication systems, e.g., in-vehicle communication systems and hands-free telephone sets as well as audio/video conference systems. Speech signals detected by microphones, however, are often degraded by background noise that may or may not include speech signals of background speakers. High energy levels of background noise can cause the communication process to fail.
In the above applications, accurate localization of a speaker is often necessary or at least desirable for a reliable detection of a wanted signal and signal processing. In the context of video conferences it might be advantageous to automatically point a video camera to an actual speaker whose location can be estimated by means of microphone arrays.
In the art, speaker localization based on Generalized Cross Correlation (GCC) and speaker localization by adaptive filters are known. In both methods, two or more microphones are used, by which phase-shifted signal spectra are obtained. The phase shift is caused by the finite distance between the microphones.
Both methods aim to estimate the relative phasing of the microphones or the angle of incidence of detected speech in order to localize a speaker (for details see, e.g., G. Doblinger, “Localization and Tracking of Acoustical Sources”, in Topics in Acoustic Echo and Noise Control, pp. 91-122, Eds. E. Hänsler and G. Schmidt, Berlin, Germany, 2006; Y. Huang et al., “Microphone Arrays for Video Camera Steering”, in Acoustic Signal Processing for Telecommunication, pp. 239-259, S. Gay and J. Benesty (Eds.), Kluwer, Boston, Mass., USA, 2000; C. H. Knapp and G. C. Carter, “The generalized correlation method for estimation of time delay”, IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 24, no. 4, pp. 320-327, August, 1976). In the adaptive filtering approach, the basic idea is to filter one microphone signal so as to obtain a model of the other one. The appropriately adapted filter coefficients contain the information necessary for estimating the time delay between the two microphone signals and thus allow for an estimate of the angle of incidence of sound.
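By way of illustration only, the adaptive filtering approach described above can be sketched as follows. This is a minimal time-domain sketch, not the method of any of the cited references: the NLMS update rule, the function names `nlms_delay` and `angle_from_delay`, the far-field relation sin(theta) = c * tau / d, and all numerical values (sampling rate, microphone spacing, speed of sound, step size) are assumptions chosen for the example. A causal filter of this kind can only model non-negative delays of the second signal relative to the first.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def nlms_delay(x, d, num_taps=16, mu=0.5, eps=1e-8):
    """Adapt an FIR filter w so that w applied to x models d.

    Returns the index of the largest filter tap (the delay estimate in
    samples) and the adapted coefficients.
    """
    w = [0.0] * num_taps
    for n in range(num_taps - 1, len(x)):
        frame = [x[n - k] for k in range(num_taps)]  # x[n], x[n-1], ...
        y = sum(wk * xk for wk, xk in zip(w, frame))  # filter output
        e = d[n] - y                                  # modeling error
        norm = sum(xk * xk for xk in frame) + eps     # input power
        step = mu * e / norm
        w = [wk + step * xk for wk, xk in zip(w, frame)]
    peak = max(range(num_taps), key=lambda k: abs(w[k]))
    return peak, w


def angle_from_delay(delay_samples, fs, mic_distance):
    """Far-field angle of incidence in degrees (assumed plane-wave model)."""
    s = SPEED_OF_SOUND * delay_samples / (fs * mic_distance)
    s = max(-1.0, min(1.0, s))  # clip values outside the physical range
    return math.degrees(math.asin(s))


# Toy data: the second microphone signal is the first one delayed by 3 samples.
random.seed(0)
fs = 16000            # sampling rate in Hz (assumed)
true_delay = 3        # samples
x = [random.gauss(0.0, 1.0) for _ in range(2000)]
d = [0.0] * true_delay + x[:-true_delay]

est_delay, coeffs = nlms_delay(x, d)
angle = angle_from_delay(est_delay, fs, mic_distance=0.1)
print(est_delay, round(angle, 1))
```

In this sketch the position of the dominant coefficient carries the delay information; the remaining coefficients would model reflections in a real room.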
The GCC method is expensive in that it gives estimates for time delays between different microphone signals that include unphysical values. Moreover, a fixed discretization in time is necessary. Speaker localization by adaptive filters can be performed in the frequency domain in order to keep the processor load reasonably low. The filter is realized by sub-band filter functions and can be temporally adapted to account for time-dependent and/or frequency-dependent noise (signal-to-noise ratio).
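The two limitations of the GCC method noted above can be seen in a small sketch. The following is an illustrative example only, assuming the phase transform (PHAT) weighting as one common choice of GCC weighting; the function names `dft` and `gcc_phat_delay`, the naive O(N^2) transform, the circularly shifted toy signal, and all numerical values are assumptions made for the example. The delay estimate can only fall on the sampling grid, and the search range may contain lags larger than the microphone spacing physically permits.

```python
import cmath
import math
import random

SPEED_OF_SOUND = 343.0  # m/s, assumed


def dft(x, inverse=False):
    """Naive discrete Fourier transform, adequate for short toy signals."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat_delay(x1, x2, max_lag):
    """Integer lag (in samples) by which x2 lags x1, using PHAT weighting."""
    n = len(x1)
    X1, X2 = dft(x1), dft(x2)
    cross = [b * a.conjugate() for a, b in zip(X1, X2)]  # cross-spectrum
    weighted = [c / (abs(c) + 1e-12) for c in cross]     # keep phase only
    corr = [v.real for v in dft(weighted, inverse=True)]
    lags = range(-max_lag, max_lag + 1)                  # fixed time grid
    return max(lags, key=lambda lag: corr[lag % n])


# Toy data: x2 is x1 circularly shifted by 3 samples (keeps the example exact).
random.seed(1)
fs = 16000           # sampling rate in Hz (assumed)
mic_distance = 0.1   # metres (assumed)
true_delay = 3
x1 = [random.gauss(0.0, 1.0) for _ in range(128)]
x2 = x1[-true_delay:] + x1[:-true_delay]

est_lag = gcc_phat_delay(x1, x2, max_lag=8)
max_physical_lag = mic_distance * fs / SPEED_OF_SOUND  # largest plausible lag
print(est_lag, max_physical_lag)
```

With the assumed spacing and sampling rate, only lags of magnitude up to about 4.7 samples are physically plausible, whereas the search in this sketch covers lags up to 8 samples; estimates beyond the physical bound would have to be rejected or clipped separately.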
However, even processing in the frequency domain is time-consuming and requires relatively large memory capacities, since the scalar filter functions (factors) have to be realized by means of high-order Fast Fourier Transforms in order to guarantee a sufficiently realistic modeling of the impulse response. The corresponding Inverse Fast Fourier Transforms are expensive. In addition, it is necessary to analyze the entire impulse response, including late reflections that have to be taken into account for correct modeling of the impulse response but are of no use for speaker localization.
Therefore, an improved method for speaker localization by means of multiple microphones is still desirable.