A robot may be described as humanoid provided that it possesses certain attributes of the appearance and functionalities of a human: a head, a trunk, two arms, optionally two legs, etc. Generally, a humanoid robot is required to be able to interact with human beings as “naturally” as possible, by sensing the presence of a person, by understanding his language, by engaging him in conversation, etc. The ability to localize sound sources is very useful, or even necessary, to achieve this aim. Specifically, such an ability may allow a humanoid robot to determine the direction from which a sound is coming and to turn its head in that direction; if the sound was produced by a person, the robot may then activate a face recognition software package, optimally configure a voice recognition system, follow the movements of this person with “its gaze”, etc.
A number of methods and systems for spatially localizing sound sources are known in the prior art. These methods and systems are generally based on a plurality of microphones that are omnidirectional or only weakly directional, and on digital processing of the signals captured by said microphones.
The paper by J. DiBiase et al., “Robust localization in reverberant rooms”, in “Microphone Arrays: Signal Processing Techniques and Applications”, edited by M. S. Brandstein and D. B. Ward, Springer-Verlag, Berlin, Germany, 2001, describes three principal approaches to localizing a sound source.
A first approach uses spectral estimation techniques based on the correlation matrix of the signals captured by the microphones. Methods based on this approach tend to be sensitive to modelling errors and computationally very demanding. They are mainly suitable for narrow-band signals.
A second approach is based on the estimation of time shifts between the sound signals received by pairs of microphones (“Time Difference Of Arrival” or TDOA techniques). These estimates are used, together with knowledge of the positions of the microphones, to calculate hyperbolic curves, the intersection of which gives the position of the source. The time shifts may in particular be estimated by the PHAT-GCC (for “Phase Transform, Generalized Cross-Correlation”) method, which exploits the calculation of an intercorrelation (or cross-correlation) between signals previously “whitened” by filtering. The PHAT-GCC method is described in more detail in the paper by Ch. H. Knapp and G. C. Carter, “The Generalized Correlation Method for Estimation of Time Delay”, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-24, No. 4, August 1976, pp. 320-327. These methods are computationally light, but they are not robust to correlated noise originating from multiple sources and are subject to “false positives”. Furthermore, with the exception of the PHAT-GCC method, they are not very robust to reverberation.
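By way of illustration only, the estimation of a time shift by a cross-correlation with phase transform may be sketched as follows. The sketch is written in Python, uses only the standard library with a naive discrete Fourier transform for brevity, and assumes delays that are whole numbers of samples; the function names `dft` and `gcc_phat_delay` are hypothetical and are not taken from the cited papers.

```python
import cmath
import random


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (illustrative, not optimized)."""
    n = len(x)
    step = (1 if inverse else -1) * 2j * cmath.pi / n
    out = [sum(x[t] * cmath.exp(step * k * t) for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out


def gcc_phat_delay(sig_a, sig_b, max_lag):
    """Return the lag (in samples) by which sig_b is delayed relative to sig_a.

    A positive result means sig_b arrives later than sig_a.
    """
    n = len(sig_a) + len(sig_b)  # zero-pad so the correlation does not wrap around
    spec_a = dft(list(sig_a) + [0.0] * (n - len(sig_a)))
    spec_b = dft(list(sig_b) + [0.0] * (n - len(sig_b)))

    # Cross-spectrum, "whitened" by keeping only its phase (the PHAT weighting).
    whitened = []
    for a, b in zip(spec_a, spec_b):
        c = b * a.conjugate()
        mag = abs(c)
        whitened.append(c / mag if mag > 1e-12 else 0j)

    # Back to the lag domain; negative lags are stored at indices n + lag.
    corr = [v.real for v in dft(whitened, inverse=True)]
    return max(range(-max_lag, max_lag + 1), key=lambda lag: corr[lag % n])


if __name__ == "__main__":
    rng = random.Random(0)
    burst = [rng.gauss(0.0, 1.0) for _ in range(59)]
    mic_a = burst + [0.0] * 5   # reference microphone
    mic_b = [0.0] * 5 + burst   # same sound, arriving 5 samples later
    print(gcc_phat_delay(mic_a, mic_b, max_lag=10))
```

Dividing the returned lag by the sampling frequency gives the time difference of arrival for the pair of microphones. A practical implementation would typically rely on a fast Fourier transform and on interpolation to resolve delays finer than one sample.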
A third approach consists in synthesizing a steerable acoustic beam by summing the signals captured by the various microphones, after a variable time shift has been applied to each of them, and in identifying the orientation of the beam that maximizes the power of the composite signal thus obtained. Methods based on this approach tend not to be very robust to reverberation and noise, except for certain variants, which are however computationally very demanding.
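The principle of this third approach may be illustrated by the following minimal delay-and-sum sketch. It rests on assumptions that are not part of the cited paper: a linear array of microphones placed along one axis, a source in the far field, time shifts rounded to whole samples, and a speed of sound of 343 m/s. The names `steering_delays`, `beam_power` and `steer` are hypothetical.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def steering_delays(mic_x, angle_rad, fs):
    """Arrival delay, in samples, at each microphone of a line array.

    mic_x holds microphone positions in metres along the array axis;
    angle_rad is measured from the broadside direction (far-field model).
    """
    return [round(-x * math.sin(angle_rad) / SPEED_OF_SOUND * fs) for x in mic_x]


def beam_power(signals, delays):
    """Compensate each channel for its delay, sum the channels, return mean power."""
    n = len(signals[0])
    start = max(0, -min(delays))      # keep every shifted index inside [0, n)
    stop = n - max(0, max(delays))
    total = 0.0
    for t in range(start, stop):
        sample = sum(sig[t + d] for sig, d in zip(signals, delays))
        total += sample * sample
    return total / (stop - start)


def steer(signals, mic_x, fs, candidate_angles):
    """Return the candidate beam orientation that maximizes the output power."""
    return max(
        candidate_angles,
        key=lambda a: beam_power(signals, steering_delays(mic_x, a, fs)),
    )


if __name__ == "__main__":
    rng = random.Random(1)
    source = [rng.gauss(0.0, 1.0) for _ in range(600)]
    mic_x, fs = [0.0, 0.1, 0.2, 0.3], 16000
    true_delays = steering_delays(mic_x, math.radians(30), fs)
    # Each microphone records the same source, shifted by its own delay.
    signals = [[source[50 + t - d] for t in range(500)] for d in true_delays]
    candidates = [math.radians(a) for a in range(-90, 91, 10)]
    print(math.degrees(steer(signals, mic_x, fs, candidates)))
```

The search evaluates the beam power once per candidate orientation, which shows why the cost of this approach grows with the fineness of the angular grid.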
In particular, the paper describes a method combining the synthesis of a steerable acoustic beam with a generalized intercorrelation with phase transformation. This method is denoted SRP-PHAT (for “Steered Response Power, PHAse Transform”). Relative to the PHAT-GCC method, it is more robust to noise but more sensitive to reverberation.
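One common way of expressing such a combination, given here only as a hedged sketch and not as the exact formulation of the cited paper, is to sum, over all pairs of microphones, the phase-transformed cross-correlation evaluated at the time shift that each candidate orientation would produce. The sketch below assumes a linear array, a far-field source, whole-sample delays and a speed of sound of 343 m/s; the names `dft`, `phat_correlation` and `srp_phat` are hypothetical.

```python
import cmath
import math
import random
from itertools import combinations

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (illustrative, not optimized)."""
    n = len(x)
    step = (1 if inverse else -1) * 2j * cmath.pi / n
    out = [sum(x[t] * cmath.exp(step * k * t) for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out


def phat_correlation(spec_a, spec_b):
    """Phase-transformed cross-correlation; it peaks at the delay of b relative to a."""
    whitened = []
    for a, b in zip(spec_a, spec_b):
        c = b * a.conjugate()
        mag = abs(c)
        whitened.append(c / mag if mag > 1e-12 else 0j)
    return [v.real for v in dft(whitened, inverse=True)]


def srp_phat(signals, mic_x, fs, candidate_angles):
    """Return the candidate angle with the largest summed pairwise correlation."""
    n = 2 * len(signals[0])  # zero-pad so correlations do not wrap around
    spectra = [dft(list(s) + [0.0] * (n - len(s))) for s in signals]
    corr = {
        (i, j): phat_correlation(spectra[i], spectra[j])
        for i, j in combinations(range(len(signals)), 2)
    }

    def steered_power(angle):
        # Far-field arrival delay (in samples) at each microphone for this angle.
        d = [round(-x * math.sin(angle) / SPEED_OF_SOUND * fs) for x in mic_x]
        return sum(c[(d[j] - d[i]) % n] for (i, j), c in corr.items())

    return max(candidate_angles, key=steered_power)


if __name__ == "__main__":
    rng = random.Random(2)
    burst = [rng.gauss(0.0, 1.0) for _ in range(40)]
    mic_x, fs = [0.0, 0.1, 0.2], 16000
    delays = [round(-x * math.sin(math.radians(30)) / SPEED_OF_SOUND * fs) for x in mic_x]
    signals = [[0.0] * (10 + d) + burst + [0.0] * (14 - d) for d in delays]
    candidates = [math.radians(a) for a in range(-90, 91, 10)]
    print(math.degrees(srp_phat(signals, mic_x, fs, candidates)))
```

The pairwise correlations are computed once and then merely looked up for each candidate orientation, so the cost of the search is dominated by the number of microphone pairs and the size of the angular grid.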