Estimation of the position of a sound source (i.e., sound source localization) is disclosed, for example, in Jens Blauert, “Spatial Hearing: The Psychophysics of Human Sound Localization,” The MIT Press, USA-Cambridge Mass., 1996, 2nd edition, incorporated herein by reference in its entirety. As described in this reference, sound source localization depends heavily on a large number of internal and external parameters, many of which might change over time. Therefore, it is important for an autonomous sound source localization system to be able to recalibrate the mapping of auditory cue information to an estimated position of at least one sound source (that is, to perform online adaptation) during normal operation in the normal working environment of the autonomous system.
A number of approaches, including Kazuhiro Nakadai et al., “Real-time sound source localization and separation for robot audition,” Proceedings of the 2002 International Conference on Spoken Language Processing (ICSLP-2002), pp. 193-196, incorporated herein by reference in its entirety, localize a sound source and then direct attention towards the target that is the source of the sound. The basic concept in all such approaches is to measure one or more so-called localization cues and to map (or transform) these cues into an estimate of the location (angular position or azimuth angle).
One type of localization cue is the so-called binaural cue, which is based on a comparison of signals recorded by microphones separated by a distance. Binaural cues can be obtained from a pair of microphones as well as from an array of more than two microphones.
Binaural cue computation makes use of the fact that, for microphones at different spatial positions, a signal travels along slightly different paths to the two microphones. One well-known binaural cue is the Interaural Time Difference (ITD, also known as IPD). It measures the difference between the arrival times of the signal at the two microphones. A related cue is the Interaural Envelope Difference (IED), which is similar to the ITD. Both the ITD and the IED depend on the location of the sound source in the horizontal plane containing the two microphones, on the distance between the two microphones, and on the speed of sound. The presence of any obstacles between the microphones (e.g., a head of a robot) and the shapes of the obstacles have only a slight effect on the ITD.
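As an illustration only, the ITD can be estimated as the lag that maximizes the cross-correlation of the two microphone signals. The following minimal sketch assumes two equally long, synchronously sampled signals; the function name estimate_itd, its parameters, and the sign convention (positive when the right microphone receives the signal later) are assumptions made for this example and are not taken from the cited references.

```python
import math


def estimate_itd(left, right, sample_rate, max_lag):
    """Estimate the ITD in seconds by brute-force cross-correlation.

    The result is positive when the signal reaches the right microphone
    later than the left one (a convention assumed for this sketch).
    """
    best_lag, best_score = 0, float("-inf")
    n = len(left)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / sample_rate


# Synthetic example: a short decaying burst that reaches the right
# microphone 5 samples after the left one, sampled at 16 kHz.
left = [0.0] * 64
for k in range(10):
    left[20 + k] = math.sin(2 * math.pi * k / 10) * (1 - k / 10)
right = [0.0] * 5 + left[:-5]
print(estimate_itd(left, right, sample_rate=16000, max_lag=10))
```

The exhaustive lag search is chosen for clarity; a practical system would more likely use an FFT-based correlation and restrict max_lag to the largest delay that the microphone spacing permits.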
The second major cue for sound source localization is the Interaural Intensity Difference (IID, also termed ILD). This cue is based on a comparison of the signal intensities at the two microphones. Any obstacles between the microphones also affect the signal intensities, and the effect depends on the location of the sound source.
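For illustration, one common way to express the IID is as the ratio of the signal energies at the two microphones in decibels. The sketch below assumes this definition; the function name estimate_iid_db and the small constant eps, which guards against a logarithm of zero on silent input, are assumptions made for this example.

```python
import math


def estimate_iid_db(left, right, eps=1e-12):
    """Return the left/right energy ratio in decibels.

    A positive value means the left microphone received more energy.
    """
    energy_left = sum(x * x for x in left)
    energy_right = sum(x * x for x in right)
    return 10.0 * math.log10((energy_left + eps) / (energy_right + eps))


# Synthetic example: the right signal has half the amplitude of the left.
left = [1.0, -1.0] * 50
right = [0.5, -0.5] * 50
print(round(estimate_iid_db(left, right), 2))  # about 6.02 dB
```

Halving the amplitude reduces the energy by a factor of four, which corresponds to roughly 6 dB.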
While the dependency of the ITD on the location of the sound source can at least be approximated based on known environmental conditions and the distance between the microphones, the IID depends on the shape, material and density of any obstacles present between the microphones. It is therefore very difficult to compute the IID as a function of the location of the sound source. In addition to these basic dependencies, a number of additional factors might affect computation of the cues: different levels of signal amplification at the two microphones, non-synchronized recording of the left and right microphones, the types and exact locations of the microphones, the types and presence of auditory pre-processing stages, the particular method used in computing the cues, etc.
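One widely used approximation of the ITD, given here only as a sketch, is the free-field, far-field model ITD = d*sin(theta)/c, where d is the distance between the microphones, theta is the azimuth angle measured from the direction straight ahead, and c is the speed of sound. This model, the value of 343 m/s for c, and the function names below are assumptions made for this example; the model ignores any obstacle between the microphones.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C (assumed value)


def itd_from_azimuth(azimuth_deg, mic_distance, c=SPEED_OF_SOUND):
    """Far-field ITD in seconds for a source at the given azimuth."""
    return mic_distance * math.sin(math.radians(azimuth_deg)) / c


def azimuth_from_itd(itd, mic_distance, c=SPEED_OF_SOUND):
    """Invert the far-field model; the ratio is clamped to [-1, 1]
    so that noisy ITD measurements do not raise a math domain error."""
    ratio = max(-1.0, min(1.0, itd * c / mic_distance))
    return math.degrees(math.asin(ratio))


# Example: microphones 0.2 m apart, source at 30 degrees azimuth.
itd = itd_from_azimuth(30.0, mic_distance=0.2)
print(itd, azimuth_from_itd(itd, mic_distance=0.2))
```

No comparably simple closed form is available for the IID, which is consistent with the difficulty noted above.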
The analog part of the recording equipment is especially prone to changes due to temperature drift and variable operating time (system warm-up). Another important factor is the room characteristics (e.g., echoes), which might also strongly affect the estimated location of the sound source.
Therefore, it is standard practice to calibrate the system in advance in a dedicated setup to learn the relation between the location of the sound source and the IED/ITD/IID cue values. These calibration measurements are generally time-consuming and require substantial effort to execute. Further, the calibration has to be repeated whenever a parameter of the system hardware changes, for example, when the microphones are mounted onto a different head of a robotic device, when new recording hardware is used, or when the amplification factors are modified. Changes made to the software require at least part of the calibration procedure to be repeated. Any of these changes thus entails a new, lengthy calibration procedure.
It is therefore advisable to allow the system to learn the relation between the cues and the location of the sound source continuously in an unsupervised manner. However, state-of-the-art approaches for learning the relation between cue values and position require either special test scenarios (that is, bringing the system into a defined environment and running a dedicated calibration procedure) or location information from additional sensors. These additional sensors so far work only under very constrained conditions.
European Patent Application No. 1 586 421 A1 discloses a system for sensory-motor learning in a visual learning system which has the basic characteristics for online adaptation necessary for a truly autonomous system. However, an important prerequisite to adaptation is to obtain information about the true location of the sound source. If the true relative location p of the sound source is known for measured cue values C, learning is trivial. Using a mapping function T, the following equation applies: T(C) = p. For consecutive measurements of the same cue value C, the mapping is updated according to the following equation: T(C, t+1) = T(C, t) + alpha*(p(t) − T(C, t)), where t represents the time step and alpha represents a learning parameter (0 < alpha < 1). Alternatively, the mapping function may be of the type T(p) = C as shown in FIG. 3. T may be implemented, for example, by a look-up table.
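The update of the mapping T for a repeatedly measured cue value can be sketched as follows, assuming that T is implemented as a look-up table. The class name CueMap, the quantization of the cue value into bins of width bin_width, and the initial estimate of 0.0 are assumptions made for this example; only the update rule itself is taken from the description above.

```python
class CueMap:
    """Look-up table T that maps a quantized cue value C to a position p."""

    def __init__(self, alpha=0.2, bin_width=1e-5, initial=0.0):
        if not 0.0 < alpha < 1.0:
            raise ValueError("alpha must satisfy 0 < alpha < 1")
        self.alpha = alpha
        self.bin_width = bin_width  # quantization step (assumed)
        self.initial = initial      # estimate for unseen cues (assumed)
        self.table = {}

    def _key(self, cue):
        # Quantize the cue so that nearby values share one table entry.
        return round(cue / self.bin_width)

    def estimate(self, cue):
        """Return the current estimate T(C, t)."""
        return self.table.get(self._key(cue), self.initial)

    def update(self, cue, true_position):
        """Apply T(C, t+1) = T(C, t) + alpha*(p(t) - T(C, t))."""
        old = self.estimate(cue)
        new = old + self.alpha * (true_position - old)
        self.table[self._key(cue)] = new
        return new


# Example: the same cue value is observed repeatedly while the true
# position is 40 degrees; the estimate moves towards 40 at each step.
cue_map = CueMap(alpha=0.5)
for _ in range(3):
    print(cue_map.update(3e-4, true_position=40.0))
```

With alpha = 0.5, each update halves the remaining error, so the three printed estimates are 20.0, 30.0 and 35.0. A smaller alpha adapts more slowly but averages out noise in p(t) more strongly.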
Hiromichi Nakashima et al., “Self-Organization of a Sound Source Localization Robot by Perceptual Cycle,” 9th International Conference on Neural Information Processing (ICONIP'02), 2002, discloses an auto-calibration procedure that uses cameras to visually identify the sound source and measure its location. This approach cannot be used for online adaptation because it is not easy to visually identify a sound source. In the example described in this article, a red mark was placed on the speaker box to identify the source. However, this requires that no other red objects be present in the environment.
What is needed is a method and system for detecting the location of a sound source using only auditory inputs. There is also a need for a method and system for reducing the hardware cost and the constraints involved in detecting the location of the sound source. Further, there is a need for a method and system for calibrating the sound source localization in realistic settings that provides robust performance for a prolonged period of time. Furthermore, there is a need for a method and system providing continuous adaptation that can learn the mapping from cues to estimated location better than standard calibration procedures.