One of the primary ways that we separate sounds is to locate them in space. A sound from a fixed sound source arriving at two detectors (e.g. the ears, or two microphones) causes the two measured signals to be displaced in time relative to each other, due to a difference in transmission time. To a first approximation, this can be thought of as a difference in the straight-line paths from the sound source to the detectors. The time displacement is called the ITD (Interaural Time Difference) and can be used to extract information about the azimuthal location of the sound source.
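As a minimal sketch of this idea, the ITD can be estimated from the peak of the cross-correlation between the two signals and converted to an azimuth with the far-field approximation ITD = d·sin(θ)/c. The sensor spacing, sampling rate, and test signal below are illustrative assumptions, not values from the original system:

```python
import numpy as np

def estimate_itd(left, right, fs):
    """Estimate the interaural time difference (seconds) from the peak
    of the cross-correlation.  Positive ITD: the right signal lags,
    i.e. the sound reached the left sensor first."""
    corr = np.correlate(right, left, mode="full")
    lag = np.argmax(corr) - (len(left) - 1)
    return lag / fs

# Toy setup: a noise burst that arrives 5 samples later at the right sensor.
fs = 44100
src = np.random.default_rng(0).standard_normal(1024)
delay = 5
left = src
right = np.concatenate([np.zeros(delay), src[:-delay]])

itd = estimate_itd(left, right, fs)

# Far-field approximation ITD = d*sin(theta)/c, with assumed geometry.
d, c = 0.18, 343.0  # sensor spacing (m) and speed of sound (m/s), assumed
theta = np.degrees(np.arcsin(np.clip(itd * c / d, -1.0, 1.0)))
```

The sign convention (which side counts as positive azimuth) depends on the recording geometry; only the magnitude of θ follows directly from the path-difference argument.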
In addition, incident sound waves are usually diffracted and damped by the configuration (3D shape, material) of the recording devices, e.g. a robot head. This causes a significant difference in the signal levels at the two ears. This so-called ILD/IID (Interaural Level Difference/Interaural Intensity Difference) is frequency dependent. For example, at low frequencies there is hardly any sound pressure difference at the two ears. At high frequencies, however, where the wavelengths of the sound become short relative to the head diameter, there may be considerable differences, e.g. due to the head shadow effect. These differences vary systematically with the position of the sound source and can be used to gain information about its location.
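The frequency dependence can be made concrete by measuring the level difference per frequency band. In the sketch below, the head shadow is crudely simulated by attenuating the right signal above 2 kHz; the band edges and attenuation factor are illustrative assumptions:

```python
import numpy as np

def band_ild_db(left, right, fs, bands):
    """Interaural level difference (dB, left over right) per frequency
    band, computed from FFT energies.  Band edges are given in Hz."""
    L = np.abs(np.fft.rfft(left)) ** 2
    R = np.abs(np.fft.rfft(right)) ** 2
    freqs = np.fft.rfftfreq(len(left), 1.0 / fs)
    out = []
    for lo, hi in bands:
        m = (freqs >= lo) & (freqs < hi)
        out.append(10 * np.log10(L[m].sum() / R[m].sum()))
    return out

# Toy head shadow: halve all spectral components above 2 kHz on the right.
fs = 16000
left = np.random.default_rng(1).standard_normal(4096)
spec = np.fft.rfft(left)
spec[np.fft.rfftfreq(len(left), 1.0 / fs) > 2000] *= 0.5
right = np.fft.irfft(spec, n=len(left))

ild_low, ild_high = band_ild_db(left, right, fs, [(100, 1000), (3000, 6000)])
# ild_low is near 0 dB; ild_high is near 6 dB, mirroring the head shadow.
```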
ITD and ILD cues work in complementary ways: the accuracy of each varies independently with the frequency range and the azimuthal location of the sound source. For non-preprocessed signals, ambiguities in the ITD occur at high frequencies, since the shift may span several possible cycles. Incorporating ILD cues resolves this ambiguity, because they provide reliable level differences at precisely these high frequencies. The contribution of ITD cues to sound source localization is largest for frontally arriving signals and degrades as the source moves to the side, because the path difference depends nonlinearly on the angle of incidence. Conversely, ILD cues are most accurate to the sides, where one recording device is maximally damped and the other minimally damped, and less accurate in the frontal area, where the damping differences are small.
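One way to sketch the combination is to enumerate the cycle-ambiguous ITD candidates at a high frequency and let the ILD pick among them. The linear ILD-to-ITD mapping (`ild_slope`), the geometry, and the numbers below are toy assumptions purely for illustration:

```python
import numpy as np

def resolve_itd(phase_diff, freq, ild_db, d=0.18, c=343.0, ild_slope=0.02):
    """At a high frequency `freq`, the measured interaural phase
    difference determines the ITD only modulo one period (1/freq).
    Enumerate the physically possible candidates and pick the one most
    consistent with the ILD.  `ild_slope` (dB per microsecond of ITD)
    is a toy stand-in for a real ILD model."""
    base = phase_diff / (2 * np.pi * freq)        # one candidate ITD (s)
    max_itd = d / c                               # physical upper bound
    cands = base + np.arange(-5, 6) / freq        # all cycle shifts
    cands = cands[np.abs(cands) <= max_itd]
    ild_guess = (ild_db / ild_slope) * 1e-6       # rough ITD implied by ILD
    return cands[int(np.argmin(np.abs(cands - ild_guess)))]

# A true ITD of 400 us at 2 kHz wraps: the phase alone also admits -100 us.
true_itd = 400e-6
f = 2000.0
phase = (2 * np.pi * f * true_itd + np.pi) % (2 * np.pi) - np.pi  # wrapped
itd = resolve_itd(phase, f, ild_db=8.0)
```

Without the ILD term, both remaining candidates (−100 µs and 400 µs) would be equally plausible; the level cue breaks the tie.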
Conventional sound source localization methods include ITD calculations that operate on each frequency channel separately via delay lines (the Jeffress model), or that compare different frequency channels by systematically shifting them against each other (the stereausis model). ILD and monaural cues are explicitly modeled with head-related transfer functions (HRTFs), i.e. the location-dependent spectral filtering of the sound caused by the shape and material of the outer ear or microphone.
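The delay-line idea can be sketched for a single frequency channel: the two inputs are slid against each other over a bank of internal delays, and the delay with the strongest coincidence (summed pointwise product) compensates the external ITD. A real Jeffress-style model would run this per cochlear channel; the signals and parameters here are assumptions:

```python
import numpy as np

def jeffress_itd(left, right, fs, max_delay=20):
    """Coincidence detection over a bank of internal delays (in
    samples).  A negative result means the right input lags the left,
    i.e. the source is on the left under this convention."""
    delays = np.arange(-max_delay, max_delay + 1)
    activity = []
    for d in delays:
        if d >= 0:  # shift left forward by d samples
            activity.append(np.dot(left[d:], right[:len(right) - d]))
        else:       # shift right forward by |d| samples
            activity.append(np.dot(left[:d], right[-d:]))
    return delays[int(np.argmax(activity))] / fs

# Toy channel input: the right signal lags the left by 7 samples.
fs = 44100
src = np.random.default_rng(0).standard_normal(2048)
left = src
right = np.concatenate([np.zeros(7), src[:-7]])
est = jeffress_itd(left, right, fs)
```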
In conventional methods, three problems concerning azimuthal sound source localization remain. First, it is usually necessary to know in advance which delay (time shift) corresponds to which azimuthal orientation, in order to pick the right representative vector for a particular orientation. Second, for reasons of adaptivity it is desirable to bypass explicit models of ITD/ILD generation; instead, these should be easily learnable. Third, it is unclear how to combine ITD and ILD information, which is highly frequency dependent: ITD and ILD are often computed using conceptually different procedures, which makes the two measurements nontrivial to compare.
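To illustrate the second point, the delay-to-azimuth mapping can in principle be learned from calibration data rather than derived from an explicit geometric model. The sketch below simulates the calibration "measurements" from the far-field formula plus noise; a real system would instead record a reference sound from known directions. All names and numbers are hypothetical:

```python
import numpy as np

# Hypothetical calibration: known azimuths paired with measured ITDs.
d, c = 0.18, 343.0                               # assumed geometry
azimuths = np.linspace(-80.0, 80.0, 17)          # known angles (deg)
itds = d / c * np.sin(np.radians(azimuths))      # simulated measurements
itds = itds + np.random.default_rng(2).normal(0.0, 2e-6, itds.size)

# Learn the ITD -> azimuth mapping by polynomial regression; at run time
# no explicit model of ITD generation is needed, only the fitted curve.
coeffs = np.polyfit(itds, azimuths, deg=3)
itd_to_azimuth = np.poly1d(coeffs)
```

The same learn-from-examples approach extends to ILD cues, which is one route toward comparable, jointly usable representations of the two measurements.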