1. Technical Field
The invention is related to microphone array-based sound source localization (SSL), and more particularly to a system and process for estimating the location of a speaker anywhere in a full 360 degree sweep from signals output by a single microphone array characterized by two or more pairs of audio sensors using an improved time-delay-of-arrival based SSL technique.
2. Background Art
Microphone arrays have been a rapidly emerging technology since the mid-1980s and became a very active research topic in the early 1990s [Bra96]. These arrays have many applications including, for example, video conferencing. In a video conferencing setting, the microphone array is often used for intelligent camera management, where sound source localization (SSL) techniques are used to determine where to point a camera, or to decide which camera in an array of cameras to activate, in order to focus on the current speaker. Intelligent camera management via SSL can also be applied to larger venues, such as a lecture hall where a camera can point to the audience member who is asking a question. Microphone arrays and SSL can also be used in video surveillance to identify where in a monitored space a person is located. Further, speech recognition systems can employ SSL to pinpoint the location of the speaker so as to restrict the recognition process to sound coming from that direction. Microphone arrays and SSL can also be utilized for speaker identification. In this context, the location of a speaker as discerned via SSL techniques is correlated to an identity of the speaker.
In most video conferencing projects and papers, a video capture device is controlled by the output of the SSL. The video capture device can be either a controllable pan/tilt/zoom camera [Kle00, Zot99, Hua00] or an omni-directional camera. In either case, the output of the SSL can guide the conferencing system to focus on the person of interest (e.g., the person who is talking).
In general there are three techniques for SSL, i.e., steered-beamformer-based, high-resolution spectral-estimation-based, and time-delay-of-arrival (TDOA) based techniques [Bra96]. The steered-beamformer-based technique steers the array to various locations and searches for a peak in output power. This technique can be traced back to the early 1970s. Its two major shortcomings are that it can easily become stuck in a local maximum and that it exhibits a high computational cost. The high-resolution spectral-estimation-based technique, representing the second category, uses a spatial-spectral correlation matrix derived from the signals received at the microphone array sensors. Specifically, it is designed for far-field plane waves projecting onto a linear array. In addition, it is more suited for narrowband signals, because while it can be extended to wideband signals such as human speech, the amount of computation required increases significantly. The third category, involving the aforementioned TDOA-based SSL technique, is somewhat different from the first two, since the measure in question is not the acoustic data received by the microphone array sensors, but rather the time delays between pairs of sensors. This last technique is currently considered the best approach to SSL.
TDOA-based approaches involve two general phases, namely a time delay estimation (TDE) phase and a location phase. Within the TDE phase, of the various current TDOA approaches, the generalized cross-correlation (GCC) approach receives the most research attention and is the most successful [Wan97]. Let s(n) be the source signal, and x1(n) and x2(n) be the signals received by two microphones of the microphone array. Then:

$$
\begin{aligned}
x_1(n) &= a\,s(n-D) + h_1(n) * s(n) + n_1(n) \\
x_2(n) &= b\,s(n) + h_2(n) * s(n) + n_2(n)
\end{aligned}
\tag{1}
$$

where D is the TDOA, a and b are signal attenuations, n1(n) and n2(n) are the additive noise, and h1(n) and h2(n) represent the reverberations. Assuming the signal and noise are uncorrelated, D can be estimated by finding the maximum GCC between x1(n) and x2(n) as follows:
$$
D = \arg\max_{\tau}\, \hat{R}_{x_1 x_2}(\tau), \qquad
\hat{R}_{x_1 x_2}(\tau) = \frac{1}{2\pi} \int_{-\pi}^{\pi} W(\omega)\, G_{x_1 x_2}(\omega)\, e^{j\omega\tau}\, d\omega
\tag{2}
$$

where $\hat{R}_{x_1 x_2}(\tau)$ is the cross-correlation of x1(n) and x2(n), $G_{x_1 x_2}(\omega)$ is the Fourier transform of $\hat{R}_{x_1 x_2}(\tau)$, i.e., the cross power spectrum, and $W(\omega)$ is the weighting function.
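The delay estimate of Eq. (2) can be illustrated with a minimal discrete-time sketch. It is not the implementation described in this specification: the function names `dft` and `gcc_delay`, the naive O(N^2) transform, the restriction to integer lags, and the test signal (white noise, attenuation 0.8, no reverberation, i.e., h1(n) = h2(n) = 0) are all assumptions made for illustration. The integral over ω is replaced by a sum over DFT bins, and W(ω) defaults to 1, which reduces the GCC to the ordinary cross-correlation.

```python
import cmath
import random


def dft(x, n_fft):
    """Naive O(N^2) DFT of x, implicitly zero-padded to n_fft samples."""
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * n / n_fft) for n, v in enumerate(x))
        for k in range(n_fft)
    ]


def gcc_delay(x1, x2, max_lag, weights=None):
    """Return the integer lag in [-max_lag, max_lag] that maximizes the GCC."""
    # Zero-pad to twice the signal length so circular correlation equals
    # linear correlation for all lags considered (requires max_lag < length).
    n_fft = 2 * max(len(x1), len(x2))
    X1, X2 = dft(x1, n_fft), dft(x2, n_fft)
    # Cross power spectrum; a delay D in x1 shows up as the phase exp(-j*omega*D).
    G = [a * b.conjugate() for a, b in zip(X1, X2)]
    # Assumption: W(omega) = 1 unless per-bin weights are supplied.
    W = weights if weights is not None else [1.0] * n_fft

    def r_hat(tau):
        # Discrete counterpart of (1/2pi) * integral of W * G * exp(j*omega*tau).
        total = sum(
            w * g * cmath.exp(2j * cmath.pi * k * tau / n_fft)
            for k, (w, g) in enumerate(zip(W, G))
        )
        return total.real / n_fft

    return max(range(-max_lag, max_lag + 1), key=r_hat)


# Hypothetical test signal following Eq. (1) with h1 = h2 = 0.
random.seed(0)
true_delay = 5
s = [random.gauss(0.0, 1.0) for _ in range(64)]
x2 = [s[n] + 0.05 * random.gauss(0.0, 1.0) for n in range(64)]
x1 = [
    (0.8 * s[n - true_delay] if n >= true_delay else 0.0)
    + 0.05 * random.gauss(0.0, 1.0)
    for n in range(64)
]
estimated = gcc_delay(x1, x2, max_lag=10)
```

With this sign convention a positive lag means the signal reaches the first microphone later than the second, matching the role of D in Eq. (1); swapping the two inputs flips the sign of the estimate.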
In practice, choosing the right weighting function is of great significance for achieving accurate and robust time delay estimation. As can be seen from Eq. (1), there are two types of noise in the system, i.e., the background noise n1(n) and n2(n) and the reverberations h1(n) and h2(n). Previous research suggests that a maximum likelihood (ML) weighting function is robust to background noise and a phase transformation (PHAT) weighting function is better at dealing with reverberations [Bra99]:
$$
W_{ML}(\omega) = \frac{1}{\|N(\omega)\|^{2}}, \qquad
W_{PHAT}(\omega) = \frac{1}{\|G_{x_1 x_2}(\omega)\|^{2}}
\tag{3}
$$

where $\|N(\omega)\|^{2}$ is the noise power spectrum.
In comparing the ML approach to the PHAT approach, it is noted that both have pros and cons. Generally, ML is robust to noise, but degrades quickly in environments with reverberation. On the other hand, PHAT is relatively robust in reverberant, multi-path environments, but performs poorly in a noisy environment.
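The two weighting functions of Eq. (3) can be sketched per frequency bin as follows. This is an illustrative sketch only: the function names, the `eps` floor that guards against division by zero, the `power` parameter, and the sample values are assumptions rather than part of the described technique. The default `power=2` follows Eq. (3) as printed, while `power=1` gives the magnitude-normalized form of PHAT often seen in the GCC literature, which leaves only the phase of each bin.

```python
def ml_weights(noise_power, eps=1e-12):
    """W_ML per bin: reciprocal of the noise power spectrum |N(omega)|^2."""
    return [1.0 / (p + eps) for p in noise_power]


def phat_weights(cross_spectrum, power=2, eps=1e-12):
    """W_PHAT per bin: reciprocal of |G_x1x2(omega)| raised to `power`."""
    return [1.0 / (abs(g) ** power + eps) for g in cross_spectrum]


# Made-up values for three frequency bins (assumptions for illustration).
G = [3 + 4j, 0.5j, -2.0 + 0j]      # cross power spectrum samples
noise_power = [4.0, 0.25, 1.0]     # noise power spectrum samples

w_ml = ml_weights(noise_power)             # noisy bins are weighted down
w_phat = phat_weights(G)                   # Eq. (3) as printed
w_phat_mag = phat_weights(G, power=1)      # magnitude-normalized variant

# With power=1 the weighted cross spectrum has unit magnitude in every bin,
# so only the phase (which carries the delay) contributes to the GCC peak.
whitened = [w * g for w, g in zip(w_phat_mag, G)]
```

Both weightings are real and positive, so they rescale each bin without altering its phase. The ML weights depend on how well the noise spectrum is known, whereas the PHAT weights depend only on the observed cross spectrum, which is consistent with the trade-off noted above.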
It is noted that in the preceding paragraphs, as well as in the remainder of this specification, the description refers to various individual publications identified by an alphanumeric designator contained within a pair of brackets. A listing of references including the publications corresponding to each designator can be found at the end of the Detailed Description section.