Estimating the position of an acoustic source using a microphone array is an active area of research with a number of practical applications. These include human-robot interaction, speech acquisition, determining the direction or location of a user in video conferencing applications, and indoor and outdoor localization. Due to factors such as environmental noise and reverberation, sound source localization remains a challenging problem.
Two popular approaches to acoustic source localization are time difference of arrival (TDoA) and steered response power (SRP), as discussed in Rainer Martin et al., “Advances in digital speech transmission”, John Wiley & Sons, 2008. TDoA-based techniques estimate the difference in arrival time of a signal emitted by an acoustic source at spatially separated microphone pairs. This is usually done by estimating the cross-correlation between the signals of each pair. The source position is then calculated from the TDoA estimates and the array geometry, usually as the intersection of multiple hyperbolas. This approach has the advantage of not requiring synchronization between the source and the microphone array.
The second approach, steered response power, virtually steers the microphone array to candidate locations for the acoustic source on a pre-defined grid, as disclosed in Nilesh Madhu et al., “Acoustic source localization with microphone arrays”, Advances in Digital Speech Transmission, pages 135-170, 2008. It relies on the cross-correlation between the signals arriving at the different microphone pairs in the array. Specifically, the technique searches for the peak of the output power by analyzing the spatio-spectral correlation matrix derived from the signals that arrive at the different microphone pairs from the source. The location with the highest output power is taken as the estimated source location.
To enhance the estimation of the cross-correlation, different weighting functions can be used to generalize the cross-correlation calculation, such as the Roth, SCOT, PHAT, ML, and Eckart weightings, as described in Yiteng Arden Huang et al., “Audio signal processing for next-generation multimedia communication systems”, Springer Science & Business Media, 2007; Charles Knapp and G. Clifford Carter, “The generalized correlation method for estimation of time delay”, IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(4):320-327, 1976; Byoungho Kwon et al., “Analysis of the gcc-phat technique for multiple sources”, ICCAS 2010, pages 2070-2073, IEEE, 2010; Hong Liu and Miao Shen, “Continuous sound source localization based on microphone array for mobile robots”, 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4332-4339, IEEE, 2010; Patrick Marmaroli et al., “A comparative study of time delay estimation techniques for road vehicle tracking”, Acoustics 2012; and Bert Van Den Broeck et al., “Time-domain gcc-phat sound source localization for small microphone arrays” (2016). These functions sharpen the cross-correlation peak and can lead to more accurate results. Similarly, to obtain the SRP, a grid of candidate points is examined for the possible source location. This grid is usually rectangular.
Generalized Cross Correlation
Given that the location of the sound source is unknown, the TDoA must be estimated from the received signals. The cross-correlation (CC) approach is one of the most popular ways to do so. The cross-correlation is computed between two signals, where one signal $x_1$ of size N is a delayed (and otherwise similar) version of the other signal $x_2$, shifted by a time $\tau$. The lag of the highest peak of the cross-correlation corresponds to $\tau$.
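As a minimal sketch of this idea, assuming NumPy and a synthetic white-noise signal (the function name `estimate_delay_cc`, the signal length, and the noise level are illustrative choices, not taken from the cited works), the delay can be read off as the lag of the cross-correlation peak:

```python
import numpy as np

def estimate_delay_cc(x1, x2):
    """Delay of x1 relative to x2, in samples, from the cross-correlation peak."""
    cc = np.correlate(x1, x2, mode="full")
    # In "full" mode the lags run from -(len(x2) - 1) to len(x1) - 1.
    lags = np.arange(-(len(x2) - 1), len(x1))
    return int(lags[np.argmax(cc)])

rng = np.random.default_rng(0)
N = 1024
true_delay = 25  # samples (assumed for the demonstration)

x2 = rng.standard_normal(N)
# x1 is x2 delayed by true_delay samples, plus a small amount of noise.
x1 = np.concatenate((np.zeros(true_delay), x2[:-true_delay]))
x1 = x1 + 0.05 * rng.standard_normal(N)

delay = estimate_delay_cc(x1, x2)
print(delay)
```

The time-domain `np.correlate` call costs on the order of N squared operations, which is one reason the frequency-domain formulation is usually preferred for longer frames.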
In real environments, many factors, including noise and reverberation, affect the position of the peak, as discussed in Michael S Brandstein and Harvey F Silverman, “A robust method for speech signal time-delay estimation in reverberant rooms”, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 375-378, IEEE, 1997 and Benoit Champagne et al., “Performance of time-delay estimation in the presence of room reverberation”, IEEE Transactions on Speech and Audio Processing, 4(2):148-152, 1996. To address this problem, the Generalized Cross Correlation (GCC) was introduced, as discussed in Charles Knapp and G. Clifford Carter, “The generalized correlation method for estimation of time delay”, IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(4):320-327, 1976. It applies a frequency-domain weighting to the cross-correlation to sharpen the cross-correlation peak and make it more robust to these disturbing factors.
The time difference of arrival (TDoA), $\tau_{m_i,m_j}$, is the difference in propagation time from the source location $X_s$ to the locations of a pair of microphones $m_i$ and $m_j$, and is defined by:
$$\tau_{m_i,m_j} = \frac{\lVert X_s - m_i \rVert - \lVert X_s - m_j \rVert}{c} \tag{1}$$
where c is the sound propagation speed. To estimate the TDoA $\tau_{m_i,m_j}$, the GCC function is calculated on the signals received at $m_i$ and $m_j$. Assuming a single source, the signal received by microphone $m_i$ is as follows:
$$x_i(t) = h_i(t)\, s(t - \tau_i) + n_i(t) \tag{2}$$
where $h_i$ is a microphone-dependent attenuation term that accounts for the propagation losses, $s(t)$ is the source signal, $\tau_i$ is the sound propagation delay from the source to microphone $m_i$, and $n_i$ is a microphone-dependent noise signal.
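Equation (1) can be evaluated directly once the geometry is known. The following is a small sketch under stated assumptions: the speed of sound (343 m/s), the microphone positions, and the source position are hypothetical values chosen only for illustration, and `tdoa` is an illustrative helper name.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at roughly 20 degrees C

def tdoa(source, mic_i, mic_j, c=SPEED_OF_SOUND):
    """Equation (1): (||X_s - m_i|| - ||X_s - m_j||) / c, in seconds."""
    source, mic_i, mic_j = (
        np.asarray(p, dtype=float) for p in (source, mic_i, mic_j)
    )
    return (np.linalg.norm(source - mic_i) - np.linalg.norm(source - mic_j)) / c

# Hypothetical geometry: two microphones 20 cm apart, source about 2 m away.
mic_i = [0.0, 0.0, 0.0]
mic_j = [0.2, 0.0, 0.0]
source = [2.0, 1.0, 0.0]

tau_ij = tdoa(source, mic_i, mic_j)
print(f"TDoA: {tau_ij * 1e3:.3f} ms")
```

The sign convention matters: a positive value means the sound reaches $m_j$ first, and swapping the two microphones flips the sign. The magnitude can never exceed the microphone spacing divided by c.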
The GCC can be calculated efficiently using the discrete-time Fourier transform (DTFT). Given a pair of microphones $m_i$ and $m_j$ with $i \neq j$, the GCC between $x_i(t)$ and $x_j(t)$ is written as:
$$r_{m_i,m_j}(\tau) \triangleq \frac{1}{2\pi} \int_{-\pi}^{\pi} X_i(\omega)\, X_j^{*}(\omega)\, W_{ij}(\omega)\, e^{j\omega\tau}\, d\omega \tag{3}$$
where $X_i(\omega)$ and $X_j(\omega)$ are the Fourier transforms of $x_i(t)$ and $x_j(t)$, respectively, and * denotes complex conjugation. $W_{ij}(\omega)$ represents a suitable weighting function that sharpens $r_{m_i,m_j}(\tau)$ to give a better estimate of $\tau_{m_i,m_j}$. If $W_{ij}(\omega)=1$ for all $\omega$, the standard unweighted cross-correlation is obtained.
Finally, the time difference of arrival between a pair of microphones mi and mj is estimated as:
$$\hat{\tau}_{m_i,m_j} \triangleq \frac{\arg\max_{\tau}\, r_{m_i,m_j}(\tau)}{F_s} \tag{4}$$
where $F_s$ is the sampling frequency, which converts the lag of the peak from samples to seconds.
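The frequency-domain computation of Equations (3) and (4) can be sketched with an FFT as follows. This is an illustrative approximation, not the disclosed implementation: the discrete FFT stands in for the DTFT integral, the function name `gcc_tdoa` and its `weight` callback signature are assumptions, and the test signal is synthetic white noise.

```python
import numpy as np

def gcc_tdoa(x_i, x_j, fs, weight=None):
    """Estimate the TDoA (seconds) of x_i relative to x_j via the GCC.

    weight: optional callable (X_i, X_j) -> W_ij; None means W_ij = 1,
    i.e. the plain unweighted cross-correlation.
    """
    n = len(x_i) + len(x_j)              # zero-pad to avoid circular wrap-around
    X_i = np.fft.rfft(x_i, n=n)
    X_j = np.fft.rfft(x_j, n=n)
    cross = X_i * np.conj(X_j)           # X_i(w) X_j*(w)
    if weight is not None:
        cross = cross * weight(X_i, X_j)
    r = np.fft.irfft(cross, n=n)         # discrete counterpart of Equation (3)
    max_lag = n // 2
    # Reorder so that the array covers lags -max_lag .. +max_lag.
    r = np.concatenate((r[-max_lag:], r[:max_lag + 1]))
    lag = np.argmax(r) - max_lag         # peak position in samples
    return lag / fs                      # Equation (4): samples -> seconds

rng = np.random.default_rng(0)
fs = 16000                               # Hz, assumed sampling frequency
true_lag = 12                            # samples, assumed for the demonstration
s = rng.standard_normal(4096)
x_j = s + 0.05 * rng.standard_normal(s.size)
x_i = np.concatenate((np.zeros(true_lag), s[:-true_lag]))
x_i = x_i + 0.05 * rng.standard_normal(s.size)

tau_hat = gcc_tdoa(x_i, x_j, fs)
print(f"estimated TDoA: {tau_hat * 1e3:.3f} ms")
```

Because the peak is located on the sample grid, the resolution of this estimate is limited to $1/F_s$; finer resolution would require interpolation around the peak, which is omitted here.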
The Roth Weighting Function
The Roth correlation weights the cross-correlation according to the signal-to-noise ratio (SNR) of the signal, as discussed in Peter R Roth, “Effective measurements using digital signal analysis”, IEEE Spectrum, 8(4):62-70, 1971. Its result approximates an optimum linear Wiener-Hopf filter, as discussed in Harry L Van Trees, “Detection, estimation, and modulation theory, part I: detection, estimation, and linear modulation theory”, John Wiley & Sons, 2004. Frequency bands with a low SNR yield a poor estimate of the cross-correlation and are therefore attenuated relative to high-SNR bands. The Roth function is defined as follows:
$$W_{ij}(\omega) = \frac{1}{X_i(\omega)\, X_i^{*}(\omega)} \tag{5}$$
The SCOT Weighting Function
A variation of the Roth weighting function is the Smoothed Coherence Transform (SCOT) (discussed in G Clifford Carter et al., “The smoothed coherence transform”, Proceedings of the IEEE, 61(10):1497-1498, 1973), which applies the same SNR-based weighting concept but allows the two signals being compared to have different noise spectral density functions. It is defined as follows:
$$W_{ij}(\omega) = \frac{1}{\sqrt{X_i(\omega)\, X_i^{*}(\omega)\, X_j(\omega)\, X_j^{*}(\omega)}} \tag{6}$$
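The Roth and SCOT weightings of Equations (5) and (6) translate almost directly into array operations on the two spectra. The sketch below is illustrative only: the function names, the small `EPS` guard against division by zero, and the three-bin example spectra are assumptions introduced here.

```python
import numpy as np

EPS = 1e-12  # assumed small constant to avoid division by zero in empty bins

def roth_weight(X_i, X_j):
    """Equation (5): 1 / (X_i X_i*), which depends on the first channel only."""
    return 1.0 / (np.abs(X_i) ** 2 + EPS)

def scot_weight(X_i, X_j):
    """Equation (6): 1 / sqrt(X_i X_i* X_j X_j*), symmetric in both channels."""
    return 1.0 / (np.sqrt(np.abs(X_i) ** 2 * np.abs(X_j) ** 2) + EPS)

# Hypothetical three-bin spectra; the second channel is twice as strong.
X_i = np.array([1 + 1j, 2 - 1j, 0.5j])
X_j = 2.0 * X_i

w_roth = roth_weight(X_i, X_j)
w_scot = scot_weight(X_i, X_j)
print(w_roth)
print(w_scot)
```

Both functions accept `X_j` so that they share one call signature, even though the Roth weighting ignores it. When the two channels have the same magnitude spectrum the two weightings coincide; they differ only when the channel spectra differ, which is the case SCOT is meant to handle.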
The PHAT Weighting Function
In environments with high reverberation, the Phase Transform (PHAT) weighting function (discussed in Charles Knapp et al., “The generalized correlation method for estimation of time delay”, IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(4):320-327, 1976) is the most appropriate weighting function, as it normalizes the amplitude of the spectral density of the two signals and uses only the phase information to compute the cross-correlation. It was applied to speech signals in reverberant rooms by Brandstein and Silverman in “A robust method for speech signal time-delay estimation in reverberant rooms”, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 375-378, IEEE, 1997. It is defined as follows:
$$W_{ij}(\omega) = \frac{1}{\left| X_i(\omega)\, X_j^{*}(\omega) \right|} \tag{7}$$
GCC-PHAT achieves very good performance when the SNR of the signal is high, but its performance deteriorates as the noise level increases.
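A minimal GCC-PHAT sketch, assuming NumPy, is given below. It is not the implementation of any cited work: the function name `gcc_phat`, the `eps` guard, and the mildly colored synthetic source are illustrative assumptions. Dividing the cross-spectrum by its own magnitude applies Equation (7), so that only phase contributes to the correlation.

```python
import numpy as np

def gcc_phat(x_i, x_j, fs, eps=1e-12):
    """TDoA (seconds) of x_i relative to x_j using the PHAT weighting."""
    n = len(x_i) + len(x_j)               # zero-pad against circular wrap-around
    cross = np.fft.rfft(x_i, n=n) * np.conj(np.fft.rfft(x_j, n=n))
    cross = cross / (np.abs(cross) + eps)  # Equation (7): keep phase only
    r = np.fft.irfft(cross, n=n)
    max_lag = n // 2
    r = np.concatenate((r[-max_lag:], r[:max_lag + 1]))  # lags -max_lag..max_lag
    return (np.argmax(r) - max_lag) / fs

rng = np.random.default_rng(0)
fs = 16000                                # Hz, assumed sampling frequency
true_lag = 7                              # samples, assumed for the demonstration

# Mildly colored source: white noise through a short two-tap filter.
s = np.convolve(rng.standard_normal(4096), [1.0, 0.9], mode="same")
x_j = s + 0.05 * rng.standard_normal(s.size)
x_i = np.concatenate((np.zeros(true_lag), s[:-true_lag]))
x_i = x_i + 0.05 * rng.standard_normal(s.size)

tau_hat = gcc_phat(x_i, x_j, fs)
print(f"estimated TDoA: {tau_hat * 1e3:.4f} ms")
```

Because every frequency bin is given unit magnitude, bins that contain mostly noise are weighted as heavily as bins that contain mostly signal, which is consistent with the degradation at low SNR noted above.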
The ML Weighting Function
Another weighting function of interest is the Hannan and Thomson weighting function (disclosed in Michael S Brandstein et al., “A practical time-delay estimator for localizing speech sources with a microphone array”, Computer Speech and Language, 9(2):153-170, 1995 and Charles Knapp and G. Clifford Carter, “The generalized correlation method for estimation of time delay”, IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(4):320-327, 1976), also known as the Maximum Likelihood (ML) correlation. This weighting function also tries to maximize the SNR of the signal. For speech applications, it may be approximated as:
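$$W_{ij}(\omega) = \frac{\left| X_i(\omega) \right| \left| X_j^{*}(\omega) \right|}{\left| N_i(\omega) \right|^{2} \left| X_j(\omega) \right|^{2} + \left| N_j(\omega) \right|^{2} \left| X_i(\omega) \right|^{2}} \tag{8}$$
where $|N_i(\omega)|^{2}$ is the noise power spectrum at microphone $m_i$.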
The Eckart Weighting Function
The Eckart filter (disclosed in Carl Eckart, “Optimal rectifier systems for the detection of steady signals” (1952)) maximizes the deflection criterion, i.e. the ratio of the change in mean correlation output due to the presence of the signal to the standard deviation of the correlation output due to noise alone. The weighting function achieving this is:
$$W_{ij}(\omega) = \frac{S_i(\omega)\, S_i^{*}(\omega)}{N_i(\omega)\, N_i^{*}(\omega)\, N_j(\omega)\, N_j^{*}(\omega)} \tag{9}$$
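where $S_i(\omega) S_i^{*}(\omega)$ is the speech power spectrum and $N_i(\omega) N_i^{*}(\omega)$ and $N_j(\omega) N_j^{*}(\omega)$ are the noise power spectra at the two microphones.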
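The two noise-aware weightings can be sketched as below. Several assumptions are made loudly here: the noise spectra `N_i`, `N_j` and the speech spectrum `S_i` are taken as already available (in practice they would have to be estimated, for example from noise-only frames, which is not covered by the text above); the ML denominator is taken as the sum of the two noise-weighted terms, following the usual form of the Brandstein et al. approximation; and the function names, the `eps` guard, and the two-bin example values are purely illustrative.

```python
import numpy as np

def ml_weight(X_i, X_j, N_i, N_j, eps=1e-12):
    """Approximate ML (Hannan-Thomson) weighting for speech signals."""
    num = np.abs(X_i) * np.abs(X_j)
    den = (np.abs(N_i) ** 2 * np.abs(X_j) ** 2
           + np.abs(N_j) ** 2 * np.abs(X_i) ** 2)
    return num / (den + eps)

def eckart_weight(S_i, N_i, N_j, eps=1e-12):
    """Eckart weighting: speech power over the product of the noise powers."""
    return np.abs(S_i) ** 2 / (np.abs(N_i) ** 2 * np.abs(N_j) ** 2 + eps)

# Hypothetical two-bin example: the second bin is noisier than the first.
X_i = np.array([1.0 + 0j, 2.0 + 0j])
X_j = X_i.copy()
N_i = np.array([0.5, 1.0])
N_j = N_i.copy()
S_i = np.array([1.0, 1.0])

w_ml = ml_weight(X_i, X_j, N_i, N_j)
w_eckart = eckart_weight(S_i, N_i, N_j)
print(w_ml)
print(w_eckart)
```

In both cases the noisier bin receives the smaller weight, which is the intended behaviour: frequency regions dominated by noise contribute less to the correlation peak.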
Steered Response Power
The steered response power (SRP) is a beamforming-based approach (disclosed in Maximo Cobos et al., “A survey of sound source localization methods in wireless acoustic sensor networks”, Wireless Communications and Mobile Computing, 2017). SRP aims to maximize the power of the received sound using a filter-and-sum beamformer steered to a set of candidate locations defined by a predefined spatial grid. This grid is usually rectangular (as disclosed in the Cobos et al. article above and in Joseph Hector DiBiase, “A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays”, Brown University, Providence, R.I., 2000). The steered response power at a spatial point $\mathbf{x} = [x, y, z]^{T}$ on the grid can be defined as:
$$P(\mathbf{x}) = \sum_{m_1=1}^{M} \sum_{m_2=1}^{M} r_{m_1,m_2}\big(\tau_{m_1,m_2}(\mathbf{x})\big) \tag{10}$$
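where M is the number of microphones, r is the generalized cross-correlation defined in Equation 3, and the TDoA for each microphone pair is obtained from Equation 1 with the candidate point $\mathbf{x}$ taken as the source location. Note that the SRP accuracy depends on both the weighting function chosen for calculating the GCC and the grid points chosen for evaluation.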
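The grid search of Equation (10) can be sketched as follows. This is a simplified illustration under several assumptions: a 2-D geometry with four hypothetical microphones at the corners of a 4 m by 3 m area, a 0.25 m rectangular grid, PHAT weighting, lags rounded to the nearest sample, and a synthetic source whose delays are applied circularly in the frequency domain (a simulation convenience, so that circular correlation is exact). The sum runs over unordered pairs only, since the remaining terms of the double sum are either mirrored copies or constants and do not change the location of the maximum. All names are illustrative.

```python
import itertools
import numpy as np

C = 343.0    # m/s, assumed speed of sound
FS = 16000   # Hz, assumed sampling frequency

def gcc_phat_circular(x_i, x_j, eps=1e-12):
    """Circular GCC-PHAT; r[k] is the value at lag k (negative lags wrap)."""
    cross = np.fft.rfft(x_i) * np.conj(np.fft.rfft(x_j))
    return np.fft.irfft(cross / (np.abs(cross) + eps), n=len(x_i))

def srp_map(signals, mics, grid, fs=FS, c=C):
    """SRP value at every grid point, summed over microphone pairs."""
    pairs = list(itertools.combinations(range(len(mics)), 2))
    gcc = {p: gcc_phat_circular(signals[p[0]], signals[p[1]]) for p in pairs}
    # Distance from every grid point to every microphone, shape (G, M).
    dists = np.linalg.norm(grid[:, None, :] - mics[None, :, :], axis=2)
    power = np.zeros(len(grid))
    for i, j in pairs:
        # Equation (1) at each candidate point, converted to whole samples.
        lags = np.rint((dists[:, i] - dists[:, j]) / c * fs).astype(int)
        power += gcc[(i, j)][lags]   # negative lags index from the end
    return power

rng = np.random.default_rng(1)
mics = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0], [4.0, 3.0]])
source = np.array([1.0, 2.0])        # hypothetical true source position

# Simulate the received signals: delayed copies of white noise plus noise.
n = 4096
S = np.fft.rfft(rng.standard_normal(n))
freqs = np.fft.rfftfreq(n, d=1.0 / FS)
delays = np.linalg.norm(source - mics, axis=1) / C
signals = [
    np.fft.irfft(S * np.exp(-2j * np.pi * freqs * tau), n=n)
    + 0.05 * rng.standard_normal(n)
    for tau in delays
]

xs, ys = np.meshgrid(np.arange(0.0, 4.01, 0.25), np.arange(0.0, 3.01, 0.25))
grid = np.column_stack((xs.ravel(), ys.ravel()))
estimate = grid[np.argmax(srp_map(signals, mics, grid))]
print(estimate)
```

The cost grows with the number of grid points times the number of microphone pairs, and the achievable accuracy is bounded by the grid spacing, which illustrates why both the weighting function and the choice of grid matter.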
Thus, it is desirable to provide a system and method for sound source localization that improves on the accuracy and speed of the known techniques described above, and it is to this end that the disclosure is directed.