The present invention relates generally to the field of acoustics and more particularly to the problem of localizing an acoustic source for purposes of, for example, tracking the source with a video camera.
In current video-conferencing environments, a set of cameras is typically set up at a plurality of different locations to provide video of the active talker as he or she contributes to the discussion. In previous teleconferencing environments, this tedious task needed the full involvement of professional camera operators. More recently, however, artificial object tracking techniques have been advantageously used to locate and track an active talker automatically in three-dimensional space, namely, to determine his or her range and azimuth, as well as his or her elevation. In this manner, the need for one or more human camera operators can be advantageously eliminated.
There are several possible approaches for automatically tracking an active talker. Broadly they can be divided into two classes (the class of visual tracking and the class of acoustic tracking) depending on what particular type of information (visual or acoustic cues, respectively) is employed. Even though visual tracking techniques have been investigated for several decades and have had reasonably good success, acoustic source localization systems have certain advantages that are not present in vision-based tracking systems. For example, acoustic approaches which receive an acoustic signal omnidirectionally can advantageously act in the dark. Therefore they are able to detect and locate sound sources in the rear or which are otherwise "in hiding" from the view of the camera.
Humans, like most vertebrates, have two ears which form a microphone array, mounted on a mobile base (i.e., a human head). By continuously receiving and processing the propagating acoustic signals with such a binaural auditory system, we accurately and instantaneously gather information about the environment, particularly about the spatial positions and trajectories of sound sources and about their states of activity. However, the remarkable performance of our binaural auditory system poses a major technical challenge for acoustic engineers attempting to recreate the same effect artificially, primarily as a result of room reverberation. Nonetheless, microphone array processing is a rapidly emerging technique which can play an important role in a practical solution to the active talker tracking problem.
In general, locating point sources using measurements or estimates from passive, stationary sensor arrays has had numerous applications in navigation, aerospace, and geophysics. Algorithms for radiative source localization, for example, have been studied for nearly 100 years, particularly for radar and underwater sonar systems. Many processing techniques have been proposed, with differing levels of complexity and differing restrictions. The application of such source localization concepts to the automation of video camera steering in teleconferencing applications, however, has been only recently addressed.
Specifically, existing acoustically-based source localization methods can be loosely divided into three categories: steered beamformer-based techniques, high-resolution spectral estimation-based techniques, and time delay estimation-based techniques. (See, e.g., "A Practical Methodology for Speech Source Localization with Microphone Arrays" by M. S. Brandstein et al., Comput., Speech, Language, vol. 2, pp. 91-126, November 1997.) With continued investigation over the last two decades, the time delay estimation-based location method has become the technique of choice, especially in recent digital systems. In particular, research efforts that have been applied to time delay estimation-based source localization techniques primarily focus on obtaining improved (in the sense of accuracy, robustness, and efficiency) source location estimators which can be implemented in real-time with a digital computer.
More specifically, time delay estimation-based localization systems determine the location of acoustic sources based on a plurality of microphones in a two-step process. In the first step, a set of time delays of arrival (TDOAs) among different microphone pairs is calculated. That is, for each of a set of microphone pairs, the relative time delay between the arrival of the acoustic source signal at each of the microphones in the pair is determined. In the second step, this set of TDOA information is then employed to estimate the acoustic source location with the knowledge of the particular microphone array geometry. Methods which have been employed to perform such localization (i.e., the second step of the two-step process) include, for example, the maximum likelihood method, the triangulation method, the spherical intersection method, and the spherical interpolation method. (See the discussion of these techniques below.)
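By way of illustration only, the quantity produced by the first step can be related to the array geometry as in the following minimal sketch. The function name `tdoa_seconds`, the coordinate layout, and the assumed speed of sound of 343 m/s are hypothetical choices made for this example and are not taken from the specification. The sketch computes the TDOA that a known source position would produce at one microphone pair; the second step of the process inverts this relationship over several pairs.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def tdoa_seconds(source, mic_a, mic_b, c=SPEED_OF_SOUND):
    """TDOA of a point source at a microphone pair, in seconds.

    Positive when the wavefront reaches mic_b later than mic_a.
    All positions are (x, y, z) tuples in metres (hypothetical layout).
    """
    range_a = math.dist(source, mic_a)  # source-to-microphone distances
    range_b = math.dist(source, mic_b)
    return (range_b - range_a) / c


# Hypothetical geometry: a talker and one microphone pair.
talker = (0.0, 0.0, 0.0)
mic_1 = (3.0, 4.0, 0.0)   # 5 m from the talker
mic_2 = (0.0, 0.0, 12.0)  # 12 m from the talker
delay = tdoa_seconds(talker, mic_1, mic_2)
```

Each measured TDOA constrains the source to a surface of constant range difference with respect to the pair, which is why several pairs and knowledge of the array geometry are needed to fix a location.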
Specifically, time delay estimation (TDE) (i.e., the first step of the two-step process) is concerned with the computation of the relative time delay of arrival between different microphone sensors. In developing a time delay estimation algorithm, it is necessary to make use of an appropriate parametric model for the acoustic environment. Two parametric acoustic models for TDE problems, namely, ideal free-field and real reverberant models, may be employed.
Generally then, the task of a time delay estimation algorithm is to estimate the model parameters (more specifically, the TDOAs) based on the model employed, which typically involves determining parameter values that provide minimum errors in accordance with the received microphone signals. In particular, conventional prior art time delay estimation-based acoustic source localization systems typically use a generalized cross-correlation (GCC) method that selects as its estimate the time delay which maximizes the cross-correlation function between time-shifted versions of the signals of the two distinct microphones. (See, e.g., "The Generalized Correlation Method for Estimation of Time Delay" by C. H. Knapp et al., IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, pp. 320-327, August 1976.)
More specifically, in the GCC approach, the TDOA between two microphone sensors can be found by computing the cross-correlation function of their signals and selecting the peak location. The peak can be sharpened by pre-whitening the signals before computing the cross-correlation, which leads to the so-called phase transform method. Techniques have been proposed to improve the generalized cross-correlation (GCC) algorithms in the presence of noise. (See, e.g., "A Pitch-Based Approach to Time-Delay Estimation of Reverberant Speech" by M. S. Brandstein, Proc. IEEE ASSP Workshop Appls. Signal Processing Audio Acoustics, 1997). But because GCC is based on a simple signal propagation model in which the signals acquired by each microphone are regarded as delayed replicas of the source signal plus background noise, it has a fundamental drawback of an inability to cope well with the reverberation effect. (See, e.g., "Performance of Time-Delay Estimation in the Presence of Room Reverberation" by B. Champagne et al., IEEE Trans. Speech Audio Processing, vol. 4, pp. 148-152, March 1996.) Although some improvement may be gained by cepstral prefiltering, shortcomings still remain. (See, e.g., "Cepstral Prefiltering for Time Delay Estimation in Reverberant Environments" by A. Stephenne et al., Proc. IEEE ICASSP, 1995, pp. 3055-58.) Even though more sophisticated techniques exist, they tend to be computationally intensive and are thus not well suited for real-time applications. (See, e.g., "Modeling Human Sound-Source Localization and the Cocktail-Party-Effect," by M. Bodden, Acta Acoustica 1, pp. 43-55, 1993.) Therefore, an alternative approach to the GCC method for use in reverberant environments would be highly desirable.
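The peak-picking idea underlying the cross-correlation approach may be sketched as follows. This is a deliberately simplified illustration using a plain (unweighted) time-domain cross-correlation, not the pre-whitened phase transform variant, and it assumes the ideal free-field model in which one signal is merely a delayed copy of the other. The function name `cross_correlation_delay`, the search range `max_lag`, and the synthetic test signals are hypothetical.

```python
import random


def cross_correlation_delay(x1, x2, max_lag):
    """Return the lag (in samples) that maximizes sum_n x1[n] * x2[n + lag].

    A positive result means x2 is a delayed version of x1.
    """
    best_lag, best_value = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        value = 0.0
        for i, sample in enumerate(x1):
            j = i + lag
            if 0 <= j < len(x2):  # skip terms that fall outside x2
                value += sample * x2[j]
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag


# Synthetic free-field example: x2 is x1 delayed by 5 samples.
rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(200)]
x1 = source
x2 = [0.0] * 5 + source[:-5]
lag = cross_correlation_delay(x1, x2, max_lag=10)
```

In a reverberant room each microphone signal also contains reflected copies of the source, so the correlation function acquires additional peaks; this is the weakness of the simple propagation model noted above.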
In accordance with an illustrative embodiment of the present invention, a real-time passive acoustic source localization system for video camera steering advantageously determines the relative delay between the direct paths of two estimated channel impulse responses. The illustrative system employs a novel approach referred to herein as the "adaptive eigenvalue decomposition algorithm" (AEDA) to make such a determination, and then advantageously employs a "one-step least-squares" algorithm (OSLS) for purposes of acoustic source localization. The illustrative system advantageously provides the desired features of robustness, portability, and accuracy in a reverberant environment.
More specifically, and in accordance with one aspect of an illustrative embodiment of the present invention, the AEDA technique directly estimates the (direct path) impulse response from the sound source to each of the microphones in a pair of microphones, and then uses these estimated impulse responses to determine the TDOA associated with the given pair of microphones, by determining the distance between the first peaks thereof (i.e., the first significant taps of the corresponding transfer function).
For example, in accordance with one illustrative embodiment of the present invention, a passive acoustic source localization system minimizes an error function (i.e., a difference) which is computed with the use of two adaptive filters, each such filter being applied to a corresponding one of the two signals received from the pair of microphones for which it is desired to compute a TDOA. The filtered signals are advantageously subtracted from one another to produce the error signal, which is minimized by a conventional adaptive filtering algorithm such as, for example, an LMS (least-mean-squares) technique. Such techniques are used, for example, in acoustic echo cancellation systems and are fully familiar to those of ordinary skill in the art. Then, the TDOA may be estimated by determining the "distance" (i.e., the time) between the first significant taps of the two resultant adaptive filter transfer functions.
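The two-filter arrangement described above may be illustrated by the following minimal sketch, which is offered under several assumptions and is not the specification's own implementation. In particular, the unit-norm rescaling after each update (used here to exclude the trivial all-zero solution), the single-tap initialization, the step size `mu`, the filter length `taps`, the 50% threshold used to define a "significant" tap, and all function names are hypothetical. The toy signals are pure delays of a common source, so each adapted filter ends up with a single dominant tap.

```python
import math
import random


def first_significant_tap(w, threshold=0.5):
    """Index of the first tap whose magnitude reaches threshold * max |tap|."""
    peak = max(abs(c) for c in w)
    for k, c in enumerate(w):
        if abs(c) >= threshold * peak:
            return k
    return 0


def estimate_tdoa(x1, x2, taps=8, mu=0.01):
    """Adapt w1 (applied to x1) and w2 (applied to x2) so that the
    difference of the filtered signals is minimized; return the tap
    distance in samples (positive when x2 lags x1)."""
    w1 = [0.0] * taps
    w2 = [0.0] * taps
    w1[taps // 2] = 1.0  # assumed initialization: a single unit tap
    for n in range(taps - 1, len(x1)):
        f1 = [x1[n - k] for k in range(taps)]  # most recent samples first
        f2 = [x2[n - k] for k in range(taps)]
        # Error signal: difference between the two filtered signals.
        e = sum(a * b for a, b in zip(w1, f1)) - sum(a * b for a, b in zip(w2, f2))
        # LMS gradient step on e**2 / 2 for both filters.
        w1 = [w - mu * e * x for w, x in zip(w1, f1)]
        w2 = [w + mu * e * x for w, x in zip(w2, f2)]
        # Assumed constraint: rescale to unit norm to avoid w1 = w2 = 0.
        norm = math.sqrt(sum(c * c for c in w1) + sum(c * c for c in w2))
        w1 = [c / norm for c in w1]
        w2 = [c / norm for c in w2]
    return first_significant_tap(w1) - first_significant_tap(w2)


# Toy signals: a common source reaches microphone 1 after 2 samples
# and microphone 2 after 5 samples, so the true TDOA is 3 samples.
rng = random.Random(0)
s = [rng.gauss(0.0, 1.0) for _ in range(3000)]
x1 = [0.0] * 2 + s[:-2]
x2 = [0.0] * 5 + s[:-5]
tdoa_samples = estimate_tdoa(x1, x2)
```

The error vanishes when the filter applied to the first signal matches the channel of the second microphone and vice versa, which is why the distance between the first significant taps of the two filters reflects the relative delay of the direct paths.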
In accordance with another aspect of an illustrative embodiment of the present invention, acoustic source localization is subsequently performed (based on the resultant TDOAs) with use of an OSLS algorithm, which advantageously reduces the computational complexity but achieves the same results as a conventional spherical interpolation (SI) method. And in accordance with still another aspect of the present invention, the filter coefficients may be advantageously updated in the frequency domain using the unconstrained frequency-domain LMS algorithm, so as to take advantage of the computational efficiencies of a Fast Fourier Transform (FFT).