1. Field of the Invention
The present invention relates to video teleconference technology. In particular, the present invention relates to voice-activated tracking by a camera of a speaking participant of a video teleconference.
2. Discussion of the Related Art
One feature desired in video teleconference equipment is the ability to automatically steer the camera to a participant when he or she speaks. Clearly, before the camera can be steered, it is necessary to locate the speaking participant ("speaker") based on detection of his or her voice, and to reject noise resulting, for example, from multiple paths and interference from other sounds in the environment.
Speaker location is typically achieved by processing the sound received at a large number of microphones, such as disclosed in U.S. Pat. No. 5,737,431. One conventional method is based on estimations of "time delays of arrival" (TDOA) of the same sound at the microphones, modeling the sound source as a point source with circular wavefronts. A second method is based upon a TDOA estimation at each pair of microphones, modeling the sound source as a far field source with planar wavefronts. In that second method, each TDOA estimate provides the direction of sound with respect to a pair of microphones, such as described in U.S. Pat. No. 5,778,082. Typically, regardless of the method used, to accurately determine the location of the speaker, a large number of microphones has to be employed to allow an optimization step (e.g., a least-square optimization) to estimate the location of the speaker. Under the prior art methods, four microphones are insufficient to reliably estimate the speaker location.
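As an illustration of the far field model mentioned above, the following minimal sketch converts a single TDOA estimate into a direction relative to the axis of a microphone pair, using the standard planar-wavefront relation tdoa = spacing * cos(angle) / c. The function name, the speed-of-sound constant, and the clamping step are assumptions made for this example; they are not taken from the cited patents.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at room temperature (assumed value)

def far_field_angle(tdoa_s, spacing_m, c=SPEED_OF_SOUND):
    """Angle in degrees between the microphone-pair axis and the sound direction.

    Hypothetical helper: assumes a planar wavefront, so that
    tdoa = spacing * cos(angle) / c.
    """
    ratio = c * tdoa_s / spacing_m
    # A noisy TDOA can push the ratio slightly outside [-1, 1]; clamp it.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.acos(ratio))

# A zero delay means the source lies broadside to the pair.
print(far_field_angle(0.0, 0.2))
```

A single pair yields only an angle, not a position, which is one reason the conventional methods combine many pairs in an optimization step.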
Once the position of the speaker is determined, a camera is steered towards the location. Unfortunately, because of noise and the acoustics of the environment, the position determined can vary constantly, which can result in undesirable camera movements. One solution, which is described in a copending patent application entitled "Voice-activated Camera Preset Solution and Method of Operation", by Joon Maeng, Ser. No. 08/647,225, filed on May 9, 1996, zooms out to cover a larger area when the speaker position is found to alternate between two adjacent regions. In addition, reflections from the ceiling, the floor, the walls, and table-tops also create false source locations. Camera shots of table tops or the floor resulting from false source locations can be annoying.
The present invention provides accurate location of a speaking participant of a video conference using as few as four microphones in a 3-dimensional configuration. In one embodiment, the present invention provides a method including: (a) receiving sound from the speaking participant at each of the microphones; (b) from the received sound, computing three or more time delays, each time delay representing the difference in arrival times of the sound at a selected pair of microphones; (c) based on the positions of the microphones and the time delays, determining one or more possible positions of the speaking participant; and (d) deriving from the possible positions of the speaking participant a final position.
The present invention provides the one or more possible positions by solving a set of simultaneous equations relating the location of the speaking participant to the positions of the microphones, and relating positions of the microphones to each other; and applying the computed time delays to the solutions of the simultaneous equations. Each possible position is found by symbolically solving a selected group of the set of simultaneous equations. In one embodiment, the present invention eliminates from the possible solutions those that fall outside a predetermined volume in space. In one embodiment, the final position is obtained by selecting the median of the possible solutions. In another embodiment, the final position is an average of the possible solutions. Further, the average can be a weighted average.
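The selection step described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names, the axis-aligned box used as the predetermined volume, and the per-coordinate interpretation of "median" are all assumptions introduced for the example.

```python
from statistics import median

def within_volume(p, lo, hi):
    """True if point p lies inside the axis-aligned box [lo, hi] (assumed volume shape)."""
    return all(l <= c <= h for c, l, h in zip(p, lo, hi))

def final_position(candidates, lo, hi, method="median", weights=None):
    """Reduce candidate (x, y, z) positions to one final position.

    Hypothetical helper: candidates outside the box are discarded first.
    method="median" takes the median of each coordinate separately;
    any other value takes the (optionally weighted) average.
    Returns None if no candidate survives the volume check.
    """
    if weights is None:
        weights = [1.0] * len(candidates)
    kept = [(p, w) for p, w in zip(candidates, weights) if within_volume(p, lo, hi)]
    if not kept:
        return None
    if method == "median":
        return tuple(median(axis) for axis in zip(*(p for p, _ in kept)))
    total = sum(w for _, w in kept)
    return tuple(sum(p[i] * w for p, w in kept) / total for i in range(3))

# Three plausible candidates and one false location below the floor.
candidates = [(1.0, 2.0, 1.0), (1.2, 2.2, 1.1), (0.8, 1.8, 0.9), (9.0, 9.0, -3.0)]
room_lo, room_hi = (-5.0, 0.0, 0.0), (5.0, 5.0, 3.0)
print(final_position(candidates, room_lo, room_hi))
```

The median is less sensitive than the average to a single outlying candidate, while a weighted average lets more trusted candidates count for more.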
In one embodiment, the time delay is computed for each selected pair of microphones, using a cross-correlation function derived from the sounds received at each microphone of the pair. In that embodiment, the received sound is prefiltered. Under one method, for example, the cross-correlation function is computed in the frequency domain. Such frequency-domain computation can be achieved using a fast Fourier transform of the sound signal received at each microphone. One implementation of such a method applies a cross-power spectrum phase procedure.
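A frequency-domain delay estimate of this kind can be sketched as follows. The sketch normalizes the cross-power spectrum to unit magnitude so that only phase remains, then transforms back and picks the lag with the largest peak. It is an illustration under assumptions: the function names are hypothetical, a direct discrete Fourier transform stands in for a fast Fourier transform to keep the example dependency-free, and the prefiltering step is omitted.

```python
import cmath
import random

def dft(x, inverse=False):
    """Direct O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def csp_delay(a, b, max_lag):
    """Estimate the lag d, in samples, such that b[t] is close to a[t - d].

    Hypothetical helper: a positive result means b lags a.
    """
    n = len(a) + len(b)  # zero-pad so circular correlation does not wrap
    fa = dft(list(a) + [0.0] * (n - len(a)))
    fb = dft(list(b) + [0.0] * (n - len(b)))
    cross = []
    for xa, xb in zip(fa, fb):
        c = xb * xa.conjugate()          # cross-power spectrum
        mag = abs(c)
        cross.append(c / mag if mag > 1e-12 else 0j)  # keep phase only
    r = dft(cross, inverse=True)         # phase-only cross-correlation
    return max(range(-max_lag, max_lag + 1), key=lambda d: r[d % n].real)

# Synthetic check: the second signal is the first delayed by 3 samples.
rng = random.Random(0)
mic1 = [rng.uniform(-1.0, 1.0) for _ in range(32)]
mic2 = [0.0] * 3 + mic1
print(csp_delay(mic1, mic2, max_lag=8))
```

Dividing the lag in samples by the sampling rate gives the time delay in seconds. Restricting the search to `max_lag` reflects the fact that the delay between two microphones cannot exceed their spacing divided by the speed of sound.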
According to another aspect of the present invention, a video conference system includes: (a) a number of microphones and a camera positioned in a predetermined configuration, each microphone providing an audio signal representative of sound received at the microphone; (b) a time delay module receiving the audio signals of the microphones and based on the audio signals, providing for each pair of the microphones a time delay estimate associated with the pair of microphones; (c) a position determination module, based on the time delay estimates and the predetermined configuration, providing possible positions of a sound source, and selecting from the possible positions a final position of the sound source; and (d) a camera control module directing the camera towards the sound source using the final position of the sound source.
In that system, the time delay module estimates each time delay by computing a cross-correlation function using audio signals of the pair of microphones, after prefiltering. The same signal processing techniques as discussed above are applicable. The position determination module (a) provides solutions to a set of simultaneous equations relating the location of the speaking participant to the positions of the microphones, and relating positions of the microphones to each other; and (b) applies the computed time delays to the solutions. As discussed above, the equations can be solved symbolically and the solutions programmed into the position determination module. Because noise in the data and reverberation in the environment prevent an accurate estimate of the TDOA, several possible solutions to the location of the speaking participant can usually be found. Thus, the position determination module can select as the final position the median of the solutions, the average of the solutions, or a weighted average of the solutions, or can use some other method of selecting the final position.
In one embodiment, the camera control module controls the tilt, pan and zoom angles of the camera.
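As a rough illustration of how a final position could be turned into camera angles, the sketch below computes pan and tilt from a position expressed relative to the camera. The coordinate convention (camera at the origin, x to the right, y straight ahead, z up) and the function name are assumptions made for this example; zoom is not modeled.

```python
import math

def pan_tilt_degrees(position):
    """Pan and tilt angles, in degrees, that point the camera at position.

    Hypothetical helper: assumes the camera sits at the origin with
    x to the right, y straight ahead, and z up.
    """
    x, y, z = position
    pan = math.degrees(math.atan2(x, y))                   # positive pans right
    tilt = math.degrees(math.atan2(z, math.hypot(x, y)))   # positive tilts up
    return pan, tilt

# A speaker one metre to the right, two metres ahead, half a metre above the camera.
print(pan_tilt_degrees((1.0, 2.0, 0.5)))
```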
The present invention is better understood upon consideration of the detailed description below and the accompanying drawings.