Video conferencing systems are increasingly used to enable remote users to communicate with one another acoustically as well as visually. Thus, even though the remote users are not physically present in the same place, video conferencing systems permit them to communicate as if they were in the same room, allowing them to reinforce their speech with visual gestures and facial expressions. Tracking a particular conference participant in the resultant output video signal is an important aspect of video conferencing systems.
Video conferencing systems often utilize a pan-tilt-zoom (PTZ) camera to track the current speaker. The PTZ camera allows the system to position and optically zoom the camera to perform the tracking task. Initially, control systems for PTZ cameras in a video conferencing system required an operator to make manual adjustments to the camera to maintain the focus on the current speaker. Increasingly, however, users of video conferencing systems demand hands-free operation, where the control of the PTZ camera must be fully automatic.
A number of techniques have been proposed or suggested for automatically detecting a person based on audio and video information. An audio locator typically processes audio information obtained from an array of microphones and determines the position of a speaker. Specifically, when the relative microphone positions are known, the position of the sound source can be determined from the estimated differences in the propagation times of sound waves traveling from a single source to the individual microphones, using well-known triangulation techniques. Similarly, a video locator typically locates one or more objects of interest in a video image, such as the head and shoulders of the speaker in a videoconference. A number of well-known techniques are available for detecting the location of a person in an image, as described, for example, in “Face Recognition: From Theory to Applications” (NATO ASI Series, Springer-Verlag, New York, H. Wechsler et al., editors, 1998), incorporated by reference herein.
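By way of illustration only, the determination of a source position from propagation time differences can be sketched as follows. The sketch is a simplified, hypothetical example and is not taken from any particular audio locator: it assumes a two-dimensional room, microphones at known positions, noise-free time differences, and a speed of sound of 343 m/s. The names `tdoas` and `locate` and the brute-force grid search are assumptions made for this example; a practical system would first have to estimate the time differences from the microphone signals, which is not shown.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def tdoas(source, mics):
    """Return the arrival-time differences (seconds) at mics[1:] relative to mics[0]."""
    ref = math.dist(source, mics[0])
    return [(math.dist(source, m) - ref) / SPEED_OF_SOUND for m in mics[1:]]


def locate(measured, mics, extent=5.0, step=0.05):
    """Search the square [0, extent] x [0, extent] on a regular grid for the
    point whose predicted time differences best match the measured ones
    (least squared error)."""
    best, best_err = None, float("inf")
    n = int(round(extent / step))
    for i in range(n + 1):
        for j in range(n + 1):
            p = (i * step, j * step)
            err = sum((a - b) ** 2 for a, b in zip(tdoas(p, mics), measured))
            if err < best_err:
                best, best_err = p, err
    return best


# Hypothetical layout: four microphones at the corners of a 4 m x 4 m area.
mics = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
speaker = (1.5, 2.5)

# Simulate the time differences a real array would have to estimate.
measured = tdoas(speaker, mics)
print(locate(measured, mics))
```

The grid search is chosen here for clarity rather than efficiency. With noisy measurements, the returned point is only the best match on the grid, and its accuracy is limited by the grid spacing.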
While conventional techniques for tracking a speaker in a video conferencing system perform satisfactorily for many applications, they suffer from a number of limitations, which, if overcome, could greatly expand the utility and performance of such video conferencing systems. Specifically, conventional video conferencing systems are generally reactive in nature. Thus, attention is focused on an event only after the event has already taken place. For example, once a new person begins to speak, there will be some delay before the camera is focused on the new speaker. This delay prevents remote users from feeling as if they were in the same room and experiencing a natural face-to-face interaction.
In the context of face-to-face interactions, it has been observed that humans exhibit a number of signals when a person is about to begin speaking, or when a person is taking a turn from another speaker. See, for example, S. Duncan and Niederehe, “On Signaling That It's Your Turn to Speak,” J. of Experimental Social Psychology, Vol. 23(2), pp. 234-247 (1972); and S. Duncan and D. W. Fiske, Face-to-Face Interaction, Lawrence Erlbaum Publishers, Hillsdale, N.J. (1977). For example, when a person is about to take a turn from another speaker, subtle cues have been observed, such as the next-in-turn speaker leaning forward, directing his or her gaze at the current speaker, or making gestures with his or her arms.
Thus, in attempting to establish natural language communication between humans and machines, researchers have come to appreciate how sophisticated a person's ability is to combine different types of sensed information (cues) with contextual information and previously acquired knowledge. A need exists for an improved technique for predicting events that applies such cues in a video processing system. A further need exists for a method and apparatus that analyze certain cues, such as facial expressions, gaze and body postures, to predict the next speaker or other events. Yet another need exists for a speaker detection system that integrates multiple cues to predict the speaker who will take the next turn. A further need exists for a method and apparatus for detecting a speaker that utilize a characteristic profile for each participant to identify which cues will be exhibited by the participant before he or she speaks.