1. Technical Field
This invention is directed toward a system and method for using audio and video signals to detect a speaker. More specifically, the invention is directed toward a system and method for utilizing audio and video spatial and temporal correlation to robustly detect speakers in an audio/video sequence.
2. Background Art
The visual motion of a speaker's mouth is highly correlated with the audio data generated from the voice box and mouth. This fact has been exploited for applications such as lip/speech reading and for combined audio-visual speech recognition.
Applications where speaker detection is of importance include video conferencing, video indexing, and improving the human computer interface, to name a few. In video conferencing, knowing where someone is speaking can cue a video camera to zoom in on the speaker; it can also be used to transmit only the speaker's video in bandwidth-limited conferencing applications. Speaker detection can also be used to index video (e.g., to locate when someone is speaking), and can be combined with face recognition techniques (e.g., to identify when a specific person is speaking). Finally, speaker detection can be used to improve human computer interaction (HCI) by providing applications with the knowledge of when and where a user is speaking.
There has been a significant amount of work done in detecting faces from images and video for various purposes, and face detection techniques for the most part are well known in the art. There has also been a significant amount of work done in locating speakers using arrays of multiple microphones and sound source localization techniques. There are also text-to-speech systems that utilize hand-coded phoneme-to-viseme rules to animate characters. With such hand-coded rules, however, extracting phonemes is error prone, as is extracting visemes. Additionally, extracting visemes requires greater image resolution than would typically be available in most applications where speaker detection is useful. It also requires a sophisticated model-based feature extractor.
One significant work in speaker detection was described in a publication entitled “Look Who's Talking: Speaker Detection Using Video and Audio Correlation”, by Ross Cutler and Larry Davis. This publication described a method of automatically detecting a person talking using video and audio. The audio-visual correlation was learned using a simple, fully connected time-delay neural network (TDNN). Mel cepstrum coefficients were used as the audio features, and the normalized cross correlation of pixel intensities in a window was used as the video feature. In this method of speaker detection, the structure of the TDNN required extensive training, partly because it had a fully connected 10×10 hidden layer. Additionally, the Mel cepstrum coefficients used to represent the audio features were quite complex. Twelve different coefficients were required to represent the audio data, which required the TDNN to process a large number of parameters in order to learn the audio-visual correlation in speaking. This speaker detection system was also negatively impacted by its inability to compensate for speaker head motion. The images used to train the TDNN, and the images input to it, all had no head movement. This makes the system impractical for speaker detection in most real world applications.
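For illustration only, the normalized cross correlation of pixel intensities mentioned above can be sketched as follows. This is the standard zero-mean formulation and is not taken from the cited publication, whose exact computation is not given here. The function name, the flat-list window representation, and the convention of returning 0 for a constant window are assumptions made for this sketch.

```python
import math


def normalized_cross_correlation(window_a, window_b):
    """Zero-mean normalized cross correlation of two equal-sized windows.

    Each window is a flat sequence of grayscale pixel intensities
    (a hypothetical representation chosen for this sketch).
    The result lies in [-1, 1].
    """
    if len(window_a) != len(window_b) or len(window_a) == 0:
        raise ValueError("windows must be non-empty and the same size")

    n = len(window_a)
    mean_a = sum(window_a) / n
    mean_b = sum(window_b) / n

    # Covariance-like numerator and the product of the two energy terms.
    numerator = sum((a - mean_a) * (b - mean_b)
                    for a, b in zip(window_a, window_b))
    energy_a = sum((a - mean_a) ** 2 for a in window_a)
    energy_b = sum((b - mean_b) ** 2 for b in window_b)
    denominator = math.sqrt(energy_a * energy_b)

    if denominator == 0.0:
        # A constant window has no variation; the correlation is undefined,
        # so this sketch returns 0 by convention (an assumption).
        return 0.0
    return numerator / denominator
```

Under these assumptions, a value near 1 indicates that the two windows have nearly the same intensity pattern, while lower values indicate that the pattern has changed, as might occur in a mouth region during speech.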