Historically, it has been problematic to assess consistency (or lack thereof) between facial motion or expression and speech signals in video. It has long been desirable to provide such assessment with a view to achieving objectives such as (but by no means limited to) the following: detecting “monologue” in digital video (defined as a talking face on-screen with corresponding speech present in the soundtrack); evaluating in detail the quality of post-production movie soundtrack editing and dubbing; evaluating in detail the quality of lip-synchronization when developing animated characters; providing a mechanism for detecting speakers in a meeting transcription scenario with multiple cameras; providing a mechanism for detecting when a computer user is speaking to the screen for audio-visual speech recognition; and providing a supplementary measure for verifying that the speech input to a voice-and-face-based biometrics system corresponds to the face on the video input.
Conventional efforts involving synchrony-based solutions include: J. Hershey et al., “Using Audio-Visual Synchrony to Locate Sounds,” Proc. NIPS (Annual Conference on Neural Information Processing Systems) 1999; M. Slaney et al., “FaceSync: A Linear Operator for Measuring Synchronization of Video Facial Images and Audio Tracks,” Proc. NIPS 2001; and J. W. Fisher III et al., “Learning Joint Statistical Models for Audio-Visual Fusion and Synchronization,” Proc. NIPS 2001. These efforts particularly involve the use of measures based on correlation or covariance under very simple model assumptions. However, these schemes are very limited in that, e.g., they assign high scores to any facial movements that vary consistently with the audio, without considering the plausibility of those movements either in isolation or as a temporal sequence.
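By way of illustration only, the limitation noted above may be sketched with a generic correlation measure. The following is a minimal sketch and is not the method of any of the cited references; the function name `synchrony_score`, the per-frame features (audio energy and a scalar lip-motion value), and the synthetic data are all assumptions introduced here for illustration.

```python
import math
import random


def synchrony_score(audio_feat, visual_feat):
    """Pearson correlation between two per-frame feature sequences.

    Hypothetical helper: a generic correlation measure, not the specific
    operator used in any cited reference.
    """
    n = len(audio_feat)
    if n != len(visual_feat) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_a = sum(audio_feat) / n
    mean_v = sum(visual_feat) / n
    cov = sum((a - mean_a) * (v - mean_v)
              for a, v in zip(audio_feat, visual_feat))
    var_a = sum((a - mean_a) ** 2 for a in audio_feat)
    var_v = sum((v - mean_v) ** 2 for v in visual_feat)
    denom = math.sqrt(var_a * var_v)
    # A constant sequence carries no synchrony information.
    return cov / denom if denom > 0 else 0.0


# Synthetic data (assumed for illustration; not taken from any reference).
rng = random.Random(0)
audio_energy = [rng.random() for _ in range(200)]

# Lip motion that follows the audio, with a little measurement noise.
lip_motion = [a + rng.gauss(0.0, 0.1) for a in audio_energy]

# An arbitrarily rescaled and shifted copy of the audio: the amplitudes need
# not resemble real lip movement, yet the correlation is unaffected.
rescaled_motion = [50.0 * a - 7.0 for a in audio_energy]

# Motion generated independently of the audio.
unrelated_motion = [rng.random() for _ in range(200)]

score_lip = synchrony_score(audio_energy, lip_motion)
score_rescaled = synchrony_score(audio_energy, rescaled_motion)
score_unrelated = synchrony_score(audio_energy, unrelated_motion)
```

Because correlation is invariant to scaling and offset, `score_rescaled` is essentially 1 even though nothing in the measure checks whether such movement could be produced by a face. A measure of this kind rewards co-variation alone, which is the shortcoming described above.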
In R. Cutler et al., “Look Who's Talking: Speaker Detection using Video and Audio Correlation,” Proc. ICME (International Conference on Multimedia and Expo) 2000, there is contemplated the incorporation of limited knowledge of the temporal evolution of audio and lip sequences, which serves as a crude measure of plausibility. Slaney et al., supra, also suggest the use of a synthesis-based, generative approach in which the speech signal is used to generate a “typical” facial movement, and the error between the movements of this “typical” face and those of the face of interest is evaluated. However, these purely strong-model-based schemes will tend to give low scores to certain types of noisy data even when consistency exists.
In view of the foregoing, a need has been recognized in connection with overcoming the shortcomings and disadvantages presented by conventional efforts towards assessing consistency between facial motion and speech signals in video.