1. Technical Field
This invention is directed toward a system and method for lip synchronization. More specifically, this invention is directed toward a system and method that uses Hidden Markov Models to generate a sequence of images or video of a speaker's lip movements that correlates with an audio signal of a voice.
2. Background Art
Movement of the lips and chin during speech is an important component of facial animation. Although the acoustic and visual information of different speakers has vastly different characteristics, the two are not completely independent, since lip movements must be synchronized to speech. Using voice as the input, lip synchronization synthesizes lip movements that correlate with the speech signal. This technique can be used in many applications such as video-phone, live broadcast, long-distance education, and movie dubbing.
In the last ten years, much work has been done in the area of face synthesis and lip synchronization. Techniques based on Vector Quantization (VQ) [1], Neural Networks [2,3,4], Hidden Markov Models (HMMs) [5,6,7] and Linear Predictive Analysis [8] have been proposed to map speech to lip movements. Most of these systems are based on a phonemic representation (phoneme or viseme). For example, Video Rewrite [9] re-orders existing video frames based on recognized phonemes. Since different people speak in different tones, considerable information is lost in a phoneme-based approach. Moreover, the phonemic representation differs from language to language. Brand introduced a method, based on HMMs, of generating full facial animation directly from audio signals [6]. Although this method has achieved reasonable results, its animation is rudimentary because it uses a mean face configuration with only 26 learned states.
Restricted by algorithm efficiency, none of the aforementioned systems can support real-time face synthesis. Recently, several methods have been proposed toward this end. Goff et al. described the first prototype of the analysis-synthesis of a speaking face running in near real-time [10]. Goff used five anatomical parameters to animate a lip model adapted to speech, with a 200 ms delay between audio and video. Huang and Chen implemented a near real-time audio-to-visual mapping algorithm that maps the audio parameter set to the visual parameter set using a Gaussian Mixture Model and a Hidden Markov Model [11], but no delay data was mentioned. Morishima presented a near real-time voice-driven talking head with a 64 ms delay between audio and video [12]. He converted the LPC Cepstrum parameters into mouth shape parameters using a neural network trained on vocal features. A primary reason for the delays in these previous near real-time algorithms is that future video frames need to be processed to ensure reasonable accuracy in synthesis. This precludes these methods from being used for actual real-time lip synthesis.
It is noted that in the preceding paragraphs, as well as in the remainder of this specification, the description refers to various individual publications identified by a numeric designator contained within a pair of brackets. For example, such a reference may be identified by reciting “reference [1]” or simply “[1]”. A listing of the publications corresponding to each designator can be found at the end of the Detailed Description section.