This invention relates to animation and more particularly to a system for giving an animated figure lifelike voice-controlled facial animation without time-consuming production.
From lip-synching to animation, psychologists and storytellers alike have observed that there is a good deal of mutual information between vocal and facial gestures. As discussed in an article by C. Benoit, C. Abry, M.-A. Cathiard, T. Guiard-Marigny, and T. Lallouache, Read my lips: Where? How? When? And so . . . What? In 8th Int. Congress on Event Perception and Action, Marseille, France, July 1995, Springer-Verlag, facial information can add significantly to the observer's comprehension of the formal and emotional content of speech, and is considered by some a necessary ingredient of successful speech-based interfaces. Conversely, the difficulty of synthesizing believable faces is a widely-noted obstacle to producing acceptable digital avatars, agents, and animation. The human visual system is highly specialized for interpreting facial action. As a result, a poorly animated face can be disturbing and can even interfere with the comprehension of speech, as discussed by H. McGurk and J. MacDonald, Hearing lips and seeing voices, Nature, 264:746-748, 1976.
Lip-synching, a large part of facial animation, is a laborious process in which the voice track is dissected, usually by hand, features such as stops and vowels are identified, and matching face poses are scheduled in the animation track at 2-10 per second. The overwhelming majority of lip-synching research and all such commercial offerings are based on an intermediate phonemic representation, whether obtained by hand, as discussed by F. Parke, A parametric model for human faces, Technical Report UTEC-CSc-75-047, University of Utah, 1974; F. Parke, A model for human faces that allows speech synchronized animation, Journal of Computers and Graphics, 1(1):1-4, 1975; Cohen and D. Massaro, Modeling co-articulation in synthetic visual speech, N. M. Thalmann and D. Thalmann, editors, Models and Techniques in Computer Animation, Springer-Verlag, 1993; T. Ezzat and T. Poggio, Miketalk: A talking facial display based on morphing visemes, Proc. of The Computer Animation Conference, June 1998; J. E. Ball and D. T. Ling, Spoken language processing in the persona conversational assistant, ESCA Workshop on Spoken Dialogue Systems, 1995; and I. Katunobu and O. Hasegawa, An active multimodel interaction system, ESCA Workshop on Spoken Dialogue Systems, 1995; or by speech recognition, as discussed by J. Lewis, Automated lip-sync: Background and techniques, The Journal of Visualization and Computer Animation, 2:118-122, 1991; K. Waters and T. Levergood, Decface: A system for synthetic face applications, Multimedia Tools and Applications, 1:349-366, 1995; and C. Bregler, M. Covell, and M. Slaney, Video rewrite: Driving visual speech with audio, Proc. ACM SIGGRAPH 97, 1997. Visemes are mouth poses thought to occur commonly in speech.
Typically, phonemic or visemic tokens are mapped directly to lip poses, ignoring dynamical factors. In Video Rewrite, as discussed in the above-mentioned Bregler et al. article, vocal, but not facial, co-articulation is partially modeled via triphones, while Baldy, as discussed in the above-mentioned Cohen et al. article, uses an explicit but heuristic co-articulatory model derived from the psychological literature. Co-articulation is interaction between nearby speech segments that has observable effects on the acoustic signal and facial action. For example, one might shape a vowel differently in anticipation of the next vowel one plans to produce.
Phonemic and visemic representations are arguably suboptimal representations of the information common to voice and face, since they obliterate the relationships between vocal prosody and upper facial gesture, and between vocal and gesture energy. Moreover, there is inherent information loss in the discretization to phonemes.
Attempts to generate lip poses directly from the audio signal have been limited to predicting vowel shapes and ignoring temporal effects such as co-articulation.
None of these methods addresses the actual dynamics of the face. For example, there is co-articulation at multiple time-scales, 300 ms or less in the vocal apparatus and longer on the face. Furthermore, as noted in the above-mentioned Benoit et al. article, there is evidence that lips alone convey less than half of the visual information that human subjects can use to disambiguate noisy speech. It has been found that much of the expressive and emotional content of facial gesture happens in the upper half of the face, and this is not at all addressed by speech-oriented facial animation.
In order to provide a more realistic voice-driven animation without animation voice track dissection, in the subject invention a more direct mapping from voice to face is used which involves learning a model of the face's natural dynamics during speech, then learning a mapping from vocal patterns to facial motion trajectories. An entropy-minimization technique permits learning without having to prespecify what one is to look for, with entropy being a measure of ambiguity. Note that hidden Markov models are used to analyze facial and vocal action and to predict how an animated version of the speaker's head should behave. Because of the use of hidden Markov models, the subject process takes minutes, not months, to produce realistic lifelike animation sequences.
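To make the phrase "entropy being a measure of ambiguity" concrete, the following minimal sketch computes the Shannon entropy of two hypothetical distributions over the next facial state. The function name and the example probabilities are illustrative assumptions, not part of the subject system; the sketch only shows that a spread-out distribution scores higher than one dominated by a single outcome.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    # Zero-probability outcomes contribute nothing, so they are skipped.
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

# Hypothetical distributions over four possible next states (assumed values).
ambiguous = [0.25, 0.25, 0.25, 0.25]  # every state equally likely
decisive = [0.97, 0.01, 0.01, 0.01]   # one state dominates

print("ambiguous:", entropy(ambiguous))
print("decisive: ", entropy(decisive))
```

A training procedure that favors low-entropy parameters therefore favors models in which few alternatives compete at each step.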
This method has several important properties. First, voice is analyzed with regard to learned categories of facial gesture, rather than with regard to hypothesized categories of speech perception. Secondly, long-term dependencies such as facial co-articulations are implicitly modeled. Thirdly, a probabilistic framework allows one to find the most probable face trajectory through a sequence of facial images used in producing the animation for a whole utterance, not just for a small window of time. Finally, the output of the system is a sequence of facial control parameters that can be used to drive model-based or image-based face animations.
In one embodiment, a database of synchronized speech and video is used as the starting point. The system then models facial dynamics with a hidden Markov model. The hidden Markov model is then split into two parts: a finite state machine which models the face's qualitative dynamics, e.g., expression to expression transition probabilities; and a set of Gaussian distributions that associate regions of facial configuration space to those states. The system then learns a second set of distributions from regions of vocal configuration space to the states occupied by the face at the same time. This combines with the facial dynamical model to become a new hidden Markov model that is used to analyze new voice-tracks. The result is similar to a speech recognition engine, but instead of giving a most probable sequence of phonemes, the system provides a most probable sequence of facial states, using context from the full utterance for disambiguation when necessary. Using this sequence of facial states and the original set of facial output distributions, the system solves for a maximally probable trajectory through the facial states of the facial configuration space. Every possible facial expression is a point in facial configuration space. The maximally probable trajectory through this space is a sequence of expressions that best matches the target vocal track. The trajectory is then used to drive the animation.
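The step of learning a second set of distributions over vocal configuration space can be sketched as follows. This is a simplified illustration under stated assumptions: each frame is assigned to exactly one facial state (a hard assignment), the audio feature is a single scalar per frame, and the names `fit_vocal_outputs` and `var_floor` are hypothetical. The subject system may instead weight frames by state occupancy probabilities and use multivariate features.

```python
from collections import defaultdict

def fit_vocal_outputs(face_states, audio_features, var_floor=1e-6):
    """Fit one 1-D Gaussian (mean, variance) of the audio feature per facial state.

    face_states[t] is the facial state occupied at frame t; audio_features[t]
    is the synchronized audio feature at the same frame.
    """
    if len(face_states) != len(audio_features):
        raise ValueError("face and audio tracks must be frame-synchronized")

    # Group the audio frames by the facial state active at the same time.
    grouped = defaultdict(list)
    for state, feature in zip(face_states, audio_features):
        grouped[state].append(feature)

    params = {}
    for state, feats in grouped.items():
        mean = sum(feats) / len(feats)
        var = sum((f - mean) ** 2 for f in feats) / len(feats)
        # Floor the variance so a state seen only once stays usable.
        params[state] = (mean, max(var, var_floor))
    return params

# Hypothetical synchronized training data (assumed values).
face_states = [0, 0, 1, 1, 1, 0]
audio_features = [0.1, 0.3, 2.0, 2.2, 1.8, 0.2]

vocal_outputs = fit_vocal_outputs(face_states, audio_features)
for state, (mean, var) in sorted(vocal_outputs.items()):
    print(f"state {state}: mean={mean:.3f} var={var:.4f}")
```

Pairing these vocal output distributions with the transition structure already learned from the face gives a model that can be run on audio alone while still decoding in terms of facial states.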
Two features of the subject invention make this workable. First, given a state sequence, one has a closed-form solution for the maximally probable trajectory that mimics both the natural poses and velocities of the face. Secondly, through the use of an entropy-minimizing learning algorithm, one can estimate probabilistic models which give unique, unambiguous state sequences.
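One way to see why a closed-form solution exists is the following sketch, which is an illustrative reconstruction and not necessarily the formulation used in the subject system. It assumes a single facial control parameter, independent Gaussians on pose and on frame-to-frame velocity for each state, and velocity defined as the first difference of pose. Under those assumptions the log-likelihood is quadratic in the trajectory, so its maximum is the solution of a tridiagonal linear system. All function names and parameter values are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[0] and upper[-1] are ignored (treated as zero)."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def trajectory_cost(x, states, pos_mean, pos_var, vel_mean, vel_var):
    """Negative log-likelihood of trajectory x, up to an additive constant."""
    cost = 0.0
    for t, s in enumerate(states):
        cost += (x[t] - pos_mean[s]) ** 2 / pos_var[s]
        if t > 0:
            cost += (x[t] - x[t - 1] - vel_mean[s]) ** 2 / vel_var[s]
    return 0.5 * cost

def most_probable_trajectory(states, pos_mean, pos_var, vel_mean, vel_var):
    """Minimize trajectory_cost exactly by setting its gradient to zero."""
    n = len(states)
    a = [1.0 / pos_var[s] for s in states]               # pose precisions
    b = [0.0] + [1.0 / vel_var[s] for s in states[1:]]   # velocity precisions
    mu = [pos_mean[s] for s in states]
    nu = [0.0] + [vel_mean[s] for s in states[1:]]
    lower, diag, upper, rhs = ([0.0] * n for _ in range(4))
    for t in range(n):
        diag[t] = a[t] + b[t]
        rhs[t] = a[t] * mu[t] + b[t] * nu[t]
        lower[t] = -b[t]
        if t + 1 < n:  # coupling to the next frame through its velocity term
            diag[t] += b[t + 1]
            rhs[t] -= b[t + 1] * nu[t + 1]
            upper[t] = -b[t + 1]
    return solve_tridiagonal(lower, diag, upper, rhs)

# Hypothetical two-state model: state 0 = mouth closed, state 1 = mouth open.
POS_MEAN, POS_VAR = [0.0, 1.0], [0.05, 0.05]
VEL_MEAN, VEL_VAR = [0.0, 0.2], [0.01, 0.01]
states = [0, 0, 1, 1, 1]

trajectory = most_probable_trajectory(states, POS_MEAN, POS_VAR, VEL_MEAN, VEL_VAR)
print([round(v, 3) for v in trajectory])
```

Because the velocity terms tie neighboring frames together, the solved trajectory eases between state means rather than jumping from one pose to the next, which is the behavior the text attributes to matching both poses and velocities.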
The second point is slightly subtle. It is always possible to extract a most probable state sequence from a hidden Markov model by Viterbi analysis, but typically there may be thousands of other sequences that are only slightly less probable, so that the most probable sequence has only a tiny fraction of the total probability mass. In the subject system, there is a method for estimating sparse hidden Markov models that explicitly minimizes this kind of ambiguity, such that the most probable sequence has most of the probability mass.
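The share of probability mass held by the Viterbi sequence can be measured directly by dividing the joint probability of the best path by the total probability of the observations from the forward algorithm. The sketch below does this for two hand-built, hypothetical two-state models with discrete outputs: a high-entropy one and a low-entropy one. The low-entropy parameters are simply chosen by hand to stand in for what an entropy-minimizing estimator might produce; no such estimator is implemented here.

```python
def viterbi(init, trans, emit, obs):
    """Return (best state path, joint probability of that path and obs)."""
    n = len(init)
    score = [init[s] * emit[s][obs[0]] for s in range(n)]
    backpointers = []
    for o in obs[1:]:
        prev = score
        # Best predecessor for each destination state s.
        ptr = [max(range(n), key=lambda r: prev[r] * trans[r][s]) for s in range(n)]
        score = [prev[ptr[s]] * trans[ptr[s]][s] * emit[s][o] for s in range(n)]
        backpointers.append(ptr)
    last = max(range(n), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]

def total_probability(init, trans, emit, obs):
    """Forward algorithm: probability of obs summed over all state paths."""
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][o]
                 for s in range(n)]
    return sum(alpha)

def best_path_share(model, obs):
    """Fraction of the total probability mass carried by the Viterbi path."""
    _, best = viterbi(*model, obs)
    return best / total_probability(*model, obs)

# Hypothetical models as (initial, transition, emission) tables (assumed values).
dense_model = ([0.5, 0.5],
               [[0.5, 0.5], [0.5, 0.5]],
               [[0.6, 0.4], [0.4, 0.6]])
low_entropy_model = ([0.5, 0.5],
                     [[0.95, 0.05], [0.05, 0.95]],
                     [[0.95, 0.05], [0.05, 0.95]])
obs = [0, 0, 0, 1, 1, 1]

dense_share = best_path_share(dense_model, obs)
low_entropy_share = best_path_share(low_entropy_model, obs)
print("dense model share:      ", round(dense_share, 4))
print("low-entropy model share:", round(low_entropy_share, 4))
```

Both models decode the same best path for this input, but only in the low-entropy model does that path account for most of the probability, which is the condition under which committing to a single state sequence loses little information.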
In summary, a system for learning a mapping between time-varying signals is used to drive facial animation directly from speech, without laborious voice track analysis. The system learns dynamical models of facial and vocal action from observations of a face and the facial gestures made while speaking. Instead of depending on heuristic intermediate representations such as phonemes or visemes, the system trains hidden Markov models to obtain its own optimal representation of vocal and facial action. An entropy-minimizing training technique using an entropic prior ensures that these models contain sufficient dynamical information to synthesize realistic facial motion to accompany new vocal performances. In addition, they can make optimal use of context to handle ambiguity and relatively long-lasting facial co-articulation effects. The output of the system is a sequence of facial control parameters suitable for driving a variety of different kinds of animation ranging from warped photorealistic images to 3D cartoon characters.