1. Technical Field
An “Animation Synthesizer” provides various techniques for automated body animation synthesis and, in particular, for using trainable probabilistic models, derived from audio/video inputs of synchronized human speech and motion, to drive new avatar animations based on arbitrary text and/or speech inputs.
2. Related Art
Virtual assistants, also called avatars, are virtual characters that are often used to facilitate natural and interactive human-machine communication. The role of the avatar depends on its application: it can act as a guide, an assistant, an information presenter, etc. The avatar's appearance can be anthropomorphic (i.e., having human characteristics) or cartoon-like, in a 2-D or 3-D form, depending on the output device (PC, PDA, cell phone, digital television, etc.). One advantage of using avatars is that they make the user-machine interaction more natural by giving the user a sense of communicating with a real human agent. This sense can be reinforced by faithfully mimicking human-human communication, i.e., letting the avatar express realistic emotions through its facial motions and body gestures.
Unfortunately, naturally manipulating an avatar to produce realistic facial expressions and/or body motions is generally a difficult task. For example, conventional speech motion synthesis techniques generally operate to synthesize facial motion that is synchronized with input speech. Synchronization is often achieved by breaking a sample speech signal into small units, such as phonemes, that are mapped to a set of lip poses. Then, given a new speech input that is also broken down into phonemes, facial animations can be generated by concatenating the lip poses corresponding to the phonemes of the new speech signal.
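The phoneme-to-lip-pose concatenation described above can be sketched as a simple lookup-and-concatenate step. The phoneme labels and pose names below are illustrative assumptions, not part of any particular conventional system:

```python
# Hypothetical phoneme-to-viseme lookup table built from a sample speech
# signal; the labels here are illustrative placeholders only.
PHONEME_TO_VISEME = {
    "AA": "open_wide", "IY": "spread", "UW": "rounded",
    "M": "closed", "B": "closed", "F": "lip_teeth", "S": "narrow",
}

def synthesize_lip_poses(phonemes):
    """Map each phoneme of a new speech input to its stored lip pose and
    concatenate the poses into an animation sequence."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

poses = synthesize_lip_poses(["M", "AA", "M", "AA"])
# -> ["closed", "open_wide", "closed", "open_wide"]
```

As the text notes, such per-phoneme concatenation provides synchronization but ignores the influence of neighboring phonemes on mouth shape.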
Various improvements to this basic technique for animating facial motions based on input speech have been proposed in recent years. For example, rather than considering individual phonemes in isolation when modeling or animating lip motion, co-articulation across groups of phonemes, generally on the order of about 10 consecutive phonemes, is evaluated to determine appropriate lip motions relative to a speech input.
More advanced animation techniques create compact statistical models of face motion using machine learning techniques such as hidden Markov models (HMMs) and Gaussian mixtures. Such techniques generally use these statistical models to develop a mapping from voice to face. Models are generally learned from a face's observed dynamics using techniques that consider factors such as position and velocity of facial features to learn a probability distribution over a set of different facial configurations. The resulting model is then used to animate a face based on new speech inputs.
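As a greatly simplified, purely illustrative sketch of such a statistical voice-to-face mapping (a single Gaussian per facial configuration rather than a full HMM or mixture; all names and data are assumptions):

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian; stands in for the probability
    distributions learned over facial configurations."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def train(observations):
    """observations: {facial_config: [acoustic_feature, ...]} pairs from
    synchronized speech/motion data. Fits a mean and variance per
    configuration (a toy stand-in for HMM/mixture training)."""
    model = {}
    for config, feats in observations.items():
        mean = sum(feats) / len(feats)
        var = sum((f - mean) ** 2 for f in feats) / len(feats) or 1e-6
        model[config] = (mean, var)
    return model

def map_voice_to_face(model, feature):
    """Pick the facial configuration whose learned distribution assigns
    the new acoustic feature the highest likelihood."""
    return max(model, key=lambda c: gaussian_logpdf(feature, *model[c]))

model = train({"smile": [0.9, 1.1, 1.0], "neutral": [0.0, 0.1, -0.1]})
```

A real system would model sequences of multi-dimensional features with temporal dynamics; this sketch only shows the probabilistic selection step.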
Another example of speech-based animation techniques uses an input speech utterance to automatically synthesize matching facial motions. Animation is accomplished by first constructing a data structure referred to as an “Anime Graph” using a large set of recorded motions and associated speech. Lip-synching that accounts for co-articulation is then provided by a constrained search of the “Anime Graph” to find the best facial motions for each phoneme or group of co-articulated phonemes.
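The constrained-search step can be illustrated with a toy graph search (this is not the actual “Anime Graph” structure; the node/edge representation and cost model here are assumptions). Nodes are recorded motion clips labeled with a phoneme and a cost, edges connect clips that concatenate smoothly, and the search finds the cheapest path whose labels match the new phoneme sequence:

```python
def best_motion_path(nodes, edges, phonemes):
    """nodes: {id: (phoneme_label, cost)}; edges: {id: [next_id, ...]}.
    Dynamic program over (position in phoneme sequence, node)."""
    best, back = {}, {}
    for nid, (ph, cost) in nodes.items():
        if ph == phonemes[0]:
            best[(0, nid)] = cost
    for i in range(1, len(phonemes)):
        for (pos, nid), cost in list(best.items()):
            if pos != i - 1:
                continue
            for nxt in edges.get(nid, []):
                ph, c = nodes[nxt]
                if ph == phonemes[i] and cost + c < best.get((i, nxt), float("inf")):
                    best[(i, nxt)] = cost + c
                    back[(i, nxt)] = nid
    # Recover the cheapest complete path, if any.
    end = min(((c, n) for (pos, n), c in best.items()
               if pos == len(phonemes) - 1), default=None)
    if end is None:
        return None
    path = [end[1]]
    for i in range(len(phonemes) - 1, 0, -1):
        path.append(back[(i, path[-1])])
    return list(reversed(path))
```

In an actual system the constraint would cover co-articulated phoneme groups and motion-continuity costs on edges, not just per-node labels.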