Facial Modeling
One approach to modeling facial geometry is to use 3D methods. Parke (Parke, F. I., “A parametric model of human faces”, Ph.D. thesis, University of Utah, 1974) was one of the earliest to adopt such an approach by creating a polygonal facial model. To increase the visual realism of the underlying facial model, the facial geometry is frequently scanned in using Cyberware laser scanners. Additionally, a texture map of the face extracted by the Cyberware scanner may be mapped onto the three-dimensional geometry (Lee, Y. et al., “Realistic modeling for facial animation,” In Proceedings of SIGGRAPH 1995, ACM Press/ACM SIGGRAPH, Los Angeles, Computer Graphics Proceedings, Annual Conference Series, ACM, 55–62). Guenter (Guenter, B. et al., “Making faces”, In Proceedings of SIGGRAPH 1998, ACM Press/ACM SIGGRAPH, Orlando, Fla., Computer Graphics Proceedings, Annual Conference Series, ACM, 55–66) demonstrated recent attempts at obtaining 3D face geometry from multiple photographs using photogrammetric techniques. Pighin et al. (Pighin, F. et al., “Synthesizing realistic facial expressions from photographs,” In Proceedings of SIGGRAPH 1998, ACM Press/ACM SIGGRAPH, Orlando, Fla., Computer Graphics Proceedings, Annual Conference Series, ACM, 75–84) captured face geometry and textures by fitting a generic face model to a number of photographs. Blanz and Vetter (Blanz, V. and T. Vetter, “A morphable model for the synthesis of 3D faces”, In Proceedings of SIGGRAPH 1999, ACM Press/ACM SIGGRAPH, Los Angeles, A. Rockwood, Ed., Computer Graphics Proceedings, Annual Conference Series, ACM, 187–194) demonstrated how a large database of Cyberware scans may be morphed to obtain face geometry from a single photograph.
An alternative to the 3D modeling approach is to model the talking face using image-based techniques, where the talking facial model is constructed using a collection of example images captured of the human subject. These methods have the potential of achieving very high levels of videorealism and are inspired by the recent success of similar sample-based methods for audio speech synthesis (Moulines, E. and F. Charpentier, “Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones,” Speech Communication 9, 453–467, 1990).
Image-based facial animation techniques need to solve the video generation problem: How does one build a generative model of novel video that is simultaneously photorealistic, videorealistic and parsimonious? Photorealism means that the novel generated images exhibit the correct visual structure of the lips, teeth and tongue. Videorealism means that the generated sequences exhibit the correct motion, dynamics and coarticulation effects. Parsimony means that the generative model is represented compactly using a few parameters.
Bregler, Covell and Slaney (Bregler, C. et al., “Video rewrite: Driving visual speech with audio,” In Proceedings of SIGGRAPH 1997, ACM Press/ACM SIGGRAPH, Los Angeles, Computer Graphics Proceedings, Annual Conference Series, ACM, 353–360) describe an image-based facial animation system called Video Rewrite in which the video generation problem is addressed by breaking down the recorded video corpus into a set of smaller audiovisual basis units. Each one of these short sequences is a triphone segment, and a large database with all the acquired triphones is built. A new audiovisual sentence is constructed by concatenating the appropriate triphone sequences from the database together. Photorealism in Video Rewrite is addressed by only using recorded sequences to generate the novel video. Videorealism is achieved by using triphone contexts to model coarticulation effects. In order to handle all the possible triphone contexts, however, the system requires a library with tens and possibly hundreds of thousands of subsequences, which seems to be an overly redundant and non-parsimonious sampling of human lip configurations. Parsimony is thus sacrificed in favor of videorealism.
Essentially, Video Rewrite adopts a decidedly agnostic approach to animation: since it does not have the capacity to generate novel lip imagery from a few recorded images, it relies on the resequencing of a vast amount of original video. Since it does not have the capacity to model how the mouth moves, it relies on sampling the dynamics of the mouth using triphone segments.
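The triphone resequencing idea, and the reason it strains parsimony, can be illustrated with a minimal sketch. This is not the Video Rewrite implementation: the function names, the "sil" padding symbol, the representation of the library as a dictionary from triphones to lists of frame identifiers, and the phoneme inventory size of 40 are all assumptions made for illustration, and the alignment and blending of adjacent segments that a working system would need are omitted.

```python
from typing import Dict, Hashable, List, Tuple

# A triphone is a phoneme together with its left and right neighbors.
Triphone = Tuple[str, str, str]


def to_triphones(phonemes: List[str], pad: str = "sil") -> List[Triphone]:
    """Return one (left, center, right) context per phoneme.

    The sequence is padded with a hypothetical silence symbol so that the
    first and last phonemes also have a full context.
    """
    padded = [pad] + list(phonemes) + [pad]
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


def resequence(
    phonemes: List[str],
    library: Dict[Triphone, List[Hashable]],
) -> List[Hashable]:
    """Concatenate the recorded frames stored for each triphone, in order.

    `library` is an assumed data structure: it maps a triphone to the
    frames recorded for that context. A missing context raises KeyError,
    which is why the library has to cover so many contexts.
    """
    frames: List[Hashable] = []
    for tri in to_triphones(phonemes):
        if tri not in library:
            raise KeyError(f"no recorded segment for triphone {tri}")
        frames.extend(library[tri])
    return frames


def max_triphone_contexts(num_phonemes: int) -> int:
    """Upper bound on distinct triphones: every ordered triple of phonemes."""
    return num_phonemes ** 3
```

Under the assumed inventory of 40 phonemes, max_triphone_contexts(40) gives 64,000 possible contexts, which is consistent in order of magnitude with the library sizes described above. Because resequence only copies stored frames, every output frame is a recorded one, and coarticulation is captured only to the extent that the matching context was recorded.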