Talking heads may become the “visual dial tone” for services provided over the Internet, that is, a portion of the first screen an individual encounters when accessing a particular web site. Talking heads may also serve as virtual operators that announce events on the computer screen, read e-mail to a user, and the like. A critical factor in providing acceptable talking head animation is essentially perfect synchronization of the lips with sound, together with smooth lip movements. Viewers notice even the slightest imperfections and usually dislike them strongly.
Most methods for synthesizing animated talking heads use models that are parametrically animated from speech. Several viable head models have been demonstrated, including texture-mapped 3D models, as described in the article “Making Faces”, by B. Guenter et al., appearing in ACM SIGGRAPH, 1998, at pp. 55-66. Parameterized 2.5D models have also been developed, as discussed in the article “Sample-Based Synthesis of Photo-Realistic Talking-Heads”, by E. Cosatto et al., appearing in IEEE Computer Animation, 1998. More recently, researchers have devised methods to learn parameters and their movements from labeled voice and video data. Very smooth-looking animations have been produced by using image morphing driven by pixel-flow analysis.
An alternative approach, inspired by recent developments in speech synthesis, is the so-called “sample-based”, “image-driven”, or “concatenative” technique. The basic idea is to concatenate pieces of recorded data to produce new data. As simple as it sounds, there are many difficulties associated with this approach. For example, a large, “clean” database is required from which the samples can be drawn. Creating this database is difficult, time-consuming and expensive, but the care taken in developing the database directly affects the quality of the synthesized output. An article entitled “Video Rewrite: Driving Visual Speech with Audio” by C. Bregler et al., appearing in ACM SIGGRAPH, 1997, describes one such sample-based approach. Bregler et al. utilize measurements of lip height and width, as well as teeth visibility, as visual features for unit selection. However, these features do not fully characterize the mouth. For example, the lips, the presence of the tongue, and the presence of the lower and upper teeth all influence the appearance of the mouth. The approach of Bregler et al. is also limited in that it does not perform a full 3D modeling of the head, relying instead on a single plane for analysis. This makes it impossible to include the forehead or the cheek areas located on the side of the head. Further, Bregler et al. utilize triphone segments as the a priori units of video, which sometimes causes the resulting synthesis to lack a natural “flow”.
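As a rough illustration of the concatenative idea, the following minimal sketch selects one recorded mouth sample per output frame from a small database, using lip width, lip height, and teeth visibility as the visual features. It is not the method of Bregler et al. or of any cited article: the names `feature_distance` and `select_units`, the Euclidean distance, the `smooth_weight` parameter, and the Viterbi-style search are all assumptions made for illustration only.

```python
from math import sqrt

# Hypothetical feature vector: (lip_width, lip_height, teeth_visibility).
# Each database entry pairs a recorded frame id with its measured features.

def feature_distance(a, b):
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_units(targets, database, smooth_weight=0.5):
    """Pick one database frame per target so that the summed target cost
    plus weighted frame-to-frame transition cost is minimal (Viterbi search)."""
    n = len(database)
    # cost[j]: best total cost of any path that ends in database[j]
    cost = [feature_distance(targets[0], feats) for _, feats in database]
    back = []  # back[t][j]: best predecessor of database[j] at step t + 1
    for target in targets[1:]:
        new_cost, pointers = [], []
        for _, feats in database:
            # Cheapest way to arrive at this sample from the previous step
            best_i = min(
                range(n),
                key=lambda i: cost[i]
                + smooth_weight * feature_distance(database[i][1], feats),
            )
            transition = smooth_weight * feature_distance(database[best_i][1], feats)
            new_cost.append(cost[best_i] + transition + feature_distance(target, feats))
            pointers.append(best_i)
        cost = new_cost
        back.append(pointers)
    # Trace the cheapest path backwards from the best final sample
    j = min(range(n), key=lambda i: cost[i])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [database[j][0] for j in path]

# Toy database: closed, half-open, and open mouth (made-up values)
database = [
    (10, (4.0, 0.0, 0.0)),
    (11, (4.0, 1.0, 0.5)),
    (12, (4.0, 2.0, 1.0)),
]
# Desired trajectory: closed, open, closed
targets = [(4.0, 0.1, 0.0), (4.0, 1.9, 1.0), (4.0, 0.0, 0.0)]

nearest_only = select_units(targets, database, smooth_weight=0.0)
very_smooth = select_units(targets, database, smooth_weight=10.0)
print(nearest_only, very_smooth)
```

With `smooth_weight` set to zero, each frame is simply the nearest database sample, which tracks the target closely but may jump between dissimilar images. A large weight penalizes such jumps so heavily that the sequence barely moves. A practical system has to balance the two, and its result can only be as good as the samples the database contains.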