The present invention relates to the field of talking-head animations and, more particularly, to the utilization of a unit selection process from databases of audio and image units to generate a photo-realistic talking-head animation.
Talking heads may become the “visual dial tone” for services provided over the Internet, namely, a portion of the first screen an individual encounters when accessing a particular web site. Talking heads may also serve as virtual operators, for announcing events on the computer screen, or for reading e-mail to a user, and the like. A critical factor in providing acceptable talking head animation is essentially perfect synchronization of the lips with sound, as well as smooth lip movements. The slightest imperfections are noticed by a viewer and usually are strongly disliked.
Most methods for the synthesis of animated talking heads use models that are parametrically animated from speech. Several viable head models have been demonstrated, including texture-mapped 3D models, as described in the article “Making Faces”, by B. Guenter et al., appearing in ACM SIGGRAPH, 1998, at pp. 55-66. Parameterized 2.5D models have also been developed, as discussed in the article “Sample-Based Synthesis of Photo-Realistic Talking-Heads”, by E. Cosatto et al., appearing in IEEE Computer Animations, 1998. More recently, researchers have devised methods to learn parameters and their movements from labeled voice and video data. Very smooth-looking animations have been provided by using image morphing driven by pixel-flow analysis.
An alternative approach, inspired by recent developments in speech synthesis, is the so-called “sample-based”, “image-driven”, or “concatenative” technique. The basic idea is to concatenate pieces of recorded data to produce new data. As simple as it sounds, there are many difficulties associated with this approach. For example, a large, “clean” database is required from which the samples can be drawn. Creation of this database is problematic, time-consuming and expensive, but the care taken in developing the database directly impacts the quality of the synthesized output. An article entitled “Video Rewrite: Driving Visual Speech with Audio” by C. Bregler et al., appearing in ACM SIGGRAPH, 1997, describes one such sample-based approach. Bregler et al. utilize measurements of lip height and width, as well as teeth visibility, as visual features for unit selection. However, these features do not fully characterize the mouth. For example, the lips, the presence of the tongue, and the presence of the lower and upper teeth all influence the appearance of the mouth. The approach of Bregler et al. is also limited in that it does not perform full 3D modeling of the head, instead relying on a single plane for analysis, which makes it impossible to include the forehead or the cheek areas located on the side of the head. Further, Bregler et al. utilize triphone segments as the a priori units of video, which sometimes causes the resulting synthesis to lack a natural “flow”.
The present invention relates to the field of talking-head animations and, more particularly, to the utilization of a unit selection process from databases of audio and image units to generate a photo-realistic talking-head animation.
More particularly, the present invention relates to a method of selecting video animation snippets from a database in an optimal way, based on audio-visual cost functions. The animations are synthesized from recorded video samples of a subject speaking in front of a camera, resulting in a photo-realistic appearance. The lip-synchronization is obtained by optimally selecting and concatenating variable-length video units of the mouth area. Synthesizing a new speech animation from these recorded units starts with audio speech and its phonetic annotation from a text-to-speech synthesizer. Then, optimal image units are selected from the recorded set using a Viterbi search through a graph of candidate image units. Costs are attached to the nodes and the arcs of the graph, computed from similarities in both the acoustic and visual domains. Acoustic similarities may be computed, for example, by simple phonetic matching. Visual similarities, on the other hand, require a hierarchical approach that first extracts high-level features (positions and sizes of facial parts), then uses a 3D model to calculate the head pose. The system then projects 3D planes onto the image plane and warps the pixels bounded by the resulting quadrilaterals into normalized bitmaps. Features are then extracted from the bitmaps using principal component analysis of the database. This method preserves coarticulation and temporal coherence, producing smooth, lip-synched animations.
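The projection of a 3D plane onto the image plane, as described above, can be illustrated with a minimal sketch. It assumes a simple pinhole camera and a head pose reduced to a single yaw angle plus a depth offset; the names rotate_y and project_plane, the focal length, and the plane dimensions are hypothetical and are not taken from the specification. The subsequent warping of pixels into a normalized bitmap is not shown.

```python
import math
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]
Point2 = Tuple[float, float]


def rotate_y(p: Point3, yaw: float) -> Point3:
    """Rotate a 3D point about the vertical (y) axis by `yaw` radians."""
    x, y, z = p
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x + s * z, y, -s * x + c * z)


def project_plane(corners: Sequence[Point3], yaw: float,
                  depth: float, focal: float) -> List[Point2]:
    """Project the corners of a 3D plane into a 2D image quadrilateral.

    Assumed pose model: rotate by `yaw`, then move the plane `depth`
    units away from a pinhole camera with focal length `focal`.
    """
    quad = []
    for corner in corners:
        x, y, z = rotate_y(corner, yaw)
        z += depth
        if z <= 0:
            raise ValueError("plane corner lies behind the camera")
        # Perspective division: points farther away appear smaller.
        quad.append((focal * x / z, focal * y / z))
    return quad


# Hypothetical mouth plane, 2 units wide and 1 unit tall, centered on the origin.
mouth_plane = [(-1.0, -0.5, 0.0), (1.0, -0.5, 0.0),
               (1.0, 0.5, 0.0), (-1.0, 0.5, 0.0)]

frontal = project_plane(mouth_plane, yaw=0.0, depth=10.0, focal=100.0)
turned = project_plane(mouth_plane, yaw=0.5, depth=10.0, focal=100.0)
```

With a frontal pose the plane projects to a rectangle; once the head is turned, the edge nearer the camera projects taller than the far edge, giving the general quadrilateral whose pixels would then be warped into a normalized bitmap.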
In accordance with the present invention, once the database has been prepared (off-line), on-line (i.e., “real time”) processing of text input can then be used to generate the talking-head animation synthesized output. The selection of the most appropriate video frames for the synthesis is controlled by using a “unit selection” process that is similar to the process used for speech synthesis. In this case, audio-visual unit selection is used to select mouth bitmaps from the database and concatenate them into an animation that is lip-synched with the given audio track.
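The unit selection described above, a Viterbi search over a graph of candidate image units with costs on the nodes and arcs, can be sketched as follows. This is an illustrative sketch only: the function select_units, the (phoneme, frame) representation of a unit, and both toy cost functions are assumptions made for the example, not the cost functions of the invention.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

U = TypeVar("U")


def select_units(
    candidates: Sequence[Sequence[U]],
    node_cost: Callable[[int, U], float],
    arc_cost: Callable[[U, U], float],
) -> Tuple[List[U], float]:
    """Viterbi search: return the cheapest unit sequence and its total cost.

    candidates[t] lists the database units that could fill target position t.
    """
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every target position needs at least one candidate")

    # best[j]: lowest cumulative cost of any path ending at candidate j.
    best = [node_cost(0, u) for u in candidates[0]]
    back: List[List[int]] = []  # back-pointers, one list per position t >= 1

    for t in range(1, len(candidates)):
        new_best, pointers = [], []
        for u in candidates[t]:
            costs = [best[i] + arc_cost(p, u)
                     for i, p in enumerate(candidates[t - 1])]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[i_min] + node_cost(t, u))
            pointers.append(i_min)
        best = new_best
        back.append(pointers)

    # Trace the cheapest path backwards from the best final candidate.
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    indices = [j]
    for pointers in reversed(back):
        j = pointers[j]
        indices.append(j)
    indices.reverse()
    return [candidates[t][i] for t, i in enumerate(indices)], total


# Toy example: a unit is (phoneme label, frame index in the recorded video).
targets = ["h", "e", "l"]
lattice = [
    [("h", 10), ("h", 40)],
    [("e", 41), ("e", 70)],
    [("l", 42), ("l", 71)],
]


def phonetic_cost(t: int, unit: Tuple[str, int]) -> float:
    # Assumed node cost: simple phonetic matching against the target.
    return 0.0 if unit[0] == targets[t] else 1.0


def continuity_cost(prev: Tuple[str, int], cur: Tuple[str, int]) -> float:
    # Assumed arc cost: free when frames were consecutive in the recording.
    return 0.0 if cur[1] == prev[1] + 1 else 1.0


path, total = select_units(lattice, phonetic_cost, continuity_cost)
```

In this toy lattice the search prefers frames 40, 41 and 42 because they were recorded consecutively, which is one way an arc cost can favor longer, variable-length runs of recorded video over isolated frames.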
Other and further aspects of the present invention will become apparent during the course of the following discussion and by reference to the accompanying drawings.