The present invention relates to audiovisual systems and, more particularly, to a system and methodology for face synthesis.
Recently there has been significant interest in face synthesis. Face synthesis refers to the generation of a facial image in accordance with a speech signal, so that it appears to a viewer that the facial image is speaking the words uttered in the speech signal. There are many applications of face synthesis including film dubbing, cartoon character animation, interactive agents, and multimedia entertainment.
Face synthesis generally involves a database of facial images in correspondence with distinct sounds of a language. Each distinct sound of the language is referred to as a “phoneme,” and during pronunciation of a phoneme, the mouth and lips of a face form a characteristic, visible configuration, referred to as a “viseme.” Typically, the facial image database includes a “codebook” that maps each phoneme of a language to a corresponding viseme. Accordingly, the input speech text is segmented into phonemes, and the corresponding viseme for each phoneme is sequentially fetched from the database and displayed.
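The conventional codebook lookup described above can be sketched as follows. This is a minimal illustration only: the phoneme labels, the viseme names, and the function visemes_for are hypothetical and do not come from any particular face synthesis system.

```python
# Hypothetical codebook: phoneme label -> viseme identifier.
# The labels and viseme names are illustrative assumptions.
CODEBOOK = {
    "p": "lips_closed",
    "b": "lips_closed",
    "m": "lips_closed",
    "aa": "jaw_open",
    "uw": "lips_rounded",
    "f": "lip_to_teeth",
}


def visemes_for(phonemes, codebook=CODEBOOK):
    """Fetch the viseme for each phoneme, in order of utterance."""
    # A phoneme that is absent from the codebook raises KeyError here.
    return [codebook[p] for p in phonemes]


# Segmented input "m aa p" yields one viseme per phoneme.
frames = visemes_for(["m", "aa", "p"])
```

In this sketch a phoneme that has no codebook entry simply fails the lookup, which is the gap that a restricted codebook leaves open.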
Realistic image quality is an important concern in face synthesis, and transitions from one sound to the next are particularly difficult to implement in a life-like manner because the mouth and lips are moving during the course of pronouncing a sound. In one approach, mathematical routines are employed to interpolate a series of intermediate images between the viseme for one phoneme and the viseme for the next. Such an approach, however, can result in an unnatural or distorted appearance, because the movements from one mouth and lip configuration to another are often non-linear.
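As a rough illustration of the interpolation approach, the sketch below linearly blends two viseme shapes. It assumes each viseme is represented by a short list of numeric shape parameters; that representation, the parameter names, and the function interpolate_visemes are assumptions made for the example.

```python
def interpolate_visemes(start, end, steps):
    """Return `steps` intermediate shapes strictly between start and end.

    Each shape is a list of numeric parameters (an assumed representation).
    """
    if len(start) != len(end):
        raise ValueError("viseme shapes must have the same length")
    frames = []
    for i in range(1, steps + 1):
        t = i / (steps + 1)  # blend factor, evenly spaced in (0, 1)
        frames.append([(1 - t) * a + t * b for a, b in zip(start, end)])
    return frames


# Hypothetical parameters: (mouth opening, lip rounding).
closed = [0.0, 0.0]
opened = [1.0, 0.5]
in_between = interpolate_visemes(closed, opened, 3)
```

Every parameter moves along a straight line at a constant rate, so a mouth movement that is actually non-linear is only approximated.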
In general, it is practical to store only a restricted number of phoneme/viseme sequences in the codebook, even though a larger codebook could improve image quality. For example, image quality may be improved by storing visemes for all the allophones of a phoneme. An allophone of a phoneme is a slight, non-contrastive variation in pronunciation of the phoneme. A similar issue occurs in applying a face synthesis system originally developed for one language to speech in another language, because the other language includes additional phonemes lacking in the original language. Furthermore, the precise shape of a viseme is often dependent on the neighboring visemes, and there has been some interest in using sequences of phonemes of a given length, such as diphones.
Augmenting the codebook for every possible allophone, foreign phoneme, and phoneme sequence with their corresponding visemes consumes an unacceptably large amount of storage. In a common approach, aliasing techniques are employed in which visemes for a missing phoneme or sequence of phonemes are replaced by existing visemes in the codebook. Aliasing, however, tends to introduce artifacts at the frame boundaries, thereby reducing the realism of the final image.
Accordingly, there exists a need for a face synthesis system and methodology that generates realistic facial images. In particular, there is a need for handling transitions from one viseme to the next with improved realism. Furthermore, a need exists for generating realistic facial images for sequences of phonemes that are missing from the codebook or for foreign language phonemes.
These and other needs are addressed by a method and computer-readable medium bearing instructions for synthesizing a facial image, in which a speech frame from an incoming speech signal is compared against acoustic features stored within an audio-visual codebook to produce a set of weights. These weights are used to generate a composite visual feature based on visual features corresponding to the acoustic features, and the composite visual feature is then used to synthesize a facial image. Generating a facial image based on a weighted composition of other images is a flexible approach that allows for more realistic facial images.
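The weighted composition summarized above can be sketched as follows. The text does not state here how the weights are derived from the comparison, so this sketch assumes a softmax over negative squared Euclidean distance. The function names, the sharpness parameter, and the toy feature values are all hypothetical.

```python
import math


def codebook_weights(frame, acoustic_entries, sharpness=1.0):
    """Compare a speech frame with each stored acoustic feature vector.

    Assumed weighting (not specified by the source): softmax over the
    negative squared Euclidean distance, so closer entries weigh more.
    """
    dists = [sum((x - y) ** 2 for x, y in zip(frame, entry))
             for entry in acoustic_entries]
    d_min = min(dists)  # subtract the minimum for numerical stability
    scores = [math.exp(-sharpness * (d - d_min)) for d in dists]
    total = sum(scores)
    return [s / total for s in scores]


def composite_visual_feature(weights, visual_entries):
    """Weighted sum of the visual features paired with each acoustic entry."""
    dim = len(visual_entries[0])
    return [sum(w * v[k] for w, v in zip(weights, visual_entries))
            for k in range(dim)]


# Toy audio-visual codebook with two entries (illustrative values only).
acoustic = [[0.0, 0.0], [2.0, 0.0]]
visual = [[0.0, 1.0], [1.0, 0.0]]

frame = [1.0, 0.0]  # acoustically halfway between the two entries
weights = codebook_weights(frame, acoustic)
composite = composite_visual_feature(weights, visual)
```

Because the frame is equidistant from both entries, the two weights are equal and the composite lies midway between the two stored visual features. A frame closer to one entry would pull the composite toward that entry's visual feature.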
For example, more realistic viseme transitions during the course of pronunciation may be realized by using multiple samples of the acoustic and visual features for each entry in the audio-visual codebook, taken during the course of pronouncing a sound. Visemes for foreign phonemes can be generated by combining visemes from multiple audio-visual codebook entries that correspond to native phonemes. For context-sensitive audio-visual codebooks with a restricted number of phoneme sequences, a weighted combination of features from visually similar phoneme sequences allows for a realistic facial image to be produced for a missing phoneme sequence.
In one embodiment, both of the aforementioned aspects are combined so that each entry in the audio-visual codebook corresponds to a phoneme sequence and includes multiple samples of acoustic and visual features. In some embodiments, the acoustic features may be implemented by a set of line spectral frequencies and the visual features by the principal components of a Karhunen-Loève transform of face points.
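To illustrate how principal components can stand in for face points, the sketch below projects a vector of face-point coordinates onto a small orthonormal basis and reconstructs an approximation from the resulting coefficients. In practice the mean and basis would be obtained from an eigen-decomposition of the covariance of training face points. Here they are hand-picked, hypothetical values, and the function names are assumptions.

```python
def project(points, mean, basis):
    """Coefficients of the centered face points along each basis vector."""
    centered = [p - m for p, m in zip(points, mean)]
    return [sum(c * b for c, b in zip(centered, vec)) for vec in basis]


def reconstruct(coeffs, mean, basis):
    """Approximate face points: mean + sum of coefficient * basis vector."""
    out = list(mean)
    for c, vec in zip(coeffs, basis):
        for i, b in enumerate(vec):
            out[i] += c * b
    return out


# Hypothetical 3-coordinate example keeping 2 principal components.
mean = [1.0, 2.0, 3.0]
basis = [[1.0, 0.0, 0.0],   # assumed orthonormal basis vectors
         [0.0, 1.0, 0.0]]

face_points = [2.0, 4.0, 3.5]
coeffs = project(face_points, mean, basis)
approx = reconstruct(coeffs, mean, basis)
```

The third coordinate falls back to the mean because no retained component spans it. This is the usual trade-off when only the leading components are kept as the visual feature.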
Additional objects, advantages, and novel features of the present invention will be set forth in part in the description that follows, and in part, will become apparent upon examination or may be learned by practice of the invention. The objects and advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the appended claims.