The present invention relates to the field of photo-realistic imaging. More particularly, the invention relates to a method for generating talking heads in a text-to-speech synthesis application which provides for realistic-looking coarticulation effects.
Visual TTS, the integration of a "talking head" into a text-to-speech ("TTS") synthesis system, can be used for a variety of applications. Such applications include, for example, model-based image compression for video telephony, presentations, avatars in virtual meeting rooms, intelligent computer-user interfaces such as E-mail reading and games, and many other operations. An example of an intelligent user interface is an E-mail tool on a personal computer which uses a talking head to express transmitted E-mail messages. The sender of the E-mail message could annotate the E-mail message by including emotional cues with or without text. Thus, a boss wishing to send a congratulatory E-mail message to a productive employee can transmit the message in the form of a happy face. Different emotions such as anger, sadness, or disappointment can also be emulated.
To achieve the desired effect, the animated head must be believable. That is, it must look real to the observer. Both the photographic aspect of the face (natural skin appearance, realistic shapes, absence of rendering artifacts) and the lifelike quality of the animation (realistic head and lip movements in synchrony with sound) must be perfect, because humans are extremely sensitive to the appearance and movement of a face.
Effective visual TTS can grab the attention of the observer, providing a personal user experience and a sense of realism to which the user can relate. Visual TTS using photorealistic talking heads, the subject of the present invention, has numerous benefits, including increased intelligibility over other methods such as cartoon animation, increased quality of the voice portion of the TTS system, and a more personal user interface.
Various approaches exist for realizing audio-visual TTS synthesis algorithms. Simple animation or cartoons are sometimes used. Generally, the more meticulously detailed the animation, the greater its impact on the observer. Nevertheless, because of their artificial look, cartoons have a limited effect. Another approach for realizing TTS methods involves the use of video recordings of a talking person. These recordings are integrated into a computer program. The video approach looks more realistic than the use of cartoons. However, the utility of the video approach is limited to situations where all of the spoken text is known in advance and where sufficient storage space exists in memory for the video clips. These situations simply do not exist in the context of the more commonly employed TTS applications.
Three-dimensional modeling can also be used for many TTS applications. These models provide considerable flexibility because they can be altered in any number of ways to accommodate the expression of different speech and emotions. Unfortunately, these models are usually not suitable for automatic realization by a computer. The complexities of three-dimensional modeling are ever-increasing as present models are continually enhanced to accommodate a greater degree of realism. Over the last twenty years, the number of polygons in state-of-the-art three-dimensional synthesized scenes has grown exponentially. Escalated memory requirements and increased computer processing times are unavoidable consequences of these enhancements. To make matters worse, synthetic scenes generated from the most modern three-dimensional modeling techniques often still have an artificial look.
With a view toward decreasing memory requirements and computation times while preserving realistic images in TTS methodologies, practitioners have implemented various sample-based photorealistic techniques. These approaches generally involve storing whole frames containing pictures of the subject, which are recalled in the necessary sequence to form the synthesis. While this technique is simple and fast, it is too limited in versatility. That is, where the method relies on a limited number of stored frames to maintain compatibility with the finite memory capability of the computer being used, this approach cannot accommodate sufficient variations in head and facial characteristics to promote a believable photorealistic subject. The number of possible frames for this sample-based technique is consequently too limited to achieve a highly realistic appearance for most conventional computer applications.
FIG. 1 is a chart illustrating the various approaches used in TTS synthesis methodologies. The chart shows the tradeoff between realism and flexibility as a function of different approaches. The perfect model (block 130) would have complete flexibility because it could accommodate any speech or emotional cues whether or not known in advance. Likewise, the perfect model would look completely realistic, just like a movie screen. Not surprisingly, there are no perfect models.
As can be seen, cartoons (block 100) demonstrate the least amount of flexibility, since the cartoon frames are all predetermined, and as such, the speech to be tracked must be known in advance. Cartoons are also the most artificial, and hence the least realistic-looking. Movies (block 110) or video sequences provide for a high degree of realism. However, like cartoons, movies have little flexibility since their frames depend upon a predetermined knowledge of the text to be spoken. The use of three-dimensional modeling (block 120) is highly flexible, since it is fully synthetic and can accommodate any facial appearance and can be shown from any perspective (unlike models which rely on two dimensions). However, because of its synthetic nature, three-dimensional modeling still looks artificial and consequently scores lower on the realism axis.
Sample-based techniques (block 140) represent the optimal tradeoff, with a substantial amount of realism and also some flexibility. These techniques look realistic because facial movements, shapes, and colors can be approximated with a high degree of accuracy and because video images of live subjects can be used to create the sample-based models. Sample-based techniques are also flexible because a sufficient number of samples can be taken to exchange head and facial parts to accommodate a wide variety of speech and emotions. By the same token, these techniques are not perfectly flexible because memory considerations and computation times must be taken into account, which places practical limits on the number of samples used (and hence the appearance of precision) in a given application.
To date, no animation technique exists for generating lifelike characters that could be automatically realized by a computer and that would be perceived by an observer as completely natural. Practitioners who have nevertheless sought to approximate such techniques have met with some success. Where practitioners employ a limited range of views and actions in a sample-based TTS synthesis (thereby minimizing memory requirements and computation times), photorealistic synthesis is coming within reach of today's technology. For example, the practitioner may implement a method which relies on frontal views of the head and shoulders, with limited head movements of 30 degree rotations and modest translations. While such a method has a limited versatility, often applications exist which do not require greater capability (e.g., some computer-user interface applications). Limited photorealistic synthesis methods can be a viable alternative for such applications.
Sample-based methods for generating photo-realistic characters are described in currently-pending patent applications entitled "Multi-Modal System For Locating Objects In Images", Graf et al. U.S. patent application Ser. No. 08/752109, filed Nov. 20, 1996, and "Method For Generating Photo-realistic Animated Characters", Graf et al. U.S. patent application Ser. No. 08/869531, filed Jun. 6, 1997, each of which is hereby incorporated by reference as if fully set forth herein. These applications describe methods involving the capturing of samples which are decomposed into a hierarchy of shapes, each shape representing a part of the image. The shapes are then overlaid in a designated manner to form the whole image.
For a TTS application, samples of sound, movements and images are captured while the subject is speaking naturally. These samples are processed and stored in a library. Image samples are later recalled in synchrony with the sound and concatenated together to form the animation.
One of the most difficult problems involved in producing an animated talking head for a TTS application is generating sequences of mouth shapes that are smooth and that appear to truly articulate a spoken phoneme in synchrony with the sound with which it is associated. This problem derives largely from the effects of coarticulation. Coarticulation means that mouth shapes depend not only on the phoneme to be spoken, but also on the context in which the phoneme appears. More specifically, the mouth shape depends on the phonemes spoken before, and sometimes after, the phoneme to be spoken. Coarticulation effects give rise to the necessity to use different mouth shapes for the same phoneme, depending upon the context in which the phoneme is spoken.
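The context dependence described above can be illustrated with a minimal Python sketch. It is not taken from the invention: the table contents, the shape labels, and the fallback order are all hypothetical, chosen only to show how one phoneme may map to different mouth shapes depending on its neighbors.

```python
from typing import Dict, Optional, Tuple

# Key: (previous phoneme, phoneme, next phoneme). None acts as a wildcard.
Context = Tuple[Optional[str], str, Optional[str]]

# Hypothetical entries, for illustration only.
MOUTH_SHAPES: Dict[Context, str] = {
    ("u", "t", "u"): "t_rounded",   # "t" between rounded vowels
    ("i", "t", None): "t_spread",   # "t" after a spread vowel
    (None, "t", None): "t_neutral", # context-free default
}


def mouth_shape(prev: Optional[str], phoneme: str, nxt: Optional[str]) -> str:
    """Return the most context-specific mouth shape stored for a phoneme."""
    # Try the full context first, then progressively less specific keys.
    candidates = [
        (prev, phoneme, nxt),
        (prev, phoneme, None),
        (None, phoneme, nxt),
        (None, phoneme, None),
    ]
    for key in candidates:
        if key in MOUTH_SHAPES:
            return MOUTH_SHAPES[key]
    raise KeyError(f"no mouth shape stored for phoneme {phoneme!r}")
```

With this assumed table, the same phoneme "t" yields a rounded, spread, or neutral shape depending on which neighboring phonemes are supplied.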
Thus, the following needs exist in the art with respect to TTS technology: (1) the need for a sample-based methodology for generating talking heads to form an animated sequence which looks natural and which requires a minimal amount of memory and processing time, and thus can be automatically realized on a computer; (2) the need for such a methodology which has great flexibility in accommodating a multitude of facial appearances, mouth shapes, and emotions; and (3) the need for such a methodology which takes into account coarticulation effects.
Accordingly, an object of the invention is to provide a technique for generating lifelike, natural characters for a text-to-speech application that can be implemented automatically by a computer, including a personal computer.
Another object of the invention is to disclose a method for generating photo-realistic characters for a text-to-speech application that provides for smooth coarticulation effects in a practical and efficient model which can be used in a conventional TTS environment.
Another object of the invention is to provide a sample-based method for generating talking heads in TTS applications which is flexible, produces realistic images, and has reasonable memory requirements.
These and other objects of the invention are accomplished in accordance with the principles of the invention by providing a sample-based method for synthesizing talking heads in TTS applications which factors coarticulation effects into account. The method uses an animation library for storing parameters representing sample-based images which can be combined and/or overlaid to form a sequence of frames, and a coarticulation library for storing mouth parameters, phoneme transcripts, and timing information corresponding to phoneme sequences.
For sample-based synthesis, samples of sound, movements and images are captured while the subject is speaking naturally. The samples capture the characteristics of a talking person, such as the sound he or she produces when speaking a particular phoneme and the way in which he or she articulates transitions between phonemes. The image samples are processed and stored in a compact animation library.
In a preferred embodiment, image samples are processed by decomposing them into a hierarchy of segments, each segment representing a part of the image. The segments are called from the library as they are needed, and integrated into a whole image by an overlaying process.
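As a rough illustration of such an overlaying process, the following Python sketch pastes small segments onto a base image in order. The image representation (nested lists with None marking transparent pixels) and the function names overlay and compose are assumptions made for this example, not the representation used by the invention.

```python
from typing import List, Optional, Sequence, Tuple

Pixel = Optional[int]      # None marks a transparent pixel (assumed convention)
Image = List[List[Pixel]]  # rows of pixels


def overlay(base: Image, segment: Image, top: int, left: int) -> Image:
    """Return a copy of base with segment pasted at row top, column left."""
    out = [row[:] for row in base]  # leave the input image untouched
    for r, seg_row in enumerate(segment):
        for c, value in enumerate(seg_row):
            rr, cc = top + r, left + c
            inside = 0 <= rr < len(out) and 0 <= cc < len(out[rr])
            if value is not None and inside:
                out[rr][cc] = value
    return out


def compose(base: Image, layers: Sequence[Tuple[Image, int, int]]) -> Image:
    """Overlay each (segment, top, left) layer in order; later layers win."""
    frame = base
    for segment, top, left in layers:
        frame = overlay(frame, segment, top, left)
    return frame


# Usage: a face with a mouth segment, then a teeth segment on top of the mouth.
face = [[1] * 4 for _ in range(3)]
mouth = [[2, 2], [None, 2]]
teeth = [[3]]
frame = compose(face, [(mouth, 1, 1), (teeth, 1, 2)])
```

The layer order stands in for the hierarchy: segments listed later are drawn over those listed earlier, and transparent pixels let the underlying part show through.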
A coarticulation library is also maintained. Small sequences of phonemes are recorded including image samples, acoustic samples and timing information. From these samples, information is derived such as rules or equations which are used to characterize the mouth shapes. In one embodiment, specific mouth parameters are measured from the image samples comprising the phoneme sequence. These mouth parameter sets, which correspond to different phoneme sequences, are stored into the coarticulation library. Based on the mouth parameters, the animation sequences are synthesized in synchrony with the associated sound by concatenating corresponding image samples from the animation library. Alternatively, rules or equations derived from the phoneme sequence samples are stored in the coarticulation library and used to emulate the necessary mouth shapes for the animated synthesis.
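One way to picture the parameter-based embodiment is the Python sketch below. Everything specific in it is assumed for illustration: the two mouth parameters (lip width and lip opening), the numeric values, the library layouts, and the nearest-neighbor selection rule are hypothetical and are not stated in this description.

```python
import math
from typing import Dict, List, Sequence, Tuple

MouthParams = Tuple[float, float]  # assumed: (lip width, lip opening)

# Hypothetical animation library: image sample id -> measured mouth parameters.
ANIMATION_LIBRARY: Dict[str, MouthParams] = {
    "mouth_closed": (0.50, 0.00),
    "mouth_open": (0.55, 0.60),
    "mouth_rounded": (0.30, 0.25),
}

# Hypothetical coarticulation library: phoneme sequence -> one entry per
# phoneme, holding (duration in ms, target mouth parameters).
COARTICULATION_LIBRARY: Dict[Tuple[str, ...], List[Tuple[int, MouthParams]]] = {
    ("m", "a"): [(80, (0.50, 0.05)), (120, (0.56, 0.55))],
    ("m", "u"): [(80, (0.40, 0.05)), (120, (0.31, 0.20))],
}


def nearest_sample(target: MouthParams) -> str:
    """Pick the stored image sample whose parameters are closest to target."""
    return min(
        ANIMATION_LIBRARY,
        key=lambda name: math.dist(ANIMATION_LIBRARY[name], target),
    )


def synthesize(phonemes: Sequence[str]) -> List[Tuple[int, str]]:
    """Return (start time in ms, image sample id) pairs for a known sequence."""
    timeline = []
    start = 0
    for duration, params in COARTICULATION_LIBRARY[tuple(phonemes)]:
        timeline.append((start, nearest_sample(params)))
        start += duration  # the next sample begins when this phoneme ends
    return timeline
```

In this sketch the start times carry the stored timing information, so the selected image samples can be shown in step with the corresponding sound. The vowel that follows "m" changes the target parameters stored for "m" itself, which is where the coarticulation information enters.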