There has recently been increased interest in the development of text-to-audio-visual speech synthesis (TTAVS) systems, in which standard text-to-speech (TTS) synthesizers are augmented with a visual component in the form of an image of a talking face. This interest is driven by the possible deployment of such systems as visual desktop agents, digital actors, and virtual avatars. In addition, TTAVS systems may have potential uses in very low bandwidth video conferencing and special effects, and would also be of interest to psychologists who wish to study visual speech production and perception.
An important aspect which might be desired of these facial TTAVS systems is video realism: the ability of the final audio-visual output to look and sound exactly as if it were produced by a real human face that was recorded by a video camera.
Unfortunately, much of the recent work in this field falls short of producing the impression of video realism. The reason for this, the inventors believe, is that most of the current TTAVS systems have chosen to integrate 3D graphics-based facial models with the audio speech synthesis. See M. M. Cohen and D. W. Massaro, "Modeling coarticulation in synthetic visual speech," in Models and Techniques in Computer Animation, pages 139-156, N. M. Thalmann and D. Thalmann, editors, Springer-Verlag, Tokyo, 1993. See also B. LeGoff and C. Benoit, "A text-to-audio-visual speech synthesizer for French," in Proceedings of the International Conference on Spoken Language Processing (ICSLP), Philadelphia, USA, October 1996. Although it is possible to improve visual realism through texture-mapping techniques, it seems that there is an inherent difficulty in modeling both the complex visual appearance of a human face and the underlying mouth movement dynamics using 3D graphics-based methods.
Beyond the problem of modeling the underlying mouth movement dynamics, there is the difficulty of constructing a visual speech stream: it is not sufficient to simply display the viseme images in sequence. Doing so would create the disturbing illusion of very abrupt mouth movement, since the viseme images differ significantly from each other in shape. Consequently, a mechanism for transitioning from each viseme image to every other viseme image is needed, and this transition must be smooth and realistic. This need prompted a study of what is known as morphing, a technique adopted to create smooth and realistic viseme transitions.
Morphing was first popularized by Beier & Neely in the context of generating transitions between different faces for Michael Jackson's Black or White music video. See T. Beier and S. Neely, "Feature-based Image Metamorphosis," in SIGGRAPH '92 Proceedings, pages 35-42, Chicago, Ill., 1992. The transformation between two images occurs as a warp of the first image into the second, a similar inverse warp of the second image into the first, and a final cross-dissolve or blend of the warped images. It should be noted that those involved in the early studies noticed the viability of using morphing as a method of transitioning between various facial pose, expression, and mouth position imagery.
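The three steps just described (warp, inverse warp, cross-dissolve) can be illustrated with a minimal sketch. It is not the Beier & Neely feature-based algorithm: it assumes, purely for illustration, that the correspondence between the two images is already available as two dense per-pixel offset fields, and it uses nearest-neighbour sampling. The names `backward_warp`, `morph`, `field_a`, and `field_b` are hypothetical.

```python
import numpy as np


def backward_warp(image, field):
    """Build an output image whose pixel (x, y) is sampled from `image`
    at (x + field[y, x, 0], y + field[y, x, 1]), nearest neighbour,
    with sample positions clamped to the image border."""
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + field[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + field[..., 1]).astype(int), 0, h - 1)
    return image[src_y, src_x]


def morph(img_a, img_b, field_a, field_b, t):
    """Intermediate frame for t in [0, 1].

    Assumed inputs (not from the source text):
      field_a: sampling offsets that would align img_a with img_b
      field_b: sampling offsets that would align img_b with img_a
    """
    # Warp the first image part of the way toward the second.
    warped_a = backward_warp(img_a, t * field_a)
    # Inverse warp: move the second image part of the way toward the first.
    warped_b = backward_warp(img_b, (1.0 - t) * field_b)
    # Cross-dissolve (blend) the two warped images.
    return (1.0 - t) * warped_a + t * warped_b
```

With t = 0 the result equals the first image and with t = 1 it equals the second. At intermediate values the two warps bring corresponding features to a common position before blending, which is what distinguishes a morph from a plain cross-dissolve.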
The difficulty with traditional morphing approaches is that the specification of the warp between the images requires the definition of a set of high-level features. These features serve to ensure that the warping process preserves the desired correspondence between the geometric attributes of the objects to be morphed. For example, if one were morphing between two faces, one would want the eyes in one face to map to the eyes in the other face, the mouth in one face to map to the mouth in the other face, and so on. Consequently, the correspondence between these eyes and mouth features would need to be specified.
When morphing/warping is done by hand, however, this feature specification process can become quite tedious and complicated, especially when a large amount of imagery is involved. In addition, the process of specifying the feature regions usually requires hand-coding a large number of ad-hoc geometric primitives, such as line segments, corner points, arches, circles, and meshes. Beier & Neely, in fact, state explicitly that the specification of the correspondence between images constitutes the most time-consuming aspect of the morph. Therefore, there is a need to automate and improve this traditional method of morphing as it is utilized in making a photo-realistic talking facial display.