The present invention relates to a method for modeling three-dimensional objects and, more particularly, to modeling three-dimensional objects using a data-driven approach with separate three-dimensional image planes so as to synthesize three-dimensional, photo-realistic animations.
Animated characters, and in particular, talking heads, are playing an increasingly important role in computer interfaces. An animated talking head immediately attracts the attention of a user, can make a task more engaging, and can add entertainment value to an application. Seeing a face makes many people feel more comfortable interacting with a computer. With respect to learning tasks, several researchers have reported that animated characters can increase the attention span of the user, and hence improve learning results. When used as avatars, lively talking heads can make an encounter in a virtual world more engaging. In today's practice, such heads are usually either recorded video clips of real people or cartoon characters lip-synching synthesized text.
Even though a cartoon character or robot-like face may provide an acceptable video image, it has been found that people respond to nothing as strongly as a real face. For an educational program, for example, a real face is preferable. A cartoon face is associated with entertainment, not to be taken too seriously. An animated face of a competent teacher, on the other hand, can create an atmosphere conducive to learning and therefore increase the impact of such educational software.
Generating animated talking heads that look like real people is a very challenging task, and so far all synthesized heads are still far from reaching this goal. To be considered natural, a face has to be not just photo-realistic in appearance, but must also exhibit realistic head movements, emotional expressions, and proper plastic deformations of the lips synchronized with the speech. Humans are trained from birth to recognize faces and facial expressions and are therefore highly sensitive to the slightest imperfections in a talking face.
Many different systems exist in the prior art for modeling the human head, achieving various degrees of photo-realism and flexibility, but relatively few have demonstrated a complete talking head functionality. Most approaches use 3D meshes to model in fine detail the shape of the head. See, for example, an article entitled "Automatic 3D Cloning and Real-Time Animation of a Human Face", by M. Escher et al., appearing in the Proceedings of Computer Animation, pp. 58-66, 1997. These models are created by using advanced 3D scanning techniques, such as a CyberWare range scanner, or are adapted from generic models using either optical flow constraints or facial features labeling. Some of the models include information on how to move vertices according to physical properties of the skin and the underlying muscles. To obtain a natural appearance, they typically use images of a person that are texture-mapped onto the 3D model. Yet, when plastic deformations occur, the texture images are distorted, resulting in visible artifacts. Another difficult problem is modeling of hair and such surface features as grooves and wrinkles. These are important for the appearance of a face, and yet are only marginally (if at all) modeled by most of the prior art systems. The incredible complexity of plastic deformations in talking faces makes precise modeling extremely difficult. Simplification of the models results in unnatural appearances and synthetic-looking faces.
An alternative approach to the 3D modeling is based on morphing between 2D images. These techniques can produce photo-realistic images of new shapes by interpolating between two existing shapes. Morphing of a face requires precise specifications of the displacements of many points in order to guarantee that the results look like real faces. Most morphing techniques therefore rely on a manual specification of the morph parameters, as discussed in the article "View Morphing", by S. M. Seitz et al., appearing in Proceedings of SIGGRAPH '96, pp. 21-30, July 1996. Others have proposed image analysis methods where the morph parameters are determined automatically, based on optical flow. While this approach gives an acceptable solution to generating new views from a set of reference images, the proper reference images must still be found to initialize the process. Moreover, since the techniques are based on 2D images, the range of expressions and movements they can produce is rather limited.
Recently, there has been a surge of interest in sample-based techniques (also referred to as data-driven) for synthesizing photo-realistic scenes. These techniques generally start by observing and collecting samples that are representative of a signal to be modeled. The samples are then parameterized so that they can be recalled at synthesis time. Typically, samples are processed as little as possible to avoid distortions. One of the early successful applications of this concept is QuickTime® VR, as discussed in the article "QuickTime® VR - An Image-Based Approach to Virtual Environment Navigation", by E. L. Chen et al., appearing in Proceedings SIGGRAPH '95, pp. 29-38, July 1995. The Chen et al. system allows panoramic viewing of scenes as well as examining objects from all angles. Samples are parameterized by the direction from which they were recorded and stored in a two-dimensional database.
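By way of illustration only, the idea of recalling samples that are parameterized by recording direction may be sketched as follows. The sketch assumes a database keyed by a pair of angles (pan and tilt, in degrees) and a simple nearest-neighbor lookup; the function name `nearest_view`, the key layout, and the distance measure are hypothetical and are not taken from the Chen et al. article.

```python
def nearest_view(database, pan, tilt):
    """Return the stored sample whose recording direction is closest to (pan, tilt).

    `database` maps (pan_deg, tilt_deg) tuples to samples. This layout is an
    assumption made for the sketch, not a description of any real system.
    """
    def pan_gap(a, b):
        # Pan wraps around at 360 degrees, so measure the shorter way round.
        d = abs(a - b) % 360
        return min(d, 360 - d)

    best_key = min(
        database,
        key=lambda k: pan_gap(k[0], pan) ** 2 + (k[1] - tilt) ** 2,
    )
    return database[best_key]


# Hypothetical two-dimensional database: three views recorded at tilt 0.
views = {(0, 0): "front", (90, 0): "right", (350, 0): "slightly_left"}
print(nearest_view(views, 352, 0))
```

Because the samples are only looked up and not modified, they reach the display with as little processing as possible, which is the property the sample-based techniques above rely on.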
Recently, other researchers have explored ways of sampling both texture and 3D geometry of faces, producing realistic animations of facial expressions. One example of such sampling is discussed in an article entitled "Synthesizing Realistic Facial Expressions from Photographs", by F. Pighin et al., appearing in Proceedings SIGGRAPH '98, pp. 75-84, July 1998. The Pighin et al. system uses multiple cameras or facial markers to derive the 3D geometry and texture of the face in each frame of video sequences. However, deriving the exact geometry of such details as grooves, wrinkles, lips and tongue as they undergo plastic deformations proves a difficult task. Extensive manual measuring in the images is required, resulting in a labor-intensive capture process. Textures are processed extensively to match the underlying 3D model and may lose some of their natural appearance. None of these prior art systems has yet been demonstrated for speech reproduction.
A talking-head synthesis technique based on recorded samples that are selected automatically has been proposed in the article "Video Rewrite: Driving Visual Speech with Audio", by C. Bregler et al., appearing in Proceedings SIGGRAPH '97, pp. 353-360, July 1997. The Bregler et al. system can produce videos of real people uttering text they never actually said. It uses video snippets of tri-phones (three sequential phonemes) as samples. Since these video snippets are parameterized with the phoneme sequence, the resulting database is very large. Moreover, this parameterization can only be applied to the mouth area, precluding the use of other facial parts, such as eyes and eyebrows, which are known to carry important conversational cues.
T. Ezzat et al., in an article entitled "MikeTalk: A Talking Facial Display Based on Morphing Visemes", appearing in the Proceedings of Computer Animation, pp. 96-102, June 1998, describe a sample-based talking head system that uses morphing to generate intermediate appearances of mouth shapes from a very small set of manually selected mouth samples. While morphing generates smooth transitions between mouth samples, this system does not model the whole head and does not synthesize head movements and facial expressions. Others have presented a sample-based talking head that uses several layers of 2D bit-planes as a model. Neither facial parts nor the whole head are modeled in 3D and, therefore, the system is limited in what new expressions and movements it can synthesize.
Thus, a need remains in the art for a method of modeling three-dimensional objects in general and, particularly, for an animated talking head model that is photo-realistic and is capable of producing whole head movements and realistic facial expressions for a variety of computer graphic applications.
The need remaining in the prior art is addressed by the present invention, which relates to a method for modeling three-dimensional objects and, more particularly, to modeling three-dimensional objects using a data-driven approach with separate three-dimensional image planes so as to synthesize three-dimensional, photo-realistic animations.
In accordance with the present invention, a data-driven approach is used, where a three-dimensional object, such as a talking person, is defined by a set of three-dimensional planes that approximate the shape and surrounding area of the object. The object is then recorded on video and image recognition is applied to automatically extract bitmaps of each three-dimensional plane. In the case of modeling a human face, the set of three-dimensional planes corresponds to a set of pre-defined facial parts. These bitmaps are then normalized and parameterized before being entered into a database. For the synthesis of a human head, a text-to-speech synthesizer provides the audio track, as well as a phoneme string from which motion trajectories are calculated for all the facial parts, including the whole head. These trajectories provide the parameters for selecting the proper bitmaps from the database. Smoothing and blending are applied to these 'strings' of bitmaps to eliminate hard transitions and create a seamless animation for each facial part. The result is a talking head that resembles very closely the person who was originally recorded.
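By way of illustration only, the selection and blending steps described above may be sketched for a single facial part as follows. The sketch assumes that each stored bitmap is indexed by one scalar parameter (for example, a hypothetical mouth-opening value between 0 and 1), that bitmaps are flat lists of pixel intensities, and that hard transitions are softened by a linear cross-fade. These choices, and the names `select_samples`, `blend` and `smooth_sequence`, are assumptions made for the example and are not the parameterization or blending method of the invention.

```python
def select_samples(database, trajectory):
    """For each trajectory value, pick the stored sample with the closest parameter."""
    return [
        min(database, key=lambda s: abs(s["param"] - target))
        for target in trajectory
    ]


def blend(a, b, alpha):
    """Linear cross-fade between two bitmaps of equal length (alpha in [0, 1])."""
    return [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]


def smooth_sequence(bitmaps, steps=1):
    """Insert `steps` cross-faded frames wherever consecutive bitmaps differ."""
    if not bitmaps:
        return []
    out = [bitmaps[0]]
    for prev, nxt in zip(bitmaps, bitmaps[1:]):
        if prev != nxt:
            for k in range(1, steps + 1):
                out.append(blend(prev, nxt, k / (steps + 1)))
        out.append(nxt)
    return out


# Hypothetical database for one facial part: closed and open mouth.
mouth_db = [
    {"param": 0.0, "bitmap": [0.0, 0.0]},
    {"param": 1.0, "bitmap": [1.0, 1.0]},
]
trajectory = [0.1, 0.2, 0.9]  # assumed output of the trajectory calculation
chosen = [s["bitmap"] for s in select_samples(mouth_db, trajectory)]
frames = smooth_sequence(chosen, steps=1)
print(len(frames), frames[2])
```

In this sketch each facial part would run through the same two steps with its own database and trajectory, so that the per-part sequences can be composed into the final animation.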
It is an aspect of the present invention that the use of sample-based modeling from video images preserves a high level of detail in the appearance of the object. For example, by recording real movements of a head and lips, and reusing them for the synthesis, a model is obtained that is able to produce realistic lip and head movements, as well as emotional expressions.
In defining the inventive modeling of a three-dimensional object, the flexibility of 3D models is combined with the realism of images. A key problem with prior art sample-based techniques, as discussed above, is controlling the number of image samples that need to be recorded and stored. For example, a face's appearance changes due to talking, emotional expressions and head orientation, leading to a combinatorial explosion in the number of different appearances. In accordance with the teachings of the present invention, the number of samples is kept at a manageable level by dividing the object into a hierarchy of parts (each part defined by a three-dimensional plane), where each part is modeled independently. This independent modeling results in a compact model that can create animations with head movements, speech articulation and different emotional expressions.
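By way of illustration only, the effect of modeling parts independently on the number of stored samples may be shown with a short calculation. If whole-face images were stored for every combination of part appearances, the count would be the product of the per-part counts; if each part is stored on its own, the count is their sum. The per-part figures below are hypothetical and chosen only to make the arithmetic visible.

```python
from math import prod


def joint_sample_count(states_per_part):
    """Samples needed if every combination of part appearances is stored whole."""
    return prod(states_per_part.values())


def independent_sample_count(states_per_part):
    """Samples needed if each part is stored and modeled on its own."""
    return sum(states_per_part.values())


# Hypothetical numbers of distinct appearances per part.
states = {"mouth": 50, "eyes": 10, "eyebrows": 5, "head_pose": 20}
print(joint_sample_count(states))        # 50 * 10 * 5 * 20 = 50000
print(independent_sample_count(states))  # 50 + 10 + 5 + 20 = 85
```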
In a preferred embodiment of the present invention for generating a "talking head" model, the head and its facial parts are first modeled and a suitable sampling process is used to capture the data. To capture realistic speech postures accurately, human subjects speak short text sequences in front of a camera. A face recognition system then automatically analyzes this video footage and selects the proper samples. The needed bitmaps are then extracted from the video frames and normalized and parameterized for easy access in a database. Finally, the synthesis of the talking head animation is driven by a string of phonemes to create the photo-realistic talking head.
Other and further aspects and embodiments of the present invention will become apparent during the course of the following discussion and by reference to the accompanying drawings.