1. Field of the Invention
The present invention relates to synthesis of a photo-realistic video sequence and more specifically to an improved system and method of selecting image frames from a multimedia database to generate the video sequence.
2. Discussion of Related Art
Computer-generated talking heads, or virtual agents, are becoming increasingly popular as technology improves their realistic appearance and sound. Recently there has been increasing interest in a sample-based, or data-driven, face animation approach. Such systems record a large corpus of talking head video sequences and create an inventory of video frames by labeling and indexing these images. During the synthesis of visual text-to-speech (TTS), the system retrieves the appropriate video frames and concatenates them into a video sequence. The process of labeling and indexing images renders the visual TTS process costly and time-consuming.
The difficulty with the concatenative approach is that selecting frames from a large corpus can be very time-consuming, depending on the size of the corpus and the length of the video sequence to be synthesized. Many companies that produce natural-looking talking heads must use recorded samples of mouth movements and therefore face the same problem of how to select frames from a large corpus.
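By way of illustration only, the labeling, indexing, retrieval, and concatenation steps described above may be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of any particular system: the function names `build_inventory` and `synthesize`, the use of integer frame numbers, and the greedy rule that prefers the candidate frame nearest to the successor of the previously chosen frame are all hypothetical simplifications. The sketch also suggests why selection cost grows with corpus size and sequence length, since every target label requires a search over all candidate frames carrying that label.

```python
from collections import defaultdict


def build_inventory(labeled_frames):
    """Index frame numbers by label so retrieval does not scan the whole corpus.

    labeled_frames: iterable of (frame_number, label) pairs (hypothetical format).
    """
    inventory = defaultdict(list)
    for frame_number, label in labeled_frames:
        inventory[label].append(frame_number)
    return inventory


def synthesize(inventory, target_labels):
    """Greedily pick one frame per target label and concatenate the picks.

    Assumption: a frame recorded right after the previously chosen frame is
    the smoothest continuation, so the candidate closest to (previous + 1)
    is preferred. Real systems would use a richer visual cost.
    """
    sequence = []
    for label in target_labels:
        candidates = inventory.get(label)
        if not candidates:
            raise KeyError(f"no frame recorded for label {label!r}")
        if sequence:
            wanted = sequence[-1] + 1
            # Linear search over every candidate with this label.
            best = min(candidates, key=lambda f: abs(f - wanted))
        else:
            best = candidates[0]
        sequence.append(best)
    return sequence
```

In this sketch, each target label costs one pass over its candidate list, so the total work is roughly the sequence length multiplied by the average number of frames per label.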
As video synthesis produces more and more photo-realistic heads, an automatic objective measure is needed to quantify the synthesis result. Up to now, quality assessment has been done mostly through subjective tests, where multiple people look at the results. This is a tedious and time-consuming process that cannot practically be used to tune the synthesis algorithms.
When synthesizing a talking head, the complementary technology to the visual aspect of the talking head is the audio portion, or the TTS technology. Concatenative speech synthesis has become a popular approach in TTS systems. By selecting and concatenating recorded speech segments from a database, this approach produces more natural-sounding speech than other methods such as model-based synthesis. In contrast, most visual speech synthesis systems, or speech-synchronous face animation systems, use a synthetic 3D face model with texture mapping. This approach requires only a small corpus of texture images, and the animation process involves only texture warping, which is relatively efficient. However, it is unable to capture all the nuances of a real talking head. Moreover, warping the textures sometimes generates artifacts around the mouth region, which make the head look artificial.
Yet another disadvantage of the traditional methods is that they fail to enable real-time generation of photo-realistic visual TTS. The requirement of handcrafted labeling and indexing of images eliminates any opportunity to generate real-time visual TTS. As mentioned above, current methods that speed up the generation of visual TTS apply model-based techniques that look artificial and are not convincing. These deficiencies further preclude an acceptable application of visual TTS for interactive virtual agent scenarios where the agent needs to respond in real time to the interactive input from a user.
Further, until visual TTS technologies improve, their acceptance will be hindered, given the poor impression that an unacceptable experience with a virtual agent leaves on a company's customers. Thus, the unmet challenge in visual TTS technology is how to provide photo-realistic talking heads in real time.