1. Field of the Invention
The present invention generally relates to visual speech synthesis and, more particularly, to a method of implementing an audio-driven facial animation system in any language using a speech recognition system and visemes of a different language.
2. Background Description
Audio-driven facial animation is an interesting and evolving technique in the field of human-computer interaction. The realization of a natural and friendly interface is very important in human-computer interaction. Speech recognition and computer lip-reading have been developed as means of input for information interaction with the machine. It is also important to provide a natural and friendly means of rendering the information. Visual speech synthesis is very important in this respect, as it can provide various kinds of animated computer agents which look very realistic. Furthermore, it can be used for distance learning applications, where it can obviate the transmission of video. It can also be a useful tool for hearing-impaired people to compensate for the lack of auditory information.
Techniques exist for synthesizing speech given text as input to the system. These text-to-speech synthesizers work by producing a phonetic alignment of the text to be pronounced and then generating smooth transitions between the corresponding phones to get the desired sentence. See R. E. Donovan and E. M. Eide, “The IBM Trainable Speech Synthesis System”, International Conference on Speech and Language Processing, 1998. Recent work in bimodal speech recognition uses the fact that the audio and corresponding video signals have dependencies which can be exploited to improve speech recognition accuracy. See T. Chen and R. R. Rao, “Audio-Visual Integration in Multimodal Communication”, Proceedings of the IEEE, vol. 86, no. 5, May 1998, pp. 837-852, and E. D. Petajan, B. Bischoff, D. Bodoff, and N. M. Brooke, “An Improved Automatic Lipreading System to Enhance Speech Recognition”, Proc. CHI, 1988, pp. 19-25. A viseme-to-phoneme mapping is required to convert the score from video space to the audio space. Using such a mapping and text-to-speech synthesis, a text-to-video synthesizer can be built. This synthesis or facial animation can be driven by text or by speech audio, as the application may require. In the latter case, the phonetic alignment is generated from the audio with the help of the true word string representing the spoken words.
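The relationship between a phonetic alignment and the visemes that drive the animation can be illustrated with a minimal sketch. The phone symbols, viseme names, and the `PHONE_TO_VISEME` table below are hypothetical and chosen only for illustration; they are not taken from any system cited here. The sketch assumes a many-to-one table from phones to visemes and a phonetic alignment given as (label, start, end) segments.

```python
from typing import List, Tuple

# Hypothetical many-to-one phone-to-viseme table (illustrative only).
PHONE_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "aa": "open", "ae": "open",
    "sil": "neutral",
}

Segment = Tuple[str, float, float]  # (label, start_sec, end_sec)


def phones_to_visemes(alignment: List[Segment]) -> List[Segment]:
    """Convert a phonetic alignment into a visemic alignment.

    Adjacent segments that map to the same viseme are merged into one.
    Phones missing from the table fall back to "neutral" (an assumption).
    """
    out: List[Segment] = []
    for phone, start, end in alignment:
        viseme = PHONE_TO_VISEME.get(phone, "neutral")
        if out and out[-1][0] == viseme and out[-1][2] == start:
            # Extend the previous segment instead of starting a new one.
            out[-1] = (viseme, out[-1][1], end)
        else:
            out.append((viseme, start, end))
    return out


alignment = [
    ("sil", 0.0, 0.1), ("m", 0.1, 0.2), ("aa", 0.2, 0.4),
    ("p", 0.4, 0.5), ("b", 0.5, 0.55),
]
visemes = phones_to_visemes(alignment)
for segment in visemes:
    print(segment)
```

Because several phones share one mouth shape, the visemic alignment is usually shorter than the phonetic one; here the final "p" and "b" segments collapse into a single bilabial segment.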
Researchers have tried various ways of synthesizing visual speech from a given audio signal. In the simplest method, vector quantization is used to divide the acoustic vector space into a number of subspaces (generally equal to the number of phones), and the centroid of each subspace is mapped to a distinct viseme. At synthesis time, the nearest centroid is found for the incoming audio vector and the corresponding viseme is chosen as the output. In F. Lavagetto, Arzarello and M. Caranzano, “Lipreadable Frame Automation Driven by Speech Parameters”, International Symposium on Speech, Image Processing and Neural Networks, 1994, ISSIPNN, the authors use Hidden Markov Models (HMMs) which are trained using both audio and video features as follows. During training, Viterbi alignment is used to get the most likely HMM state sequence for a given utterance. Then, for each HMM state, all the corresponding image frames are collected and the average of their visual parameters is assigned to that state. At synthesis time, the input speech is aligned to the most likely HMM state sequence using Viterbi decoding. The image parameters corresponding to the most likely HMM state sequence are retrieved, and this visual parameter sequence is animated with proper smoothing.
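The vector quantization method described above can be sketched as a nearest-centroid lookup. This is a minimal illustration, not the implementation of any cited system: the two-dimensional centroids, the viseme names, and the function names are all hypothetical, and a real system would use higher-dimensional acoustic features and centroids learned from data.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def nearest_centroid(vector: Vector, centroids: List[Vector]) -> int:
    """Return the index of the centroid closest to `vector` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vector, centroids[i]))


def synthesize_visemes(
    frames: List[Vector],
    centroids: List[Vector],
    centroid_to_viseme: Dict[int, str],
) -> List[str]:
    """Map each incoming audio frame to the viseme of its nearest centroid."""
    return [centroid_to_viseme[nearest_centroid(f, centroids)] for f in frames]


# Hypothetical codebook: one centroid per acoustic subspace.
CENTROIDS = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
CENTROID_TO_VISEME = {0: "neutral", 1: "open", 2: "rounded"}

frames = [(0.1, -0.2), (0.9, 1.2), (3.5, 0.3), (0.4, 0.3)]
viseme_sequence = synthesize_visemes(frames, CENTROIDS, CENTROID_TO_VISEME)
print(viseme_sequence)
```

Each frame is classified independently, so the output can switch visemes from frame to frame; this is one reason the HMM-based methods, which decode a whole state sequence, are followed by smoothing before animation.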
Recently, co-pending patent application Ser. No. 09/384,763 described a novel way of generating visemic alignments from an audio signal which makes use of viseme-based HMMs. In this approach, all the audio vectors corresponding to a given viseme are merged into a single class. This viseme-based audio data is then used to train viseme-based audio HMMs. At synthesis time, the input speech is aligned with the viseme-based HMM state sequence. The image parameters corresponding to this viseme-based HMM state sequence are then animated with the required smoothing. See also T. Ezzat and T. Poggio, “MikeTalk: A Talking Facial Display Based on Morphing Visemes”, Proceedings of IEEE Computer Animation ’98, Philadelphia, Pa., June 1998, pp. 96-102.
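The merging step, in which audio vectors belonging to phones of the same viseme are pooled into one class, can be sketched as follows. This is only an illustration under stated assumptions: the phone-to-viseme table and the toy feature vectors are hypothetical, and the per-class mean computed at the end is a simple stand-in for the HMM training that the co-pending application performs on the pooled data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]

# Hypothetical phone-to-viseme table (illustrative only).
PHONE_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
}


def pool_by_viseme(labeled_frames: List[Tuple[str, Vector]]) -> Dict[str, List[Vector]]:
    """Merge phone-labeled audio vectors into one class per viseme."""
    classes: Dict[str, List[Vector]] = defaultdict(list)
    for phone, vector in labeled_frames:
        classes[PHONE_TO_VISEME[phone]].append(vector)
    return dict(classes)


def class_means(classes: Dict[str, List[Vector]]) -> Dict[str, Vector]:
    """Per-class mean vector; a stand-in for training a model on each class."""
    return {
        viseme: tuple(sum(component) / len(vectors) for component in zip(*vectors))
        for viseme, vectors in classes.items()
    }


labeled_frames = [
    ("p", (1.0, 2.0)), ("b", (3.0, 4.0)), ("f", (0.0, 1.0)),
    ("m", (2.0, 0.0)), ("v", (2.0, 3.0)),
]
classes = pool_by_viseme(labeled_frames)
means = class_means(classes)
print(means)
```

Pooling reduces the number of classes the audio models must separate, since phones that look alike on the lips no longer need to be told apart acoustically.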
All of the above approaches require training a speech recognition system, which is used to generate the alignment of the input speech needed for synthesis. Further, these approaches require a speech recognition system in the language in which the audio is provided in order to get the time alignment for the phonetic sequence of the audio signal. However, building a speech recognition system is a very tedious and time-consuming task.
It is therefore an object of the present invention to provide a novel scheme to implement a language independent system for audio-driven facial animation given the speech recognition system for just one language; e.g., English. The same method can also be used for text to audiovisual speech synthesis.
The invention is based on the recognition that once the alignment is generated, the mapping and the animation have hardly any language dependency in them. Translingual visual speech synthesis can be achieved if the first step, alignment generation, can be made language independent. In the following, we propose a method to perform translingual visual speech synthesis; that is, given a speech recognition system for one language (the base language), the invention provides a method of synthesizing video with speech of any other language (the novel language) as the input.