This invention relates generally to computer-generated synthetic visual speech, otherwise known as facial animation or lipsyncing. More specifically, this invention relates to methods and devices for generating synthetic visual speech based on coarticulation. This invention further relates to methods of using synthetic visual speech.
The natural production of human speech includes both auditory and visual components. The basic unit of sound for audible speech is the phoneme. Phonemes are the smallest units of speech capable of independent auditory recognition. Similarly, visual speech is made up of visemes. Visemes are the visible corollary to phonemes. More specifically, a viseme is a visual speech representation defined by the external appearance of articulators (i.e., lips, tongue, teeth, etc.) during articulation of a corresponding phoneme. More than one phoneme may be associated with a single viseme, because many phonemes appear the same visually. Therefore, phonemes have a many-to-one relationship with visemes. Phonemes and visemes form the fundamental building blocks of visual speech synthesis.
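The many-to-one relationship between phonemes and visemes can be sketched as a simple lookup table. The table below is a minimal illustration only: the viseme names and the particular groupings are assumptions chosen for the example and are not taken from this description.

```python
# Hypothetical phoneme-to-viseme table; names and groupings are illustrative only.
PHONEME_TO_VISEME = {
    "p": "lips_closed",
    "b": "lips_closed",
    "m": "lips_closed",
    "f": "lip_to_teeth",
    "v": "lip_to_teeth",
    "o": "rounded_open",
    "t": "tongue_to_ridge",
    "d": "tongue_to_ridge",
}


def visemes_for(phonemes):
    """Map a phoneme sequence to the corresponding viseme sequence."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]
```

Because several keys share one value, distinct phoneme sequences can yield the same viseme sequence, which is why the mapping cannot be inverted.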
Several conventional lipsyncing systems are available which produce synthetic visual speech in a variety of different ways. For example, some of these systems use a binary (on/off) method to move between visemes. In the binary method, the image of a first viseme appears until it is switched abruptly to the image of a second viseme. In the binary approach, therefore, there is no transitioning between visemes; a viseme is either completely visible or not at all visible at a given time. When visually depicting a sound moving from an /o/ to a /t/, as in the word "hot," for instance, the binary method displays the viseme corresponding to the /o/ until it abruptly changes to the viseme associated with the /t/. The result is very unrealistic, cartoon-like lipsyncing. An additional drawback of conventional binary systems is that they are generally limited to having only a few visemes to represent all of the possible sounds.
A better prior art approach to visual speech synthesis uses inbetweening (linear-type morphing) to transition between visemes. Morphing is a common technique for driving a 3D animation in which key frames are used to define particular configurations of a 3D model at given points in time. Morphing specifically refers to the process of interpolating between defined key frames over time to gradually transform one shape into another shape. Conventional lipsyncing systems sometimes use inbetweening (or linear interpolation based morphing) to approximate the contributions of multiple visemes to the overall appearance of the articulators at a given point in time during a viseme transition. These systems, therefore, more gradually transition between visemes by linearly combining the visemes together during the transition period. Despite the improvements that inbetweening offers over binary systems, it is still fairly unrealistic and does not accurately account for the mechanics of real speech.
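The difference between the binary method and inbetweening can be sketched in a few lines. This is a minimal illustration, assuming each viseme shape is a flat list of coordinates; the function names `binary_weights`, `linear_weights`, and `blend` are hypothetical and do not come from any of the systems discussed.

```python
def binary_weights(t, switch_time):
    """Binary method: the first viseme is fully shown until switch_time."""
    return (1.0, 0.0) if t < switch_time else (0.0, 1.0)


def linear_weights(t, start, end):
    """Inbetweening: weights cross-fade linearly over the interval [start, end]."""
    if end <= start:
        raise ValueError("end must be greater than start")
    # Clamp the normalized time so weights stay in [0, 1] outside the interval.
    alpha = min(1.0, max(0.0, (t - start) / (end - start)))
    return (1.0 - alpha, alpha)


def blend(shape_a, shape_b, weights):
    """Linearly combine two key-frame shapes using the given pair of weights."""
    w_a, w_b = weights
    return [w_a * a + w_b * b for a, b in zip(shape_a, shape_b)]
```

With `linear_weights`, the shape at the midpoint of a transition is the plain average of the two key frames, regardless of which sounds are involved. That context-blindness is the limitation noted above.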
A still more realistic prior art approach to the production of synthetic visual speech is parametric modeling. In parametric modeling, a specific, detailed, 3D model has parameters associated with each of the parts of the face, most importantly the articulators. The whole model is defined in terms of multiple parameters, and the position of every point on the 3D model is defined by an extensive formula. Systems using parametric modeling (such as the Baldi system developed at the University of California, Santa Cruz (UCSC)) have been better able to take into account contextual influences of natural visual speech production and are thereby able to produce more realistic-looking visual speech.
Unfortunately, however, parametric modeling requires the construction of a very complex graphical model. Consequently, a massive amount of work is required to create or modify these models. Also, because each of the parameters is defined in terms of a specific equation developed for that 3D model only, parametric modeling systems are 3D model dependent. These systems cannot be easily adapted for use with other 3D models. The difficulty of modifying the system to drive other 3D models makes parametric modeling rigid, complex, and expensive. Parametric modeling, therefore, does not offer a general purpose solution to the problem of providing realistic facial animation.
U.S. Pat. No. 5,657,426 (the '426 patent) to Waters et al. describes various methods of producing synchronized synthetic audio and visual speech which attempt to take into account factors influencing the natural production of human speech. The '426 patent attempts to account for these factors by interpolating between visemes using non-linear functions, such as cosine functions or equations based on Newtonian laws of motion.
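As a rough illustration of non-linear interpolation, the sketch below uses a generic cosine ease between two visemes. It is not the formula of the '426 patent, whose exact equations are not reproduced here; the function name `cosine_weight` and its parameters are assumptions made for this example.

```python
import math


def cosine_weight(t, start, end):
    """Weight of the incoming viseme under a generic cosine ease over [start, end]."""
    if end <= start:
        raise ValueError("end must be greater than start")
    # Normalized, clamped position within the transition.
    alpha = min(1.0, max(0.0, (t - start) / (end - start)))
    # Slow start, fast middle, slow finish.
    return 0.5 * (1.0 - math.cos(math.pi * alpha))
```

Compared with a linear ramp, the incoming viseme's weight changes slowly near the start and end of the transition and quickly in the middle.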
Other relevant prior art publications include Massaro, D. W., Beskow, J., Cohen, M. M., Fry, C. L., Rodriguez, T., "Picture My Voice: Audio to Visual Speech Synthesis using Artificial Neural Networks," Proceedings of Auditory-Visual Speech Processing, Santa Cruz, Calif., August 1999; and Pelachaud, C., "Communication and Coarticulation in Facial Animation," Doctoral Dissertation, University of Pennsylvania, 1991. An extensive collection of references to facial animation (lipsyncing) related articles, developments, and general information can be found at the University of California, Santa Cruz internet website: http://mambo.ucsc.edu/ps1/fan.html.
The "Picture My Voice" article by Massaro, D. W., et al. describes a synthetic visual speech production process that is worth mentioning briefly. Particularly, the article discloses use of a neural network to produce parameters to control a lipsyncing animation. This system has several drawbacks. Its primary drawback is that it relies on parametric modeling. Accordingly, it requires the use of a parameter estimator in which the single neural network converts the audio speech input features into control parameters for manipulating a specific parameterized 3D model. It is therefore model dependent. Furthermore, articulator position and movement in this system is fine-tuned for a specific speaker and is therefore also speaker dependent.
The industry has struggled to produce a general purpose solution to the problem of providing realistic computer-generated lipsyncing. Parametric modeling systems are 3D model dependent. Simpler, more adaptable prior art systems, on the other hand, fail to accurately account for the real-life parameters influencing human speech. What is needed, therefore, is a method and apparatus for generating realistic synthetic visual speech that is speaker, vocabulary, and model independent, and that accurately accounts for factors of natural human speech production without undue processing requirements. The industry is also in need of applications that take advantage of general purpose synthetic visual speech generation.
This invention provides a significant improvement in the art by enabling a method and apparatus for producing synthetic visual speech. The method of producing synthetic visual speech according to this invention includes receiving an input containing speech information. One or more visemes that correspond to the speech input are then identified. Next, the weights of those visemes are calculated using a coarticulation routine. The coarticulation routine includes viseme deformability information and calculates viseme weights based on a variety of factors including phoneme duration and speech context. A synthetic visual speech output is produced based on the visemes' weights over time (or viseme tracks). Producing the synthetic visual speech output can include retrieving a three-dimensional (3D) model (target model) for each of the visemes and morphing between selected target models based on their weights.
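The final morphing step can be illustrated with a weighted blend of target models. This is a minimal sketch under stated assumptions: every target model is a list of (x, y, z) vertices in the same order, and weights are normalized so they sum to one. Neither the normalization nor the name `morph` is specified by this description.

```python
def morph(targets, weights):
    """Blend viseme target models by normalized weight.

    targets: dict mapping viseme name -> list of (x, y, z) vertices,
             all lists with the same length and vertex order (assumed).
    weights: dict mapping viseme name -> non-negative weight for this frame.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one viseme weight must be positive")
    names = list(weights)
    n_vertices = len(targets[names[0]])
    blended = []
    for i in range(n_vertices):
        # Weighted average of this vertex across all active target models.
        vertex = tuple(
            sum(weights[n] * targets[n][i][axis] for n in names) / total
            for axis in range(3)
        )
        blended.append(vertex)
    return blended
```

Evaluating `morph` once per frame with that frame's weights from the viseme tracks would yield the sequence of blended models to render. Because the blend only needs vertex positions, it places no requirement on how the target models were built.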
Several general processes are possible based on the synthetic visual speech production method of the present invention. One such process converts separate voice and text inputs containing coincidental speech information into synthetic visual speech. In that process, the text input is classified into its constituent phonemes and the corresponding visemes are identified. Calculating the visemes' weights is accomplished by forcing an alignment between the text input and the voice input to determine each viseme's duration and context. The viseme duration and context information is then input into a coarticulation routine that uses viseme deformability information in addition to the duration and context information to produce viseme tracks.
Another process proceeds by receiving a text-only input. The text-only input is converted into a synthesized audio and visual speech output by dividing the text input into its constituent phonemes and identifying the visemes that correspond to those phonemes. A coarticulation routine is used to calculate viseme weights for use in driving a morphing operation. The morphing operation produces blended models which are used to render the synthetic visual speech output. Because this process creates its own synthesized speech, the duration and context information it provides to the coarticulation routine is known without the need for a forced alignment process.
A still further process according to this invention proceeds by receiving a voice-only input. The visemes from the voice-only input are identified by running the voice input through a speech recognition routine. The speech recognition routine determines probable phonemes for the voice input. Visemes that correspond to the probable phonemes are then identified and their weights are calculated. A synthetic visual speech production process similar to those described above can then be conducted.
A system for producing synthetic visual speech includes a receiver to receive an input representing a speech segment. A neural network is used to divide the speech segment into its phonetic components. A coarticulation engine determines the viseme tracks for visemes corresponding to the phonetic components of the speech input using deformability information. A morphing engine morphs between successive visemes based on their tracks to enable a realistic synthetic visual speech output.
A coarticulation engine for calculating viseme tracks is configured to receive data inputs corresponding to a plurality of visemes. The data inputs represent a deformability, a context, and a duration of each of the visemes. The coarticulation engine is further configured to produce data outputs containing a weight for each of the visemes. According to one embodiment, the coarticulation engine can be a dedicated viseme estimator that takes a voice input directly and converts it into viseme tracks.
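The inputs and outputs of such an engine can be sketched as a small data interface. The weighting rule below is a toy placeholder invented purely for illustration, since this passage does not give the actual formula. The names `VisemeInput` and `peak_weights`, the 0-to-1 deformability scale, and the `reference_duration` parameter are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass
class VisemeInput:
    name: str
    duration: float       # seconds
    deformability: float  # assumed scale: 0.0 = rigid, 1.0 = fully deformable


def peak_weights(sequence, reference_duration=0.1):
    """Toy rule (an assumption, not the engine's actual routine): a short,
    deformable viseme that has neighbors reaches a lower peak weight."""
    results = []
    for i, viseme in enumerate(sequence):
        # Context here is reduced to "does this viseme have any neighbor?".
        has_neighbor = i > 0 or i < len(sequence) - 1
        shortness = max(0.0, 1.0 - viseme.duration / reference_duration)
        reduction = viseme.deformability * shortness if has_neighbor else 0.0
        results.append((viseme.name, 1.0 - reduction))
    return results
```

The point of the sketch is the shape of the interface: deformability, context, and duration go in per viseme, and one weight per viseme comes out. Any realistic rule would replace the body of `peak_weights`.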
Several methods for using the synthetic speech production systems and processes of this invention are also contemplated. For instance, a method for generating a user-customizable 3D lipsyncing greeting card begins by receiving a user-defined input containing speech information. This input is converted into a customized electronic greeting card that includes a 3D visual speech animation synchronized with an audio speech output corresponding to the input. The 3D visual speech animation can be customized based on user-selected configurability options that can include selecting a character, texture mapping, supplying a background image, enabling auto-expressions, selecting emotions, selecting a singing voice, and selecting voice characteristics. Once created, the customized electronic greeting card is delivered to a recipient identified by the user.
In a method for producing a real-time computer animated lipsyncing, a voice input is supplied to a first neural network to produce a phoneme output. The phoneme output from the first neural network is provided to a second neural network to produce a viseme output. The viseme output is supplied to an animation generator to render an animated 3D lipsyncing image in real-time in substantial synchronism with an audio speech output. The lipsyncing can be produced "live" by reducing system buffers so that the output is delivered substantially simultaneously with the voice input. A viseme neural network can be used to reduce system buffers and minimize latency.
An apparatus for producing a real-time 3D lipsyncing animation includes a frame processor to identify frames of a voice input. A first neural network is configured to receive the frames of the voice input and to identify a probable phoneme corresponding to each of the frames. A second neural network is provided to receive the probable phonemes and identify viseme weights for one or more visemes active during each of the frames. A filter is configured to filter the viseme weights to produce a filtered and smoothed viseme track for each of the active visemes. A rendering engine is configured to render a 3D lipsyncing animation based on the viseme tracks in substantial synchronization with an audio output corresponding to the voice input. The lipsyncing animation of this embodiment can be rendered "live," that is, substantially simultaneously with the voice input.
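The filtering stage can be illustrated with a simple smoothing filter over per-frame viseme weights. The type of filter is not specified in this passage; the centered moving average below, along with the name `smooth_track` and the `window` parameter, is an assumption chosen for brevity.

```python
def smooth_track(weights, window=3):
    """Smooth one viseme's per-frame weights with a centered moving average."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(weights)):
        # Shrink the window at the edges instead of padding the track.
        lo = max(0, i - half)
        hi = min(len(weights), i + half + 1)
        smoothed.append(sum(weights[lo:hi]) / (hi - lo))
    return smoothed
```

A centered window looks ahead by `window // 2` frames, which adds that much latency. A live system might prefer a filter that uses only past frames.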
A method for producing a synthesized visual communication over a network includes receiving an input containing speech information into a first networked device. The input is converted into phonetic speech components using a first neural network. The phonetic speech components are converted into weighted visual speech information (such as, but not limited to, viseme tracks) using a coarticulation routine. A 3D lipsyncing animation is then created based on the weighted visual speech information. The lipsyncing animation is displayed in substantial synchronism with an audibilization of a voice output through a second networked device. Any of the middle functions can be configured to take place on either the first or second networked devices, as desired.
A method for providing real-time synthetic communication includes providing an input containing speech information into a first one or more of a plurality of devices. The input is converted into viseme tracks using a coarticulation routine. A communication comprising an audio output and a synthesized visual speech animation is created based on the viseme tracks. The communication is output through a second one or more of the devices.
An email reader is also provided which includes a phoneme neural classifier for converting email text or email voice attachments into phonemes. An audio speech synthesizer is configured to synthesize an audio voice output based on the text input. The email reader further includes a coarticulation engine for determining weights of visemes associated with each of the phonemes and a morphing engine for morphing between target viseme models based on viseme weights. Finally, a rendering engine is provided for rendering an email lipsyncing animation based on data from the morphing engine. The email reader can also include user-customization options for allowing a user to select a lipsyncing character for the animation and a voice-type for the voice output. These customization options can be further configured to allow independent selection of the character and voice type for each of a plurality of email senders.
The foregoing and other objects, features and advantages of the invention will become more readily apparent from the following detailed description of a preferred embodiment of the invention which proceeds with reference to the accompanying drawings.