This invention relates to an apparatus and method for automatically producing lip-synching between facial animation and a spoken sound track.
There are a number of existing techniques for synchronizing facial animation to a spoken sound track (which, as used herein, is intended to generically include any live or recorded acoustical or electrical representation of communicative sounds). In the common rotoscoping method, an actor saying the dialog is filmed, and the actor's mouth positions are copied onto the animated character. This, and other manual techniques, have the obvious disadvantage of high cost in labor and time.
In Pearce et al., "Speech And Expression: A Computer Solution To Face Animation", Proceedings, Graphics Interface, 1986, there is described an approach to synchronized speech in which a phonetic script is specified directly by the animator. The phonetic script is also input to a phoneme-to-speech synthesizer, thereby achieving synchronized speech. This approach is appropriate when the desired speech is specified in textual form and the quality of rule-based synthetic speech is acceptable for the purpose. A drawback of this approach is that it is difficult to achieve natural rhythm and articulation when the speech timing and pitch are defined in a script or derived by a rule-based text-to-speech synthesizer. The prosody quality can be improved somewhat by adding information such as pitch and loudness indications to the script.
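By way of illustration only, a script-driven approach of this general kind can be sketched as follows. The sketch assumes a script given as (phoneme, duration) pairs and a small phoneme-to-mouth-shape table; the table entries, the shape names, and the function name `mouth_track` are hypothetical and are not taken from Pearce et al.

```python
# Hypothetical phoneme-to-mouth-shape table (illustrative values only).
MOUTH_SHAPES = {
    "M": "closed", "B": "closed", "P": "closed",
    "AA": "open", "EH": "mid", "OW": "rounded", "L": "tongue_up",
}

def mouth_track(script, frame_rate=24):
    """Expand a timed phonetic script into one mouth shape per animation frame.

    script: list of (phoneme, duration_in_seconds) pairs.
    Phonemes missing from the table fall back to a neutral "rest" shape.
    """
    frames = []
    elapsed = 0.0
    for phoneme, duration in script:
        elapsed += duration
        # Frame index at which this phoneme ends, rounded to the nearest frame.
        end_frame = round(elapsed * frame_rate)
        shape = MOUTH_SHAPES.get(phoneme, "rest")
        frames.extend([shape] * (end_frame - len(frames)))
    return frames

# Example script for the word "hello" with assumed timings.
hello = [("HH", 0.125), ("EH", 0.25), ("L", 0.125), ("OW", 0.5)]
track = mouth_track(hello)
```

Because the same script would also drive the synthesizer, the frames and the audio share one timeline; the naturalness of the result is limited by the timings written into the script.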
In the U.S. Pat. No. 3,662,374 of Harrison et al., there is disclosed a system for automatic generation of mouth display in response to sound. Speech sounds are filtered by banks of filters, and a network of potentiometers and summers is used to generate signals that automatically control mouth width and upper and lower lip movement of an animated mouth.
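The general principle of deriving mouth controls from filtered speech can be sketched in software as follows. This is a minimal digital analogue under stated assumptions, not the analog circuit of the Harrison et al. patent: the band edges, the weights (standing in for potentiometer settings), the energy normalization, and the names `band_energies` and `mouth_controls` are all hypothetical.

```python
import math

def band_energies(frame, sample_rate, bands):
    """Return the energy of one audio frame in each (low_hz, high_hz) band.

    Uses a naive DFT over the bins inside each band, which stands in for
    a bank of band-pass filters. Each band is half-open: [low_hz, high_hz).
    """
    n = len(frame)
    energies = []
    for low_hz, high_hz in bands:
        k_lo = math.ceil(low_hz * n / sample_rate)
        k_hi = math.ceil(high_hz * n / sample_rate)
        energy = 0.0
        for k in range(k_lo, k_hi):
            re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
            im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
            energy += (re * re + im * im) / (n * n)
        energies.append(energy)
    return energies

def mouth_controls(energies, weights):
    """Form each control signal as a weighted sum of normalized band energies."""
    total = sum(energies) or 1.0  # avoid division by zero on silence
    normalized = [e / total for e in energies]
    return {name: sum(w * e for w, e in zip(ws, normalized))
            for name, ws in weights.items()}

# Assumed band edges (Hz) and weights; illustrative values only.
BANDS = [(100, 1000), (1000, 2500), (2500, 3900)]
WEIGHTS = {
    "mouth_width": [0.2, 0.5, 1.0],
    "upper_lip": [0.6, 0.8, 0.3],
    "lower_lip": [1.0, 0.4, 0.1],
}

# One 20 ms frame of a 500 Hz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
frame = [math.sin(2 * math.pi * 500 * i / SAMPLE_RATE) for i in range(160)]
energies = band_energies(frame, SAMPLE_RATE, BANDS)
controls = mouth_controls(energies, WEIGHTS)
```

In this sketch a tone confined to the lowest band drives the lower-lip control most strongly, because that control is given the largest weight for that band. Changing the weights changes the mapping without changing the filtering stage.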
In the U.S. Pat. No. 4,260,229 of Bloomstein, there is disclosed a system wherein speech sounds are analyzed, digitally encoded, and transmitted to a data memory device that contains a program for producing visual images of lip movements corresponding to the speech sounds. In this patent it is stated that an actor speaks and his voice is encoded into a phoneme code which includes sixty-two phonemes (for the English language). Each phoneme is coded at a desired number of frequencies, three being specified as an example. A commercially available speech encoder is suggested for the job. A drawback of this system, however, is the need for such a sophisticated speech encoding capability, with its attendant cost and processing time.
The U.S. Pat. No. 4,600,281 of Bloomstein discloses a method for altering facial displays in cinematic works which does not use phonetic codes.
It is among the objects of the present invention to provide an improved automatic lip-synching apparatus and method which provide realistic representations of mouth movements, but without undue cost, complexity, or operating time requirements.