The present invention relates generally to lip sync animation software tools and more specifically it relates to a text-derived speech animation tool for producing simple, effective animations of digital media content that educate, entertain, and inform viewers by the presentation of speaking digital characters. The invention makes the creation of digital talking characters easy while producing results of professional quality and realism.
It can be appreciated that lip sync animation tools have been in use for years. Typically, prior art such as Bellomo et al. (U.S. Pat. No. 6,766,299) and Major (U.S. Published Application No. 2003/0040916), and the HIJINX product cited in Bellomo et al., exemplify the more relevant prior art, along with such animation lip-sync software products as MORPH MAGIC (for 3D Studio Max), MORPH GIZMO (for Lightwave), SHAPE SHIFTER (for Power Animator), MIMIC (made by Lip Sinc Co.), SMIRK (made by Lambsoft Co.), FAMOUS FACE (made by Famous Technology Co.), TALKMASTER (made by Comhertz Co.), and AUTOMATIC LIPSYNC (made by Fluent Speech Technologies Co.). Existing products generally can be divided into three categories, and the problems with each are best described category by category. The first category (A) comprises manual lip-syncing products, which generally require digital artists to manually adjust individual controls for aspects of a character's mouth to sync the animation of the character's lips to a pre-recorded voice track. Every new speech to be animated requires the same manual construction of animation information in an attempt to synchronize with the voice track. The second category (B) comprises voice-driven products, where a character's lip sync animation is automatically constructed from a processed analysis of the recorded speech of a real person. The third category (C) comprises text-driven speech animation programs, where a digital artist enters text as dialogue for the characters and the product automatically generates both a speech animation of the character and a sound track of the character's voice.
The main problem with conventional lip sync animation tools is the complexity of trying to sync lip motions to speech in a conscious manner when, in reality, lip motion during speech is a totally unconscious and automatic result of the intent to vocalize words. Category (A) products are most prone to this problem. The user of these products must consciously and logically attempt something that is fundamentally subconscious, automatic, and never thought about in real life. In real life, a speaker's mouth motions and voice are interlocked by their functional relationship (mouth postures create the acoustics from which speech sounds derive), whereas in digital animation processes the recorded voice has no inherent connection to mouth animation, so a connection must be built by a digital artist. The process is also time consuming, and changes in the speech content (the audio recording of the speech) require extensive efforts to modify the animation accordingly.
Another problem with conventional lip sync animation tools is the separation of the processes for generating the voice recording and performing the facial animation. Both Category (A) and (B) products are prone to this problem. A voice talent is recorded speaking the desired dialogue. This is done in a recording studio with professional audio recording specialists, equipment, and facilities. The voice recording is then given to digital artists, who use it to create the facial animation. If at a later time there is a desire or need to alter the dialogue content for any reason, the entire process of bringing the voice talent into a recording studio must be repeated before the digital artist has a new sound track from which to produce the new animation sequence. By making the voice recording a completely separate process requiring separate equipment, facilities, and skilled employees, the digital animation process is unnecessarily complicated.
Another problem with conventional lip sync animation tools is the structure of the speech processing. Category (B) and (C) products are most prone to this problem. Products in these categories currently use an unnatural division of sound components (phonemes) to process a voice recording (Category B products) or to translate text into synthesized speech (Category C products) as their operational system. Speech may be divided or broken down into phoneme modules (fundamental speech sounds) or syllabic modules (the syllables people actually use in real speech). A phoneme-based process is simpler in that there are fewer phonemes than syllables in speech, but the result is unnatural because real speech is syllabic, and all of the dramatic quality and character of human speech derives from the modulation, emphasis, and pace of the syllables, not the phonemes.
To properly appreciate the uniqueness of the invention, a detailed discussion of phonemes is appropriate. Phonemes were developed by linguists and speech pathologists over a century ago as a way of dissecting speech into audible base components, to allow speech patterns, languages, and dialects to be studied in a precise academic way. It is a system based heavily on the audible aspects of speech that included only basic references to mouth shapes as they relate to the audible elements (such as classifying some sounds as “closed consonants” because the sound must be formed by an action of closing the lips, or as plosives because the sound is created by a rapid opening and thrusting of the lips). But these lip postures and motions were greatly simplified, because the scholars of the time had no capacity to record the full range of motion and had no particular concern for studying the minute subtleties of the lip motions. One generalized posture sufficed for their academic purposes.
But the phoneme system was never intended as a method of reconstructing lifelike lip animations for physical robotic and digitally embodied characters. No attention was paid to the true range of lip motions and postures, because phoneme studies were applied only to helping other humans talk correctly, not to helping artificial character entities talk correctly, and humans are taught to speak primarily by comparing the sound their lips make to a reference pronunciation of the word. So a linguist or speech pathologist helps a real person learn to move the lips correctly mainly by helping the person form the correct audible sound. Again, the emphasis is on the audible, and correct lip motions tend to result if the audible sounds are formed correctly.
But a robotic or digitally animated character has no capacity to form sounds in and of itself. There is no acoustic relationship between its mouth postures and the spoken audio track. So whereas teaching a human to speak with correct pronunciation can be well accomplished using phonemes, the phoneme system fails to achieve anything close to perfection when teaching or programming a robotic or digitally animated character to speak with truly realistic lip sync motions, because the phoneme system was never designed or intended for this application. The detail in which the audio components were studied and categorized was never matched with an equally detailed study and documentation of how the lips move every 1/24th of a second (the duration of one frame in a film sequence) through the speech cycle.
Phoneme systems intended for digital character animation are inherently flawed by this fact. They were applied to robotic and digital character speech animation because the phoneme data set is a much smaller database than syllabic data sets, and early developers needed the smaller databases to operate in the early digital systems with slow processor speeds and limited RAM. Further, the phoneme system had the apparent respectability of time honored academic use, despite the fact that in truth, the phoneme system was never intended for this application (to reconstruct lip sync motions in artificial entities appearing to talk like humans) and thus never fully developed for this purpose.
Humans speak in syllables, each a group of sounds enfolded in one expulsion of breath. All modulations of volume, pitch change, pace, and fluidity that give human speech its dramatic presentation are based on syllabic emphasis.
The phoneme system dissects speech below the syllable level, losing the syllable structure in the process, and by losing the syllable structure, the phoneme system loses the capacity to replicate human speech in its most natural form.
Consider the word “master” as an example: the phoneme system breaks the word down into five component phonemes, “m,” “ah,” “ss,” “tah,” and “er,” while the syllabic system breaks the word down into only two syllables, “mah” and “stir.” True lip motions to form sounds flow fluidly throughout the syllable being spoken, and dividing the motions into phoneme components isolates the motions, making the transition from one to the next abrupt and discontinuous.
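The contrast just described can be sketched in code. The following is a minimal illustration only (not part of any cited product or of the invention itself); the two lookup tables are hypothetical and hold only the decompositions of “master” discussed above:

```python
# Hypothetical lookup tables holding the two decompositions of "master"
# discussed above; a real system would require a full pronunciation dictionary.
PHONEME_TABLE = {"master": ["m", "ah", "ss", "tah", "er"]}
SYLLABLE_TABLE = {"master": ["mah", "stir"]}

def segment(word, table):
    """Return the stored decomposition of a word, or [] if the word is unknown."""
    return table.get(word.lower(), [])

# The phoneme system yields five isolated units; the syllabic system yields
# two units, each spanning a continuous arc of lip motion.
phoneme_units = segment("master", PHONEME_TABLE)
syllable_units = segment("master", SYLLABLE_TABLE)
```

A syllabic system thus requires a larger table (there are more syllables than phonemes), but each unit it produces corresponds to a naturally continuous stretch of lip motion rather than an isolated fragment.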
While these devices may be suitable for the particular purposes which they address, and for cartoonish levels of realism, they are not as suitable for producing truly realistic, effective animations of digital media content that educate, entertain, and inform viewers through the presentation of speaking digital characters. What is lacking is the ability of such systems to reliably produce the full natural range of speech mouth motions, to match the audible component with motions accurate to 1/24th of a second, and to replicate the irregularities of actual speaking mouths (such as asymmetrical mouth postures).
Further, phoneme-based systems such as Bellomo et al. tend to place static mouth-posture markers at intervals along the processed speech track, and then rely upon algorithmic formulas for generating curves through those points to produce the final motion graphs that determine the mouth posture at every 1/24th-of-a-second frame render of the animation. But these algorithmically generated curves do not correctly replicate true mouth motions, which do not follow any algorithmic protocol. True mouth movements of humans speaking follow different protocols of motion and priority based on the entire syllable of sound to be spoken, the relation of the vowels before and after a consonant, and the degree of clarity and enunciation the speaker applies to the intent to speak. These multiple mitigating factors cannot be reduced to a simplistic algorithmic formula of blended curves. Any prior art, such as Bellomo et al., is therefore inherently compromised by its implementation of static markers of mouth shape placed on a timeline and blended by artificial protocols.
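The prior-art mechanism criticized here can be sketched generically. The following illustrates static posture markers blended by a simple formula (linear interpolation at 24 frames per second); it is a hedged sketch of the general keyframe-blending technique, not the actual formula of Bellomo et al. or any cited product, and the marker values are invented for the example:

```python
FPS = 24  # one animation frame = 1/24th of a second

def blend_markers(markers, n_frames):
    """Blend static mouth-posture markers into a per-frame motion curve.

    markers: list of (frame_index, openness 0..1) pairs, assumed to
    include markers at or before frame 0 and at or after the last frame.
    Returns one interpolated openness value per frame.
    """
    values = []
    for f in range(n_frames):
        # nearest markers at or before / at or after this frame
        prev = max((m for m in markers if m[0] <= f), key=lambda m: m[0])
        nxt = min((m for m in markers if m[0] >= f), key=lambda m: m[0])
        if prev[0] == nxt[0]:
            values.append(prev[1])  # frame lands exactly on a marker
        else:
            t = (f - prev[0]) / (nxt[0] - prev[0])
            values.append(prev[1] + t * (nxt[1] - prev[1]))
    return values

# Three static markers over roughly half a second of animation:
curve = blend_markers([(0, 0.0), (6, 1.0), (12, 0.2)], 13)
```

Every in-between frame is manufactured by the formula rather than sampled from a real mouth, which is precisely the limitation the passage above describes: the blend is smooth, but it is the smoothness of an equation, not of a speaking mouth.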
The invention takes full syllables of spoken sound and adds full motion graphs, sampled from real human speech motion and modified to include the motions of the jaw and tongue (which motion capture cannot record), and interlocks these tested and verified full motion graphs (of the most realistic mouth motions for saying the given syllable of speech) with the audio file of the spoken syllable. Once interlocked, the speech sound and mouth motion are in perfect sync and are preserved as such throughout all use of the invention.
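The interlocked unit described above can be sketched as a data structure. This is a minimal illustration under assumed conventions; the class name, field names, and 24 fps motion sampling are hypothetical, not a disclosure of the actual implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: the audio/motion pairing cannot be altered
class SyllableUnit:
    text: str            # the syllable, e.g. "mah"
    audio_samples: tuple # the recorded sound of the spoken syllable
    motion_frames: tuple # per-frame mouth posture values, sampled at 24 fps

def assemble(units):
    """Concatenate syllable units into one animation.

    Because each unit carries its own audio and motion together, the two
    streams stay in lockstep by construction and cannot drift apart."""
    audio, motion = [], []
    for u in units:
        audio.extend(u.audio_samples)
        motion.extend(u.motion_frames)
    return audio, motion

# Illustrative values only — two syllables of "master":
mah = SyllableUnit("mah", audio_samples=(0.1, 0.2, 0.3), motion_frames=(0.0, 0.5))
stir = SyllableUnit("stir", audio_samples=(0.2, 0.1), motion_frames=(0.6, 0.3, 0.1))
audio, motion = assemble([mah, stir])
```

The design point is that synchronization is a property of the unit itself rather than something the artist must build and maintain: rearranging, inserting, or deleting syllables rearranges sound and motion together.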
The invention makes the creation of digital talking characters both easy and effective while operating at the syllabic level, which is the most natural operating structure for replicating true human speech. No prior art does this. The invention further takes human lip motions appropriate for perfect lip sync and interlocks those motions with the vocal audio sound tracks, in a manner that ensures they stay in sync when operated upon by the software user to create a specific animation project.
In these respects, the text-derived speech animation tool according to the present invention substantially departs from the conventional concepts and designs of the prior art, in workflow, tools, intuitive operations, and output results. In so doing, this invention provides an apparatus primarily developed for the purpose of producing realistic, dramatically entertaining, and humanly varied character animations where a digital character must speak with perfect lip sync, to enhance the believability of that character.