1. Field of the Invention
The present invention relates generally to cinematic works, and particularly to altered cinematic works where the facial motion and audio speech vocal tract dynamics of a voice dub speaker are matched to animate the facial and lip motion of a screen actor, thereby replacing the sound track of a motion picture with a new sound track in a different language.
2. Background of the Invention
In the field of audio dubbing, there are many cinematic and television works where it is desirable to have a language translation dub of an original cinematic or dramatic work, where the original recorded voice track is replaced with a new voice track. In one case, it is desirable to re-record a screen actor's speech and dub it onto the original visual track. In another case it is desirable to have the audio dub be exactly synchronized with the facial and lip motions of the original cinematic speaker. Instead of needing to re-shoot the actor speaking the scene again, the dubbing process provides an opportunity to change the voice.
Prior approaches to generating lip and mouth motion synchronization to new voice sound tracks have largely been manual processes using computer audio and graphical processing and special effects tools. There have been recent developments towards automating the voice dubbing process using 2D based techniques to modify archival footage (Bregler), using computer vision techniques and audio speech recognition techniques to identify, analyze and capture visual motions associated with specific speech utterances. Prior approaches have concentrated on creating concatenated based synthesis of new visuals to synchronize with new voice dub tracks from the same or other actors, in the same or other languages. This approach analyzes screen actor speech to convert it into triphones and/or phonemes and then uses a time coded phoneme stream to identify corresponding visual facial motions of the jaw, lips, visible tongue and visible teeth. These single frame snapshots or multi-frame clips of facial motion, corresponding to speech phoneme utterance states and transformations, are stored in a database and are subsequently used to animate the original screen actor's face, synchronized to a new voice track that has been converted into a time-coded, image frame-indexed phoneme stream.
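The conversion of a time coded phoneme stream into an image frame-indexed phoneme stream can be illustrated with a minimal sketch. The function name, the (label, start, end) tuple layout and the 24 frames per second default are assumptions made for illustration only and are not taken from the prior approaches described above.

```python
def frame_index_phonemes(phonemes, fps=24.0):
    """Map (label, start_sec, end_sec) entries to one phoneme label per frame.

    Hypothetical helper: the tuple layout and the fps default are assumed.
    Frames not covered by any phoneme are left as None.
    """
    if not phonemes:
        return []
    total_frames = int(round(max(end for _, _, end in phonemes) * fps))
    frames = [None] * total_frames
    for label, start, end in phonemes:
        first = int(round(start * fps))
        last = int(round(end * fps))
        for i in range(first, min(last, total_frames)):
            frames[i] = label
    return frames


# Example: two phonemes spanning half a second of a 24 fps image track.
stream = [("HH", 0.0, 0.125), ("AY", 0.125, 0.5)]
per_frame = frame_index_phonemes(stream)
```

Each entry of the resulting list can then serve as an index from an image frame to the viseme that should appear in it.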
Concatenated based synthesis relies on acquiring samples of the variety of unique facial expression states corresponding to pure or mixed phonemes, as triphones or diphones. The snapshot image states or the short clip image sequences are used as key frame facial speech motion image sets, respectively, and intermediate frames between key frames are interpolated using optical morph techniques. This technique is limited by being essentially a symbol system that uses atomic speech and facial motion states to synthetically animate continuous facial motion by identifying the facial motions and interpolating between key frames of facial motion. The actual transformation paths from a first viseme state to a second viseme state are estimated using short clips, by hand from frame to frame, or by standard morph animation techniques using various curve functions to smooth the concatenation process.
The invention comprises a method for accumulating an accurate database of learned motion paths of a speaker's face and mouth during speech, and applying it to directing facial animation during speech using visemes.
Visemes are collected either from existing legacy material or, where access to the actors is available, by having them generate reference audio video "footage". When the screen actor is available, the actor speaks a script eliciting all the different required phonemes and co-articulations, as would be commonly established with the assistance of a trained linguist. This script is composed for each language or actor on a case by case basis. The script attempts to elicit all needed facial expressive phonemes and co-articulation points.
The sentences of the script first elicit speech-as-audio, to represent mouth shapes for each spoken phoneme as a position of the mouth and face. These sentences then elicit speech-as-motion, to derive a requisite range of facial expressive transformations. Such facial transformations include those as effected from (1) speaking words for capturing the facial motion paths as a variety of diphones and triphones needed to represent the new speech facial motions, and (2) making emotional facial gestures. Common phoneme range elicitation scripts exist as alternate sets of sentences, which are used to elicit all the basic phonemes, such as the "Rainbow Passage" for example. To elicit all types of transformation between one phoneme and another requires using diphones, the sound segment that is the transformation between one phoneme and another, for all the phonemes. As Bregler confirmed in U.S. Pat. No. 5,880,788, triphones can have many thousands of different transformations from one phoneme sound and corresponding mouth shape to another. Triphones are used to elicit the capture of visual facial and mouth shape transformations from one speech phoneme mouth position dynamically to another phoneme mouth position and capture the most important co-articulation facial motion paths that occur during speech.
The actual motion path of a set of fixed reference points, while the mouth moves from one phoneme to another, is recorded and captured for the entire transformation between any set of different phonemes. As the mouth naturally speaks one speech phoneme and then alters its shape to speak another phoneme, the entire group of fixed reference points moves and follows a particular relative course during any phoneme to phoneme transformation. Triphones capture the requisite variety of facial and mouth motion paths. Many examples of these different phoneme to phoneme mouth shape transformations are captured from a speaking actor.
There are two sources of capture: target footage and reference footage. The target footage is the audio visual sequence. Elicitation examples are selected to accommodate the phoneme set of the language of the target footage, creating a database. This database of recorded mouth motions is used as a training base for a computer vision motion tracking system, such as the eigen-images approach described in Pentland et al., "View-Based and Modular Eigenspaces for Face Recognition", Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 1994, pp. 84-91. The computer vision system analyzes the training footage by means of practicing its vision analysis on the visual footage to improve fixed reference point tracking and identification for each frame. A speech recognition system or linguist is used to recognize phonemes in the training footage which can be used as index points to select frames for usage as visemes.
If the training footage is of a screen actor, it permits a computer vision motion tracking system to learn the probable optical flow paths for each fixed reference point on the face for different types of mouth motions corresponding to phoneme transitions. These optical reference point flow paths, identified during speech facial motions, are recorded and averaged over numerous captured examples of the same motion transformations.
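Averaging one reference point's flow path over several captured examples of the same transformation can be sketched as follows. This is only an illustrative sketch: the resampling of each example to a common number of samples by linear interpolation, and the function names, are assumptions, since captured examples will generally span different numbers of frames.

```python
def resample_path(path, n):
    """Linearly resample a list of (x, y) points to n evenly spaced samples."""
    if n < 2 or len(path) < 2:
        raise ValueError("need at least two points and two samples")
    out = []
    for k in range(n):
        t = k * (len(path) - 1) / (n - 1)   # fractional index into the path
        i = min(int(t), len(path) - 2)      # segment that contains t
        frac = t - i
        (x0, y0), (x1, y1) = path[i], path[i + 1]
        out.append((x0 + frac * (x1 - x0), y0 + frac * (y1 - y0)))
    return out


def average_paths(paths, n=5):
    """Average several example paths of one reference point, sample by sample."""
    resampled = [resample_path(p, n) for p in paths]
    count = len(resampled)
    return [
        (sum(p[k][0] for p in resampled) / count,
         sum(p[k][1] for p in resampled) / count)
        for k in range(n)
    ]


# Two captured examples of the same transition, with different frame counts.
example_a = [(0, 0), (2, 0)]
example_b = [(0, 2), (1, 2), (2, 2)]
mean_path = average_paths([example_a, example_b], n=3)
```

In practice the same averaging would be repeated for every fixed reference point in the group, so that the relative course of the whole constellation is preserved.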
A database of recorded triphone and diphone mouth transformation optical flow path groups is accumulated. For commonly used speech transformations, commonly used spline based curve fitting techniques are applied to estimate and closely match the recorded relative spatial paths and the rate of relative motions during different transformations. The estimated motion path for any reference point on the face, in conjunction with all the reference points and rates of relative motion change during any mouth shape transformation, is saved and indexed for later production usage.
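As a hedged illustration of a spline based estimate of a reference point's motion path, the sketch below passes a uniform Catmull-Rom spline through recorded positions. The choice of Catmull-Rom, which interpolates the samples rather than performing a least squares fit, is an assumption made for brevity; the text does not name a specific spline, and the function names are hypothetical.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Uniform Catmull-Rom position between p1 and p2 for t in [0, 1]."""
    t2, t3 = t * t, t * t * t
    return tuple(
        0.5 * ((2 * b) + (-a + c) * t
               + (2 * a - 5 * b + 4 * c - d) * t2
               + (-a + 3 * b - 3 * c + d) * t3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def spline_path(points, samples_per_segment=4):
    """Sample a smooth path through every recorded point.

    The first and last points are duplicated so that the end segments
    have the four neighbours the spline needs.
    """
    padded = [points[0]] + list(points) + [points[-1]]
    path = []
    for i in range(1, len(padded) - 2):
        for s in range(samples_per_segment):
            t = s / samples_per_segment
            path.append(catmull_rom(padded[i - 1], padded[i],
                                    padded[i + 1], padded[i + 2], t))
    path.append(tuple(float(v) for v in points[-1]))
    return path


# Recorded positions of one lip reference point during a transformation.
recorded = [(0, 0), (1, 1), (2, 0)]
smooth = spline_path(recorded, samples_per_segment=2)
```

The spacing of the recorded samples along the curve carries the rate of motion, so a production system would likely store timing alongside the spatial path.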
An emotional capture elicitation process is effected by having the actor get in the mood of a list of different basic emotional expressions. The actor then performs and visually records examples of those emotional expressions and changes. These expressions and changes are simultaneously recorded with more than one camera position, such as face forward, one of three-quarter profile left, or three-quarter profile right side, or lower upward looking face shot, or head level in full profile. Examples are recorded as static image positions of the face, and also as short audio video clips of mouth motion. This includes recording examples of the emotional expressions with closed mouth and different open mouth vowels, for greater accuracy.
The two dimensional facial image of the actor is initially manually overlaid with a finite number of control points. These points have been established commonly within the animation industry and prior techniques applied to human actors to control facial motion animation, including Adobe's After Effects, Pinnacle Systems' Commotion DV 3.1 video compositor tool, Discreet's Flame and Pixels3D animation tool, for example.
This initial establishment of fixed reference control points mapped on one or more reference images of an actor's face is usually done manually. The points include key facial features, such as a constellation of points that outline the chin, the outside of the mouth and the inside edge of the lips. The facial and mouth expressive motion paths are tracked using common techniques established in computer vision, including the eigen-images approach described in the Pentland et al. algorithm. The commonly available computer vision techniques track the motion of the predesignated control points for the face and learn to follow and annotate all image frames with the position of the mouth.
The elicited reference audio-visual database is then supplemented by the original target screen footage, to be re-synchronized to another language. The computer vision tracks and estimates the position of the control points mapped to the mouth as they move in the target production footage.
Each facial image viseme represents a specific facial expressive primitive of the voice for different phonemes. The morph target transformation subsystem has learned, from the computer vision analysis on a real actor, to acquire and tune the right optical path transform for each different actor expressive transformation. These transformations can include speech and emotional expression, and idiosyncratic gesture based expressions.
2D visemes are mapped and controlled using X/Y points, and after the texture map of a 2D viseme capture is normalized to a 3D model size, it is then mapped onto a 3D facial muzzle model, where the X/Y points representing fixed reference control points now become what are called 'CVs' or control vectors, in the 3D facial model.
Sizing the viseme to fit the actor on screen facial position, scale and orientation is done using either manual or automatic techniques. The manual approach is to hand place a model wireframe overlay onto an image frame, and rotate it for correct orientation on X and Y axes (virtual camera angle) and scale it for correct matching apparent distance to the camera. The 3D wireframe overlay is used as a guide to exactly fit and wrap the 2D viseme muzzle to the 3D wireframe. The 3D wireframe is then fitted to match the 2D original actor head and face orientation. The wireframe is moved to fit, with the human operator visually adjusting the placement of the wireframe for each frame. The last match moved visemes from the prior frame are incrementally moved and scaled and rotated to fit the current frame image of the actor head and face muzzle.
The viseme CV fixed reference control points are exactly registered to the original screen actor facial position for the eyes and nose position, to place them exactly to the head position, scaled to the correct size. The moving and scaling and positioning actions are done manually using standard 3D computer graphics overlay and compositing tools, such as Pixels3D. The actor viseme associated with the new dub track for the image frame is applied. The applied visemes are aligned and registered to the correct position of the screen actor's face either manually or preferably by using computer vision registration techniques, such as an automatic visual tracking algorithm to identify the lip and jaw edge boundaries of the screen actor's face. Contour tracking algorithms can be used to identify the outside edges of the actor's lips. A useful contour tracking algorithm is described in Kass et al., "SNAKES: Active Contour Models", Proc. of the First International Conference on Computer Vision, London 1987.
Another computer vision tracking approach can be more suitably applied to lower resolution images, such as video, using a gray scale level based algorithm such as the eigen-images approach described in Pentland et al. The locations of the boundaries of the lips and jaw identified by the computer vision system permit the automatic placement and registration of new dub track visemes to fit the original screen actor facial positioning. The viseme muzzle patches are selected from the actor viseme database based on the dub speech script and speech recognition derived phoneme-to-viseme match sequence. The sequence of visemes is locked into place and correct orientation and scale in the actor image, and then the viseme muzzle patches are altered to match the light, color, and texture of the original screen actor muzzle, which alteration is enabled through mapped samples from the original screen actor images as seed texture patch references for a plurality of muzzle area texture patches. The visual texture, light and hue of the screen actor muzzle patch samples are applied to the viseme muzzle patch. The actor viseme is texture matched and match lighted for the screen actor viseme to be applied for the image frame being processed. The viseme muzzle patch boundary is matched when it visually stitches seamlessly to the surrounding face of the original screen actor in the original footage. The result is such that the image frames show sequential lip motion that is now visually synchronized to the new dub speech track.
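One simple way to alter a viseme muzzle patch toward the light and color of a sampled screen actor patch is to match the mean and spread of each color channel. The sketch below is an assumption offered for illustration, not the matching procedure of the invention; the function name and the 0 to 255 value range are hypothetical, and texture, shadow and motion blur are not addressed.

```python
from statistics import mean, pstdev


def match_channel(source, reference):
    """Shift and scale source values so their mean and spread match reference.

    source and reference are flat lists of one color channel (0-255 assumed),
    taken from the viseme muzzle patch and the screen actor seed patch.
    """
    src_mean, ref_mean = mean(source), mean(reference)
    src_std, ref_std = pstdev(source), pstdev(reference)
    gain = ref_std / src_std if src_std > 0 else 1.0
    # Clamp to the valid range after the adjustment.
    return [min(255.0, max(0.0, (v - src_mean) * gain + ref_mean))
            for v in source]


# A bright reference viseme channel matched to a darker screen sample.
viseme_red = [100, 120, 140]
screen_red = [50, 60, 70]
matched_red = match_channel(viseme_red, screen_red)
```

The same adjustment would be applied to each channel, and could be applied per seed patch so that lighting gradients across the muzzle are followed locally.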
In one embodiment of the invention the sampling of texture patches from the screen actor image is automatically performed. This is accomplished at the point in the production 'footage' where speech visemes are first identified through voice recognition. The identified visemes are scaled, rotated, and locked into correct vertical and three dimensional orientation relative to the camera, to match the screen actor head and face in the performance. The whole muzzle as well as patches of light, color and texture from the sampled corresponding original actor muzzle in an image frame is collected as a set of patches. These patches are stored and mapped relative to known wireframe control vertices. These are later used as seeds to paint and smooth the actor reference viseme muzzle patch image to match the actor original performance image lighting and color. The process of texture matching the reference actor viseme muzzle patch to the original screen performance actor facial lighting can be manually or automatically accomplished. The lighting and color altered reference viseme muzzle patch sufficiently matches the original screen actor image, in terms of hue, saturation, lightness, lighting angle, shadow, skin texture, motion blur and atmospheric diffusion, for example. When texture mapping is completed for all screen visemes, the facial muzzle wireframe morph transformation is also effected for the visual texture maps. The morph process generates in-between images from viseme-to-viseme. The morphing of a viseme with another viseme over time produces a bi-viseme, which is equivalent to a speech diphone. Morphing generates intermediate image frames between viseme occurrences in the footage. The morph transformation is applied to the wireframe model and also the locked visual texture map. Morphing generates any intervening image frame for mouth, lip and jaw muzzle shape positioning during speech.
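The generation of in-between frames from viseme to viseme can be sketched for the wireframe control vertices alone. The sketch assumes straight-line, evenly timed interpolation, which is the simplest possible morph path, and uses hypothetical names; it does not morph the texture map.

```python
def morph_frames(viseme_a, viseme_b, n_between):
    """Return n_between intermediate vertex sets between two visemes.

    Each viseme is a list of (x, y) control vertices in matching order.
    Interpolation is linear, an assumption made for this sketch.
    """
    frames = []
    for k in range(1, n_between + 1):
        t = k / (n_between + 1)   # fraction of the way from a to b
        frames.append([
            (ax + t * (bx - ax), ay + t * (by - ay))
            for (ax, ay), (bx, by) in zip(viseme_a, viseme_b)
        ])
    return frames


# Two lip-corner vertices moving from a closed to an open mouth shape.
closed = [(0, 0), (4, 0)]
opened = [(0, 4), (4, 8)]
in_between = morph_frames(closed, opened, n_between=3)
```

Replacing the evenly spaced fraction t with a measured or learned motion path value is what would distinguish a driven morph from this plain interpolation.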
The applied reference visemes may or may not have their control vertices positions specially modified to represent more than the one facially expressive channel for the footage, as possibly detected within the original footage.
The present invention includes the unique provision for producing composite visemes on the fly. This is accomplished by means of using continuous dub speaker radar measurement, or applying actor speech facial motion tracking techniques, or using multi-channel character animation techniques, such as used in Pixels3D, to iteratively approximate, model and decompose original actor screen footage into usable components for creating the new actor facial animation. Facial expression is a complex complement of usually one or more channels of expressive influence. Character animation control environments, such as Pixels3D, provide the ability to combine an unlimited number of different channels of expressive extent (relative to no expression) and relative control vertices motion path transformations within an expression.
In one embodiment of the invention, radar is used as a mechanism to acquire the motion dynamics of speech as mouth and lip motions, in addition to inner vocal tract motion dynamics. The positions of the lips and the jaw of a dub audio track speaker are recorded using radar. The dynamics of motion of the dub speaker are scaled to match the dynamic range of motion of the screen actor lips and jaw. This is accomplished by means of having a reference set of image frame visemes for the screen actor that can be used as reference motion dynamic measures for various mouth and lip motions during speech. The actor visemes are scaled to the actor. The dub speech radar track measurements are absolute values that are referenced to a particular phoneme or phoneme-to-phoneme transition in time. The automatically recognized phonemes in the dub track are used to select the actor viseme corresponding to the phoneme. The actor viseme and all its associated control vertices are automatically indexed to a dub track radar measurement value. The subsequent actor viseme in the speech sequence is also indexed to the corresponding dub speech track radar measurement value. The difference between the dub speaker's radar absolute dimensional measurements for different speech phoneme utterances, and the actor's absolute dimensions is normalized, so that the radar track reading is scaled to match the dimensions of screen actor for the reference viseme dimensional equivalents. The radar measurement of the lips and the jaw in terms of extent and scale are modified from the dub track speaker to match the screen actor face and motion extents. The scaled radar measurement track is then used to directly articulate the morph path of all the associated control vertices between the actor visemes in sequence in the image track.
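The scaling of a dub speaker radar track to the screen actor's range between two indexed visemes can be sketched as a linear rescaling. The function names, the use of a single scalar reading such as lip opening, and the linear mapping itself are assumptions for illustration; the radar values and units are hypothetical.

```python
def normalize_radar_track(track, dub_start, dub_end, actor_start, actor_end):
    """Rescale dub speaker radar readings to the screen actor's dimensions.

    dub_start and dub_end are the dub speaker readings at the two indexed
    phonemes; actor_start and actor_end are the matching dimensions of the
    actor's reference visemes. A linear mapping is assumed.
    """
    span = dub_end - dub_start
    if span == 0:
        return [float(actor_start) for _ in track]
    gain = (actor_end - actor_start) / span
    return [actor_start + (r - dub_start) * gain for r in track]


def morph_progress(scaled_track, actor_start, actor_end):
    """Convert a scaled track into morph fractions between the two visemes."""
    span = actor_end - actor_start
    if span == 0:
        return [0.0 for _ in scaled_track]
    return [(v - actor_start) / span for v in scaled_track]


# Dub speaker lip opening goes 10 -> 20; the actor's visemes span 4 -> 8.
radar = [10, 15, 20]
scaled = normalize_radar_track(radar, 10, 20, 4, 8)
fractions = morph_progress(scaled, 4, 8)
```

The resulting fractions follow the dub speaker's measured timing rather than an evenly spaced schedule, which is how the scaled track can articulate the morph path of the control vertices between visemes.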
In one embodiment of the present invention, radar measurements from a dub speaker produce a screen actor normalized dynamic motion path for the lips or the jaw for the dub track, and this normalized radar motion track for the lips is automatically analyzed to identify deviations from the dub speaker's reference motion set for the same phonemes, which deviations may be additionally associated with emotional expression or actor facial quirks, or other idiosyncrasies. Each discrete phoneme and emotional expression and idiosyncratic facial motion is recorded and stored into a database of face shapes, each of which can have an applied amplitude ranging from zero, representing no application, up to 1, representing the maximum dimensional extent of the control vertices as a set for that discrete expression or mouth shape.
In one embodiment of the invention, radar captured lip motion measurements are stored as motion path information that is used to control the relative motion paths of the control vertices of the morph paths for the selected viseme morphing to be applied. After scaling normalization between dub speaker and screen actor mouth shape and motion extents, any remaining discrepancies between reference viseme control vertices and actual motion paths are automatically incorporated and alter the reference viseme to an offset control vertices set for that viseme application. The automatically incorporated viseme offsets to control vertices may contain emotional or other non-speech expressive content and shape. The degree of viseme offset may be given a separate channel control in a multi-channel mixer approach to animation control. Thus, any radar measurement motion tracking of the lip position may be separated into discrete component channels of shape expression.
For example, each phoneme, each vowel and each emotion and each idiosyncratic expression can have a separate channel. At any moment in time, one or more channels are being linearly mixed to produce a composite motion path of the control vertices and thus a composite facial expression and lip position. In this embodiment, radar measurements are used to control the dynamic morph transformations of the images and to generate a real-world composite measurement of multiple influence, including speech, emotion and other expression. The animation modeling tool is then applied to analyze the real-world radar captured motion paths of facial expression and lip motion and, knowing the speech track, effects a decomposition of the motion track into multiple facial animation mixer channels. This is analogous to decomposition of a complex audio signal into its component sine waves, such as is accomplished using Fourier series analysis for example. The algorithm for identifying the discrete components of expression that combine into a composite expression is to iteratively subtract known and expected channel components from the composite shape amplitude and, by process of approximation and reduction, identify all the probable discrete sources of expressive shape and extent.
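The linear mixing of channels and the iterative subtraction of known and expected components can be sketched as follows. This is a simplified sketch under stated assumptions: each channel is a flat list of control vertex offsets at amplitude 1, the channel expected from the speech track is subtracted first, and each remaining candidate amplitude is estimated by projecting the residual onto that channel's shape, which is exact only when the candidate shapes do not overlap. The channel names and values are hypothetical.

```python
def mix(channels, amplitudes):
    """Linearly mix channel shapes, each weighted by its amplitude."""
    size = len(next(iter(channels.values())))
    composite = [0.0] * size
    for name, amp in amplitudes.items():
        for i, v in enumerate(channels[name]):
            composite[i] += amp * v
    return composite


def decompose(composite, channels, known, candidates):
    """Subtract known channels, then estimate candidate amplitudes in turn."""
    residual = list(composite)
    for name, amp in known.items():
        residual = [r - amp * v for r, v in zip(residual, channels[name])]
    found = {}
    for name in candidates:
        shape = channels[name]
        energy = sum(v * v for v in shape)
        if energy == 0:
            continue
        amp = sum(r * v for r, v in zip(residual, shape)) / energy
        amp = min(1.0, max(0.0, amp))   # amplitudes are limited to 0..1
        found[name] = amp
        residual = [r - amp * v for r, v in zip(residual, shape)]
    return found, residual


channels = {
    "AA": [1.0, 0.0, 0.0, 0.0],      # phoneme channel, known from speech track
    "smile": [0.0, 2.0, 0.0, 0.0],   # emotion channel
    "quirk": [0.0, 0.0, 0.0, 4.0],   # idiosyncratic expression channel
}
measured = mix(channels, {"AA": 1.0, "smile": 0.5, "quirk": 0.25})
amplitudes, leftover = decompose(measured, channels, {"AA": 1.0},
                                 ["smile", "quirk"])
```

Any residual left after all expected channels are subtracted corresponds to motion not explained by the stored face shapes, and could be treated as a viseme offset.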
In another embodiment of the invention the actual speech of the dub actor track is modified using vocal tract modeling or other conventional techniques such as Prosoniq's Time Factory, to effectively morph the audio tonal and pitch characteristics of the dub actor's voice to match the screen actor voice.