Lip synchronization (lip sync) is a technical term for matching a character's lip movements with recorded speech. Most people have an intuitive sense of how a mouth should move during speech, for example in an animated film, and they readily recognize bad lip sync when they see it. Viewers therefore expect a realistic level of lip sync.
There are many devices and approaches for generating lip synchronization to new audio tracks. Most of them rely on computer automation using speech recognition and/or visual capture techniques. Such systems work for an actor where no perspective shift occurs, meaning that the camera does not move around the actor and the actor does not turn his or her head. These approaches, however, do not work once a perspective shift occurs, and perspective shifts occur throughout any movie. Shots with a perspective shift (camera or head movement) require 3D match moving, creation of new 3D lip geometry, skin texturing, etc., and then compositing of the newly created computer-generated imagery (CGI) lips.
In one case, the system assigns a numeric value to the % Mouth Open of the original footage and then stretches or shrinks the mouth to the new position to fit the new dialogue. The crux of this approach is an algorithm that, for each pixel of the original source footage related to the mouth, determines the intensity of the pixel; then, based on newly recorded video of an actor delivering the new dialogue, determines the intensity of the corresponding pixel; and finally replaces the intensity of the original pixel with that of the new pixel. The description of this approach does not explain how pixels from the original actor are correlated with pixels from an entirely different actor who has an entirely different anatomical head shape, a different head position relative to the camera, and different skin and lip color.
The new actor is shot with a different camera, which causes differences in resolution, and with a different lens, which causes a different amount of lens warping and pincushion distortion, leading to pixel inaccuracies. Variation in mouth size and shape, as well as in lip size, between the original actor and the new voice over actor would make it difficult for an automated approach to tell whether pixels differ because of positioning for different dialogue or simply because of the anatomical mismatch between the actors. This could make the approach unreliable.
To better illustrate this point, two actors speaking the exact same dialogue in the same language would differ in vocal inflection, and thus in the amount of mouth opening and pursing, in addition to the differences caused by their anatomy. Compound this with two actors speaking different languages, and the results of trying to determine pixel correlations between the original actor and the voice over actor would be unpredictable.
Another important factor is that the voice over actor would be photographed in an entirely different environment, with entirely different lighting conditions and shadows affecting the voice over actor's face when compared to the original actor; differences in pixel intensities could result from this alone. The new actor is said to be videotaped, so there is no resolution match and no aspect ratio match. All of the factors noted above could cause pixels to differ for reasons other than mouth position, which could lead to inaccurate pixel intensity choices.
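The lighting objection can be made concrete with a minimal sketch. It is not the algorithm of the system discussed above, whose details are not given here; it is a hypothetical per-pixel intensity replacement over a tiny grayscale patch, written in plain Python. The names replace_mouth_pixels, count_changed, mouth_mask and the 25% brightness gain are all assumptions chosen for illustration. The second take holds the same mouth pose as the original and differs only in lighting, yet every mouth pixel is still overwritten with a different value.

```python
def replace_mouth_pixels(original, new_take, mouth_mask):
    """Copy `original`, taking the intensity from `new_take` wherever the mask is True."""
    return [
        [new if in_mouth else old
         for old, new, in_mouth in zip(row_o, row_n, row_m)]
        for row_o, row_n, row_m in zip(original, new_take, mouth_mask)
    ]


def count_changed(before, after):
    """Number of pixel positions whose intensity differs between two images."""
    return sum(
        b != a
        for row_b, row_a in zip(before, after)
        for b, a in zip(row_b, row_a)
    )


# Hypothetical 3x4 grayscale patch (0-255) around the original actor's mouth.
original = [
    [120, 118, 119, 121],
    [115,  40,  42, 116],
    [119, 117, 118, 120],
]

# Hypothetical mask marking which pixels are "related to the mouth".
mouth_mask = [
    [False, False, False, False],
    [False, True,  True,  False],
    [False, False, False, False],
]

# Assumption: identical mouth pose, but the new take is lit 25% brighter.
brighter_take = [[min(255, int(p * 1.25)) for p in row] for row in original]

result = replace_mouth_pixels(original, brighter_take, mouth_mask)
changed = count_changed(original, result)
print(changed)  # both mouth pixels change although the mouth did not move
```

In this toy case the replacement cannot distinguish a lighting change from a change in mouth position, which is the ambiguity described above.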
Another technique uses the built-in MPEG-4 facial tracking features. Lip objects of the original actor are tracked, the lip objects of the voice over actor delivering the new audio are tracked, and then the voice over actor's lip objects are added to the original actor and displayed. However, it is unclear how the voice over lip objects are blended onto the geometry of the original actor so as to be smooth and seamless. It is also unclear how facial expressions (cheek positions, facial wrinkles, smiles, etc.) that occur in the original dialogue are made to fit the new dialogue. It is further unclear how this technique handles skin texture generation, lighting, and shadowing for the new skin that must be rendered because of the new lip object. If the new lip objects from the new video are simply composited onto the original actor, they cannot replace the original lip objects perfectly, since the new actor was not photographed under identical lighting conditions, in the same relationship to the camera, or with the same camera and lens package.
In “MPEG-4 compliant tracking of facial features in video sequences” (http://www-artemis.it-sudparis.eu/Publications/library/ICAV3D01-malciu.pdf), Marius Malciu and Françoise Prêteux explain the difficulties related to “face and facial expression recognition and model-based facial image coding. Though intuitive for biological vision systems, locating faces and facial components in video sequences remains today a challenging and widely open issue in computer vision. The main difficulties encountered refer to the complexity and high variability of face morphology and head motion, and the lack of universal assumptions on the scene structure, which often involves arbitrary and complex background together with unknown and variable lighting conditions”.
Another technique includes compression of talking head video and animation of synthetic faces. Using synthetic faces to replace the original actor's face raises a whole host of problems, including modeling to fit the original actor, 3D matching of movements to the original actor's movements, texturing, lighting, and shadowing in a 3D rendering package, and compositing and blending onto the original actor. These problems are not addressed by the prior art.
In another technique, shape vectors of each frame are warped to a common standard frame, thereby generating an aligned shape vector and a transformed image for each frame. The problem with this technique is that, when the standard frame differs enough from the actual frame, pixels are warped over a large distance and look unnatural and unrealistic.
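The warp-distance concern can be illustrated with a short sketch. It assumes, purely for illustration, that a shape vector is a list of (x, y) mouth landmark coordinates and that aligning a frame moves each landmark onto the matching landmark of the standard frame; the function name landmark_displacements and all coordinates are hypothetical, and no actual image warping is performed.

```python
import math


def landmark_displacements(shape, standard_shape):
    """Distance each landmark must travel to reach its position in the standard frame."""
    return [
        math.hypot(sx - x, sy - y)
        for (x, y), (sx, sy) in zip(shape, standard_shape)
    ]


# Hypothetical mouth landmarks: left corner, right corner, upper lip, lower lip.
standard   = [(40.0, 60.0), (60.0, 60.0), (50.0, 55.0), (50.0, 65.0)]
near_frame = [(41.0, 60.0), (59.0, 60.0), (50.0, 54.0), (50.0, 66.0)]
far_frame  = [(30.0, 60.0), (70.0, 60.0), (50.0, 40.0), (50.0, 80.0)]  # wide-open mouth

max_near = max(landmark_displacements(near_frame, standard))
max_far = max(landmark_displacements(far_frame, standard))
print(max_near, max_far)
```

Pixels near a landmark are displaced by roughly the same amount as the landmark itself, so a frame like far_frame forces much larger pixel movement than near_frame, which is where the unnatural appearance would be expected to arise.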
In another technique, vicinity searching, three-dimensional head modeling of the original speaker, and texture mapping are used to produce new images that correspond to the dubbed sound track. A generic three-dimensional head model is fitted to the actor's two-dimensional head picture. The fitted three-dimensional head model parameters may be tracked throughout the movie, from one frame to its successor, iteratively in an automated computerized way, creating a library of reference similarity frames.
However, such approaches require, among other things, a three-dimensional head model for every speaking actor in a movie, match moving of the camera movement and of every actor's head motion throughout the movie, and texture mapping and lighting of every actor to fit every varied scene of the movie, which requires thousands of lighting setups. The amount of time, effort and expense needed to carry out this approach is prohibitive.
In another technique, a database is used to obtain images for phonemes, and morphing techniques are used to create transitions between the images. Different parts of the images can be processed in different ways to produce a more realistic speech pattern.
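A minimal sketch of the phoneme-database idea follows. It uses a linear cross-dissolve as a simplified stand-in for morphing (a full morph would also warp the geometry of the images), and the database contents, the phoneme labels "M" and "AA", and the function names are hypothetical. Images are small grayscale grids stored as nested lists.

```python
def cross_dissolve(img_a, img_b, t):
    """Blend two same-sized grayscale images; t=0 gives img_a, t=1 gives img_b."""
    return [
        [(1 - t) * a + t * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]


def transition_frames(database, phoneme_from, phoneme_to, steps):
    """Look up the two phoneme images and return `steps` frames blending between them."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    img_a = database[phoneme_from]
    img_b = database[phoneme_to]
    return [cross_dissolve(img_a, img_b, i / (steps - 1)) for i in range(steps)]


# Hypothetical database mapping phoneme labels to 2x2 mouth images.
database = {
    "M":  [[10.0, 10.0], [10.0, 10.0]],   # closed mouth
    "AA": [[10.0, 90.0], [90.0, 10.0]],   # open mouth
}

frames = transition_frames(database, "M", "AA", 3)
print(frames[1])  # midpoint frame between the two phoneme images
```

Processing different parts of the image in different ways, as mentioned above, could be approximated in such a sketch by applying a different blend weight per region, although that refinement is not shown here.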
Considering all the techniques that address this subject matter, to date not a single live action movie has been released by any of these other sources in which the mouth positions of the actors were actually modified to fit the new dialogue. Therefore, there is still a need for dubbing technology that can cost-effectively and efficiently present a foreign film in a native language by giving the appearance that a foreign actor is speaking in the native language.