There are various situations in which it is desirable to have a video recording of a speaking person accurately track words other than those uttered during the original recording of the video image. One such application is the field of audio dubbing, in which the originally recorded soundtrack is replaced with a different soundtrack. In a simple case, after recording an image of an actor speaking a statement, it may be desirable to re-record the statement, for example to change emphasis or provide a different accent. Rather than recording the entire video sequence again, the redubbing process permits the actor to repeat the statement, with the desired modifications, and substitute the repeated statement for the originally recorded one.
In a more sophisticated video production, it may be desirable to utilize stock footage of an actor and replace the actor's spoken words with an entirely different speech soundtrack, perhaps in a different voice. For example, the original statement might be presented in a different language, or various special effects can be created, such as a child giving a speech in the original voice of a famous statesman.
In these types of applications, the original recorded image must be modified so that the speaker's lip movements are synchronized to the new soundtrack. In the past, the methods for achieving such synchronization have required extensive manual input and/or specialized processing that limited their applicability. One example of a prior art approach, which is based on image alignment, is described in U.S. Pat. No. 4,827,532. That patent is particularly directed to the replacement of a soundtrack in one language with a new soundtrack in a second language, which requires different lip movements for the speaker. In the technique disclosed in the '532 patent, a video recording is made of a new actor speaking the statements in the new language. Special markers are required to define the outlines of the actor's lips in the newly recorded image of the actor speaking in the new language, and the original video must be manually marked. Once the corresponding portions of the old and new video images have been identified, pixels of the original movie frame are modified to make them look like the original actor spoke the words of the new soundtrack.
The procedure disclosed in the '532 patent involves two types of video modification. First, the video sequence is temporally warped, in an effort to align the frames of the original image with the new sounds, so that the lip shapes match one another. Thereafter, visual warping, e.g., morphing, of the image is carried out to transition between non-continuous portions of the video that may result from skipping frames.
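The temporal warping described above can be framed as a sequence-alignment problem: each frame of the new recording is matched to the original frame whose lip shape is most similar, under the constraint that the mapping never runs backward in time. The sketch below illustrates that idea with classic dynamic time warping over a hypothetical per-frame "mouth openness" feature; it is an illustration of the general concept only, not the actual method of the '532 patent, and the feature values and function names are assumptions.

```python
def dtw_frame_alignment(orig, new):
    """Return a monotonic mapping from new-track frames to original frames.

    orig, new: lists of scalar per-frame features (hypothetical
    mouth-openness measurements). Standard O(n*m) dynamic time warping.
    """
    n, m = len(orig), len(new)
    INF = float("inf")
    # cost[i][j] = minimum cumulative cost of aligning orig[:i+1] with new[:j+1]
    cost = [[INF] * m for _ in range(n)]
    cost[0][0] = abs(orig[0] - new[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            d = abs(orig[i] - new[j])
            best = min(
                cost[i - 1][j] if i > 0 else INF,                 # skip an original frame
                cost[i][j - 1] if j > 0 else INF,                 # hold an original frame
                cost[i - 1][j - 1] if i > 0 and j > 0 else INF,   # advance both
            )
            cost[i][j] = d + best

    # Backtrack to recover which original frame to display for each new frame.
    path = []
    i, j = n - 1, m - 1
    path.append((j, i))
    while i > 0 or j > 0:
        moves = []
        if i > 0 and j > 0:
            moves.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            moves.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            moves.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(moves)
        path.append((j, i))
    path.reverse()

    # For each new-soundtrack frame, keep the last original frame matched to it.
    mapping = {}
    for j, i in path:
        mapping[j] = i
    return [mapping[j] for j in range(m)]
```

The second step of the procedure, visual morphing, would then smooth over any discontinuities this mapping introduces, e.g., where the same original frame is held or a frame is skipped.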
Both of these warping steps require a significant amount of manual input. As a result, lengthy video sequences with language dubbing are not easily produced. Furthermore, it is not possible in all cases to temporally warp a video sequence in such a way that the new lip shapes match the original shapes. For example, the image for a closed-lip sound cannot be warped into one for an open-lip sound, because the teeth and/or tongue would be missing. A similar problem occurs for sounds which are produced with different lip protrusions. Thus, the types of changes which can be effected are limited. In addition, the new soundtrack requires a second video recording, so that the two recorded sequences can be visually aligned. As such, the procedure cannot be used with any arbitrary utterance as the new soundtrack. Rather, only soundtracks which have accompanying video images can be employed.
Other approaches have been used in the field of animation, so that a character's mouth accurately tracks spoken words. However, the images that are used in these approaches are synthetic, and their associated synchronization techniques are not suited for use with video images of a natural person's face.
Accordingly, it is desirable to provide a technique which permits any given sound utterance to be substituted for the soundtrack of a previously recorded video sequence, without requiring a video recording of the new sounds being uttered. It is further desirable to provide such a method which readily lends itself to automation, to thereby minimize the amount of manual input that is required.