Portions of this application are contained on compact disc(s) the contents of which are entirely incorporated herein by reference. The compact discs are labeled as Copy 1, and Copy 2, respectively. The compact discs are identical and each includes the following ASCII files:
The present invention relates to a method for automatic audio visual dubbing. More specifically, the invention relates to an efficient computerized automatic method for audio visual dubbing of movies by computerized image copying of the characteristic features of the lip movements of the dubber onto the mouth area of the original speaker. The present invention uses vicinity-searching, three-dimensional head modeling of the original speaker, and texture mapping techniques to produce the new images which correspond to the dubbed sound track.
The invention overcomes the well-known disadvantage of poor correlation between the lip movements in the original movie and the sound track of the dubbed movie.
Definitions of the key terms employed in this specification are provided first.
Actor (the original actor): an actor, speaker, singer, animated character, animal, an object in a movie, or a subject in a still photograph.
Audio Visual Dubbing: Manipulating, in one or more frames, the mouth area of the actor so that its status is as similar as possible to that of the dubber in the reference frame.
Correlation Function: A function describing the similarity of two image regions. The higher the correlation, the better is the match.
Dubber: The person or persons, who speak/narrate/sing/interpret the target text. The dubber can be the same as the actor.
Dubbing: Replacing part or all of one or more of the original sound tracks of a movie, with its original text or sounds (including the case of the silent track of a still photograph), by another sound track containing the target text and/or sound.
Edge Detector: A known image processing technique used to extract boundaries between image regions which differ in intensity and/or color.
Face Parametrization: A method that numerically describes the structure, location, and expression of the face.
Head Model: A three-dimensional wire frame model of the face that is controlled by numerous parameters that describe the exact expression produced by the model (i.e. smile, mouth width, jaw opening, etc.).
Movie (the original movie): Any motion picture (e.g. cinematic feature film, advertisement, video, animated cartoon, still video picture, etc.). A sequence of consecutive pictures (also called frames) photographed in succession by a camera or created by an animator. In the case of the movie being a still photograph, all of the consecutive pictures are identical to each other. When shown in rapid succession an illusion of natural motion is obtained, except for the case of still pictures. A sound-track is associated with most movies, which contains speech, music, and/or sounds, and which is synchronized with the pictures, and in particular where the speech is synchronized with the lip movements of the actors in the pictures. Movies are realized in several techniques. Common methods are: (a) recording on film, (b) recording in analog electronic form ("video"), (c) recording in digital electronic form, (d) recording on chips, magnetic tape, magnetic disks, or optical disks, and (e) read/write by magnetic and/or optical laser devices. Finally, in our context, an "original movie" is also an audio-visual movie altered by the present invention, which serves as a base for further alterations.
Original Text: A text spoken or sung by the actor when the movie is being made, and which is recorded on its sound track. The text may be narrated in the background without showing the speaker, or by showing a still photograph of the speaker.
Pixel: Picture element. A digital picture is composed of an array of points, called pixels. Each pixel encodes the numerical values of the intensity and of the color at the corresponding picture point.
Reference Similarity Frame: A picture (being a frame in the original movie, a frame in any other movie, or a still photograph) in which the original actor has the desired features of the mouth-shape and head posture suitable for the audio visually dubbed movie.
Target Text: A new vocal text, to replace the original vocal text of the actor. The target text may also be that which is to be assigned to an actor who was silent in the original movie. The new text can be in another language, in which case the replacement is referred to as DUBBING. However, this invention also relates to replacement of text without changing the language, with the original actor or with a dubber in that same language. The target text may have the same meaning as the original text, but may also have a modified, opposite, or completely different meaning. According to one of many applications of the present invention, the latter is employed for creation of new movies with the same actor, without his/her/its active participation. Also included is new vocal text used to replace the null vocal text attached to one or more still photographs.
Texture Mapping: A well known technique in computer graphics which maps texture onto a three-dimensional wire frame model.
Two-Dimensional Projection: The result of the rendering of the three-dimensional face model onto a two-dimensional device like a monitor, a screen, or photographic film.
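By way of non-limiting illustration of the Correlation Function defined above, the following Python sketch computes a normalized cross-correlation between two equally sized gray-level image regions. The specification does not state which correlation function is used; normalized cross-correlation is assumed here only as one common choice, and the function name `normalized_correlation` and the list-of-rows image representation are hypothetical.

```python
from math import sqrt


def normalized_correlation(region_a, region_b):
    """Return a similarity score in [-1, 1] for two equally sized regions.

    Each region is a list of rows of gray-level pixel values (an assumed
    representation). A higher score means a better match.
    """
    a = [p for row in region_a for p in row]
    b = [p for row in region_b for p in row]
    if not a or len(a) != len(b):
        raise ValueError("regions must be non-empty and of equal size")

    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)

    # Covariance-like numerator and the product of the two deviations norms.
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = sqrt(sum((x - mean_a) ** 2 for x in a) *
               sum((y - mean_b) ** 2 for y in b))

    # A perfectly flat region carries no structure; treat it as "no match".
    return num / den if den else 0.0
```

Because the means are subtracted and the result is normalized, this particular score is insensitive to uniform changes in brightness and contrast between the two regions.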
Movies are often played to an audience that is not familiar with the original language, and thus cannot understand the sound track of such movies. Two well-known approaches exist to solve this problem. In one approach, subtitles in typed text of the desired language are added to the pictures, and the viewers are expected to hear the text in a foreign language and simultaneously read its translation on the picture itself. Such reading distracts the viewers from the pictures and from the movie in general. Another approach is dubbing, where the original sound track with the original text is replaced by another sound track in the desired language. In this case there is a disturbing mismatch between the sound track and the movements of the mouth.
There have been some earlier attempts to overcome these disadvantages, none of which have been commercialized because of inherent difficulties of principle which made practical execution unrealistic. Thus, U.S. Pat. No. 4,600,281 describes a method in which the shape of the mouth is measured manually with a ruler or with a cursor, and the mouth shape is corrected by moving pixels within each frame. As will be seen in the description of the invention, the method according to the present invention is inherently different and superior in the following respects: In the present invention the tracking of the shape of the mouth is done automatically and not manually. In the present invention changing the shape of the mouth is done by using a three-dimensional head model, for example like those described by P. Ekman and W. V. Friesen (Manual for the Facial Action Unit system, Consulting Psychologists Press, Palo Alto 1977). In the present invention the mouth area of the actor is replaced using the mouth area of a reference similarity frame. In the present invention mouth status parameters of the dubber are substituted for mouth status parameters of the actor.
U.S. Pat. No. 4,260,229 relates to a method of graphically creating lip images. That patent is totally different from the present invention: in that patent, speech sounds are analyzed and digitally encoded, whereas in the present invention no sound analysis is performed, nor is any required.
To make for better viewing of the audio visually dubbed movie, the present invention provides a computerized method wherein, in addition to replacing the sound track with one carrying the target text, the mouth movements of the actor are automatically changed to match the target text. The new mouth movements are linguistically accurate and visually natural looking according to all of the observable parameters of the actor's face.
The present invention provides a method for automated computerized audio visual dubbing of movies, comprising the following steps (see FIG. 1):
(a) selecting from the movie a frame having a picture, preferably frontal, of the actor's head and, if available, a frame with its side profile;
(b) marking on the face several significant feature points and measuring their locations in the frame;
(c) fitting a generic three-dimensional head model to the actor's two-dimensional head picture by adapting the data of the significant feature points, as measured in stage (b), to their location in the model;
(d) tracking the parameters of the said fitted three-dimensional head model throughout the movie, from one frame to its successor, iteratively in an automated computerized way, and creating a library of reference similarity frames;
(e) taking a movie of a dubber wherein the dubber speaks the target text;
(f) repeating stages (a), (b), (c), and (d) with the dubber;
(g) normalizing the dubber's minimum and maximum values of each parameter to the actor's minimum and maximum values of the same parameters;
(h) mapping, on a frame to frame basis, the two-dimensional actor's face onto its three-dimensional head model by using a texture mapping technique, making use of reference similarity frames;
(i) changing the texture mapped three-dimensional model obtained in stage (h) by replacing, on a frame to frame basis, the original mouth parameters with the mouth parameters as computed in stage (d) for the dubber and obtaining the parametric description for the new picture, with identical values to the original, except that the actor's mouth status resembles the mouth status of the dubber;
(j) texture mapping the lips area of the same actor from a frame in the movie, with identical or very similar mouth status to the desired new mouth status, onto the lips area of the actor's head model for the current frame and then projecting the lips area from the actor's head model onto the current new frame. (This stage is optional, depending on the application.)
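Stages (g) and (i) above may be illustrated, without limitation, by the following Python sketch. It assumes that each frame's head model is described by a dictionary of named numeric parameters and that normalization is a linear rescaling of the dubber's range onto the actor's range; neither assumption is stated in the specification. The parameter names (`jaw_opening`, `mouth_width`, `head_yaw`) and the function names are hypothetical.

```python
# Hypothetical names of the parameters that describe the mouth status.
MOUTH_PARAMS = ("jaw_opening", "mouth_width")


def normalize_to_actor(value, dubber_range, actor_range):
    """Stage (g), assumed linear: map a dubber value into the actor's range."""
    d_min, d_max = dubber_range
    a_min, a_max = actor_range
    if d_max == d_min:
        # The dubber never varied this parameter; fall back to mid-range.
        return (a_min + a_max) / 2.0
    t = (value - d_min) / (d_max - d_min)
    return a_min + t * (a_max - a_min)


def dub_frame(actor_params, dubber_params, dubber_ranges, actor_ranges):
    """Stage (i): keep the actor's parameters, replace only the mouth ones."""
    new_params = dict(actor_params)  # head pose etc. stay identical
    for name in MOUTH_PARAMS:
        new_params[name] = normalize_to_actor(
            dubber_params[name], dubber_ranges[name], actor_ranges[name]
        )
    return new_params
```

In such a sketch the ranges would be the per-parameter minimum and maximum values observed while tracking the actor and the dubber through their respective movies.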
By using the three-dimensional head model one can control the audio visual dubbing process even if the actor is moving his head. In most applications about 15 significant feature points on the face are used in the tracking stage, such as eye corners, mouth corners, and the nostrils. Only those feature points which are visible to the viewer (using the information available to the model) are tracked.
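The tracking of a single feature point from one frame to its successor may be sketched, purely as an illustration, as a vicinity search: the small image patch around the point in the previous frame is compared with every candidate patch within a limited radius in the next frame, and the best-correlated candidate is taken as the new location. The sketch below is in Python; the patch size, the search radius, the gray-level list-of-rows frame representation, the use of normalized cross-correlation, and all function names are assumptions and are not taken from the specification.

```python
from math import sqrt


def patch(frame, top, left, size):
    """Cut a size-by-size patch whose upper-left corner is (top, left)."""
    return [frame[r][left:left + size] for r in range(top, top + size)]


def correlation(patch_a, patch_b):
    """Normalized cross-correlation of two equally sized patches."""
    a = [p for row in patch_a for p in row]
    b = [p for row in patch_b for p in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = sqrt(sum((x - mean_a) ** 2 for x in a) *
               sum((y - mean_b) ** 2 for y in b))
    return num / den if den else 0.0


def track_feature(prev_frame, next_frame, top, left, size=3, radius=2):
    """Search the vicinity of (top, left) in next_frame for the best match.

    Returns ((new_top, new_left), score).
    """
    template = patch(prev_frame, top, left, size)
    rows, cols = len(next_frame), len(next_frame[0])
    best_score, best_pos = -2.0, (top, left)
    # Clamp the search window so every candidate patch lies inside the frame.
    for r in range(max(0, top - radius), min(rows - size, top + radius) + 1):
        for c in range(max(0, left - radius), min(cols - size, left + radius) + 1):
            score = correlation(template, patch(next_frame, r, c, size))
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score
```

Restricting the search to a small radius keeps the cost per feature point low and relies on the assumption that a feature moves only slightly between consecutive frames.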
In the present invention, audio visual dubbing is normally used in conjunction with the use of audio dubbing; but one may also use it in conjunction with an audio track where no equivalent track exists in the original movie.
The method according to the present invention is useful for the audio visual dubbing of motion pictures such as cinematic feature films, advertisements, video, and animated cartoons. Also the audio visual dubbing of still photographs, wherein all of the frames of the movie are the same, is made possible by the present invention. For instance, still photographs are used for this type of movie in T.V. news programs where the reporter's voice is heard while a still photograph of him/her is shown.
Thus, according to the present invention even speechless actors, infants, animals, and inanimate objects can be audio visually dubbed to speak in any language.
When applied to animation, the present invention saves much of the labor associated with animating the mouth area.
The present invention further provides a computer program (see Appendix 1) for operating the computerized audio visual dubbing.
The present invention further relates to the library of reference similarity frames as created in step (d) above.
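One possible way of consulting such a library, given here only as an illustrative Python sketch, is to store the tracked head model parameters of each frame under its frame index and to retrieve the frame whose mouth parameters are nearest to the desired mouth status. The dictionary layout, the squared-difference distance, the parameter names, and the function names are all assumptions; a fuller version would presumably also compare head posture, which the definition of a reference similarity frame mentions.

```python
def mouth_distance(params_a, params_b, names):
    """Sum of squared differences over the chosen parameter names."""
    return sum((params_a[n] - params_b[n]) ** 2 for n in names)


def find_reference_frame(library, desired,
                         names=("jaw_opening", "mouth_width")):
    """Return the index of the library frame closest to the desired status.

    library: dict mapping frame index -> dict of tracked parameters
             (an assumed layout for the library of reference similarity
             frames).
    desired: dict holding the desired mouth parameters for the new frame.
    """
    if not library:
        raise ValueError("the reference similarity library is empty")
    return min(library,
               key=lambda idx: mouth_distance(library[idx], desired, names))
```

The frame returned by such a lookup would then supply the lips-area texture used in stage (j).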