1. Field of the Invention
The invention relates to the creation, manipulation, transmission, storage, etc. and especially synchronization of multi-media entertainment, educational and other programming having at least video and associated information.
2. Background Art
The creation, manipulation, transmission, storage, etc. of multi-media entertainment, educational and other programming having at least video and associated information requires synchronization. Typical examples of such programming are television and movie programs. Often these programs include a visual or video portion, an audible or audio portion, and may also include one or more various data type portions. Typical data type portions include closed captioning, narrative descriptions for the blind, additional program information data such as web sites and further information directives and various metadata included in compressed (such as for example MPEG and JPEG) systems.
Often the video and associated signal programs are produced, operated on, stored or conveyed in a manner such that the synchronization of various ones of the aforementioned audio, video and/or data is affected. For example, the synchronization of audio and video, commonly known as lip sync, may be askew when the program is produced. If the program is produced with correct lip sync, that timing may be upset by subsequent operations, such as processing, storing or transmission of the program.
One aspect of multi-media programming is maintaining audio and video synchronization in audio-visual presentations, such as television programs, for example to prevent annoyances to the viewers, to facilitate further operations with the program or to facilitate analysis of the program. Various approaches to this challenge are described in the following commonly assigned, issued patents: U.S. Pat. No. 4,313,135; U.S. Pat. No. 4,665,431; U.S. Pat. No. 4,703,355; U.S. Pat. Re. 33,535; U.S. Pat. No. 5,202,761; U.S. Pat. No. 5,530,483; U.S. Pat. No. 5,550,594; U.S. Pat. No. 5,572,261; U.S. Pat. No. 5,675,388; U.S. Pat. No. 5,751,368; U.S. Pat. No. 5,920,842; U.S. Pat. No. 5,946,049; U.S. Pat. No. 6,098,046; U.S. Pat. No. 6,141,057; U.S. Pat. No. 6,330,033; U.S. Pat. No. 6,351,281; U.S. Pat. No. 6,392,707; U.S. Pat. No. 6,421,636 and U.S. Pat. No. 6,469,741. Generally these patents deal with detecting, maintaining and correcting lip sync and other types of video and related signal synchronization.
U.S. Pat. No. 5,572,261 describes the use of actual mouth images in the video signal to predict what syllables are being spoken and compare that information to sounds in the associated audio signal to measure the relative synchronization. Unfortunately when there are no images of the mouth, there is no ability to determine which syllables are being spoken.
As another example, in systems where it is desired to measure the relation between audio and video portions of programs, an audio signal may correspond to one or more of a plurality of video signals, and it is desired to determine which. For example, in a television studio where each of three speakers wears a microphone and each speaker has a corresponding camera which takes images of that speaker, it is desirable to correlate the audio programming to the video signals from the cameras. One use of such correlation is to automatically select (for transmission or recording) the camera which televises the speaker who is currently speaking. As another example, when a particular camera is selected it is useful to select the audio corresponding to that video signal. In yet another example, it is useful to inspect an output video signal and determine which of a group of video signals it corresponds to, thereby facilitating automatic selection or timing of the corresponding audio. These types of systems are described in commonly assigned U.S. Pat. Nos. 5,530,483 and 5,751,368.
The above patents are incorporated in their entirety herein by reference in respect to the prior art teachings they contain.
Generally, with the exception of U.S. Pat. Nos. 5,572,261, 5,530,483 and 5,751,368, the above patents describe operations without any inspection of or response to the video signal images. Consequently the applicability of the descriptions of the patents is limited to particular systems where various video timing information, etc. is utilized. U.S. Pat. Nos. 5,530,483 and 5,751,368 deal with measuring video delays and identifying video signals by inspection of the images carried in the video signal, but do not make any comparison or other inspection of video and audio signals. U.S. Pat. No. 5,572,261 teaches the use of actual mouth images in the video signal and sounds in the associated audio signal to measure the relative synchronization. It describes a mode of operation in which the occurrence of mouth sounds is detected in both the lips and the audio. For example, when the lips take on a position used to make a sound like an E and an E is present in the audio, the time relation between the occurrence of these two events is used as a measure of the relative delay therebetween. U.S. Pat. No. 5,572,261 thus relies on a common attribute, such as particular sounds made by the lips, which can be detected in both audio and video signals. The detection and correlation of the visual positioning of the lips corresponding to certain sounds and the audible presence of the corresponding sound is computationally intensive, leading to high cost and complexity.
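The idea of using the time relation between matching events in the video and in the audio as a measure of relative delay can be illustrated with a minimal sketch. The sketch assumes that event detection has already been performed and that each signal has been reduced to a per-frame indicator (1 where the event occurs, 0 elsewhere); the function name `estimate_delay`, the parameter `max_lag` and the sample data are hypothetical and are not taken from any of the cited patents. It does not attempt the lip-shape or phoneme detection itself, which is the computationally intensive part.

```python
def estimate_delay(video_events, audio_events, max_lag):
    """Return the lag (in frames) at which the audio events best line up
    with the video events. A positive lag means the audio trails the video.

    Both inputs are equal-rate sequences of 0/1 event indicators
    (an assumption made for this illustration).
    """
    n = min(len(video_events), len(audio_events))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Cross-correlation score: count events that coincide at this lag.
        score = 0
        for t in range(n):
            u = t + lag
            if 0 <= u < n:
                score += video_events[t] * audio_events[u]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Hypothetical data: the same three events, with the audio two frames late.
video = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
audio = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
print(estimate_delay(video, audio, max_lag=4))
```

The exhaustive search over lags is chosen only for readability; its cost grows with the product of the signal length and the lag range.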
In a paper by J. Hershey and J. R. Movellan (“Audio-Vision: Locating sounds via audio-visual synchrony,” Advances in Neural Information Processing Systems 12, edited by S. A. Solla, T. K. Leen, K-R Muller, MIT Press, Cambridge, Mass., (c) 2000), it was recognized that sounds could be used to identify corresponding individual pixels in the video image. The correlation between the audio signal and individual ones of the pixels in the image was used to create movies that show the regions of the video that have high correlation with the audio. From the correlation data, the authors estimated the centroid of image activity and used it to find the talking face. Hershey et al. described the ability to identify which of two speakers in a television image was speaking by correlating the sound and different parts of the face to detect synchronization. Hershey et al. noted, in particular, that “[i]t is interesting that the synchrony is shared by some parts, such as the eyes, that do not directly contribute to the sound, but contribute to the communication nonetheless.” There was no suggestion by Hershey and Movellan that their algorithms could measure synchronization or perform any of the other features of the present invention.
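The per-pixel correlation and centroid estimate described above can be sketched as follows. This is an illustration only: it uses a plain Pearson correlation between each pixel's intensity over time and a per-frame audio energy value as a stand-in for the synchrony measure, and it is not asserted to be the measure used in the paper. The names `pearson`, `synchrony_map` and `centroid`, and the tiny 2x3 test frames, are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def synchrony_map(frames, audio_energy):
    """For each pixel, the absolute correlation of its intensity over time
    with the per-frame audio energy. frames[t][r][c] is a gray level."""
    rows, cols = len(frames[0]), len(frames[0][0])
    return [
        [abs(pearson([f[r][c] for f in frames], audio_energy)) for c in range(cols)]
        for r in range(rows)
    ]


def centroid(weights):
    """Weighted centroid (row, col) of a 2-D map, or None if all weights are zero."""
    total = sum(sum(row) for row in weights)
    if total == 0:
        return None
    r = sum(i * w for i, row in enumerate(weights) for w in row) / total
    c = sum(j * w for row in weights for j, w in enumerate(row)) / total
    return (r, c)


# Hypothetical data: only the pixel at row 1, column 2 varies with the audio.
audio_energy = [0.1, 0.9, 0.2, 0.8]
frames = [[[5, 5, 5], [5, 5, 10 * e]] for e in audio_energy]
corr = synchrony_map(frames, audio_energy)
print(centroid(corr))
```

In this toy case every static pixel receives zero weight, so the centroid falls on the single pixel that moves with the sound. In real video many pixels would carry some correlation, and the centroid would only indicate the general region of activity.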
In another paper, M. Slaney and M. Covell (“FaceSync: A linear operator for measuring synchronization of video facial images and audio tracks,” available at www.slaney.org) described that Eigen Points could be used to identify lips of a speaker, whereas an algorithm by Yehia, Rubin and Vatikiotis-Bateson could be used to operate on a corresponding audio signal to provide positions of the fiduciary points on the face. The similar lip fiduciary points from the image and fiduciary points from the Yehia algorithm were then used for a comparison to determine lip sync. Slaney and Covell went on to describe optimizing this comparison in “an optimal linear detector, equivalent to a Wiener filter, which combines the information from all the pixels to measure audio-video synchronization.” Of particular note, “information from all of the pixels was used” in the FaceSync algorithm, thus decreasing the efficiency by taking information from clearly unrelated pixels. Further, the algorithm required training on specific known face images, and was further described as “dependent on both training and testing data sizes.” Additionally, while Slaney and Covell provided a mathematical explanation of their algorithm, they did not reveal any practical manner to implement or operate the algorithm to accomplish the lip sync measurement. Importantly, the Slaney and Covell approach relied on fiduciary points on the face, such as corners of the mouth and points on the lips.