The invention relates to the creation, manipulation, transmission, storage, etc., and in particular the synchronization, of multi-media entertainment, educational, surveillance and other programming having at least video and associated information. Such associated information includes audio, data and/or any other information which has a temporal relationship with the video. Generally the invention will be useful in any system or device in which it is desired that a timing relationship of two or more signals be maintained, measured, obtained and/or corrected, and is particularly useful with respect to image, audible and data signals with temporal timing relationships.
In the creation, manipulation, transmission, storage, display, etc. of multi-media entertainment, educational, surveillance and other programming having at least video and associated information, it is often preferred to have a degree of synchronization between the video portion and such associated information. Typical examples of such programming are television and movie programs. Often these programs include a visual or video portion, an audible or audio portion, and may also include one or more various data type portions, for example data pertaining to a sales (e.g. cash register) or credit card terminal together with the video from a surveillance camera viewing the device(s) generating such data. Typical data type portions include closed captioning, narrative descriptions for the blind, additional program origination, storage, transmission and information data such as web sites, financial and transactional data and further information directives, and various metadata included in compressed (e.g., MPEG and JPEG) systems.
Television programs having data, audio and video portions with temporal timing relationships will be used by way of example in the description of the preferred embodiment of the invention. Often the video and associated signal programs are produced, operated on, stored or conveyed in a manner such that the synchronization of various ones of the aforementioned audio, video and/or data is affected. For example, the synchronization of audio and video, commonly known as lip sync, may be askew when the program is produced. If the program is produced with correct lip sync, that timing may be upset by subsequent operations such as processing, storing or transmission of the program. It is important to recognize that a television program which is produced with lip sync intact may have the lip sync subsequently upset. That upset may be corrected by analyzing the audio and video signal processing delay differential which causes such subsequent upset. If the television program is initially produced with lip sync in error, the subsequent correction of that error is much more difficult, but the error can be corrected with the invention. Both these problems and their solutions via the invention will be appreciated from the teachings herein.
One aspect of multi-media programming is maintaining audio and video synchronization in audio-visual presentations, such as television programs, for example to prevent annoyances to the viewers, to facilitate further operations with the program or to facilitate analysis of the program. Various approaches to this challenge are described in the following issued patents: U.S. Pat. No. 4,313,135; U.S. Pat. No. 4,665,431; U.S. Pat. No. 4,703,355; U.S. Pat. No. Re. 33,535; U.S. Pat. No. 5,202,761; U.S. Pat. No. 5,530,483; U.S. Pat. No. 5,550,594; U.S. Pat. No. 5,572,261; U.S. Pat. No. 5,675,388; U.S. Pat. No. 5,751,368; U.S. Pat. No. 5,920,842; U.S. Pat. No. 5,946,049; U.S. Pat. No. 6,098,046; U.S. Pat. No. 6,141,057; U.S. Pat. No. 6,330,033; U.S. Pat. No. 6,351,281; U.S. Pat. No. 6,392,707; U.S. Pat. No. 6,421,636; U.S. Pat. No. 6,469,741; and U.S. Pat. No. 6,989,869. Generally these patents deal with detecting, maintaining and correcting lip sync and other types of video and related signal synchronization.
U.S. Pat. No. 5,572,261 describes the use of actual mouth images in the video signal to predict what syllables are being spoken and compare that information to sounds in the associated audio signal to measure the relative synchronization. Unfortunately when there are no images of the mouth, there may be no ability to determine which syllables are being spoken.
As another example, in systems where there exists the ability to measure the relation between audio and video portions of programs, an audio signal may correspond to one or more of a plurality of video signals, and it is desired to determine which. For example, in a television studio where each of three speakers wears a microphone and each speaker has a corresponding camera which takes images of that speaker, it is desirable to correlate the audio programming to the video signals from the cameras. One use of such correlation is to automatically select (e.g. for transmission or recording) the camera which televises the speaker who is currently speaking. As another example, when a particular camera is selected it is useful to select the audio corresponding to that video signal. In yet another example, it is useful to inspect an output video signal and determine which of a group of video signals it corresponds to, thereby facilitating automatic selection or timing of the corresponding audio. These types of systems are described in commonly assigned U.S. Pat. Nos. 5,530,483 and 5,751,368.
Generally, with the exception of U.S. Pat. Nos. 5,572,261, 5,530,483 and 5,751,368, the above patents describe operations without inspection of or response to the video signal images. Consequently the applicability of the descriptions of the patents is limited to particular systems where various video timing information, etc. is utilized. U.S. Pat. Nos. 5,530,483 and 5,751,368 deal with measuring video delays and identifying video signals by inspection of the images carried in the video signal, but do not make any comparison or other inspection of video and audio signals. U.S. Pat. No. 5,572,261 teaches the use of actual mouth images in the video signal and sounds in the associated audio signal to measure the relative synchronization. U.S. Pat. No. 5,572,261 describes a mode of operation of detecting the occurrence of mouth sounds in both the lips and audio. For example, when the lips move to make a sound like an E and an E is present in the audio, the time relation between the occurrences of these two events is used as a measure of the relative delay therebetween. U.S. Pat. No. 5,572,261 thus describes the use of a common attribute, such as particular sounds made by the lips, which can be detected in both audio and video signals. The detection and correlation of visual movement of the lips corresponding to certain sounds and the audible presence of the corresponding sound is computationally intensive, leading to high cost and complexity.
In a paper, J. Hershey and J. R. Movellan (“Audio-Vision: Locating sounds via audio-visual synchrony,” Advances in Neural Information Processing Systems 12, edited by S. A. Solla, T. K. Leen, K-R Muller, MIT Press, Cambridge, Mass., (c) 2000) recognized that sounds could be used to identify corresponding individual pixels in the video image. The correlation between the audio signal and individual ones of the pixels in the image was used to create movies that show the regions of the video that have high correlation with the audio. From the correlation data they estimate the centroid of image activity and use this to find the talking face. Hershey et al. described the ability to identify which of two speakers in a television image was speaking by correlating the sound and different parts of the face to detect synchronization. Hershey et al. noted, in particular, that “[i]t is interesting that the synchrony is shared by some parts, such as the eyes, that do not directly contribute to the sound, but contribute to the communication nonetheless.” More particularly, Hershey et al. noted that these parts of the face, including the lips, contribute to the communication as well. There was no suggestion by Hershey and Movellan that their algorithms could measure synchronization or perform any of the other features of the invention. Again, they specifically said that some of these parts, such as the eyes, do not directly contribute to the sound. In this reference, the algorithms merely identified who was speaking based on the movement or non-movement of features.
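The per-pixel correlation and centroid idea summarized above can be illustrated with a minimal sketch. It is not the procedure of Hershey and Movellan: it assumes a plain Pearson correlation between each pixel's intensity over time and a per-frame audio energy value, and the names `pearson`, `synchrony_map` and `centroid` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def synchrony_map(frames, audio_energy):
    """Correlate every pixel's intensity over time with the audio energy.

    frames: list of 2-D lists (rows x cols), one per video frame (assumed layout).
    audio_energy: one value per video frame (assumed pre-aligned to the frames).
    Returns a rows x cols map of absolute correlation values.
    """
    rows, cols = len(frames[0]), len(frames[0][0])
    return [
        [abs(pearson([f[r][c] for f in frames], audio_energy)) for c in range(cols)]
        for r in range(rows)
    ]


def centroid(weights):
    """Weighted (row, col) centroid of a 2-D map, or None if all weights are zero."""
    total = sum(sum(row) for row in weights)
    if total == 0:
        return None
    row_pos = sum(i * sum(row) for i, row in enumerate(weights)) / total
    col_pos = sum(j * w for row in weights for j, w in enumerate(row)) / total
    return (row_pos, col_pos)
```

Calling `centroid(synchrony_map(frames, audio_energy))` gives a rough location of the image region whose activity tracks the sound. As the text notes, this locates the sound source and does not by itself measure an audio-to-video delay.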
In another paper, M. Slaney and M. Covell (“FaceSync: A linear operator for measuring synchronization of video facial images and audio tracks,” available at www.slaney.org) described that Eigen Points could be used to identify lips of a speaker, whereas an algorithm by Yehia, Rubin, and Vatikiotis-Bateson could be used to operate on a corresponding audio signal to provide positions of the fiduciary points on the face. The similar lip fiduciary points from the image and fiduciary points from the Yehia algorithm were then used for a comparison to determine lip sync. Slaney and Covell went on to describe optimizing this comparison in “an optimal linear detector, equivalent to a Wiener filter, which combines the information from all the pixels to measure audio-video synchronization.” Of particular note, “information from all of the pixels was used” in the FaceSync algorithm, thus decreasing the efficiency by taking information from clearly unrelated pixels. Further, the algorithm required training on specific known face images, and was further described as “dependent on both training and testing data sizes.” Additionally, while Slaney and Covell provided a mathematical explanation of their algorithm, they did not reveal any practical manner to implement or operate the algorithm to accomplish the lip sync measurement. Importantly, the Slaney and Covell approach relied on fiduciary points on the face, such as corners of the mouth and points on the lips.
Also, in U.S. Pat. No. 5,387,943 of Silver, a method is described that requires that the mouth be identified by an operator and that, like U.S. Pat. No. 5,572,261 discussed above, utilizes video lip movements. In both of these references, only the mere lip movement is focused on. No other characteristic of the lips or other facial features, such as the shape of the lips, is considered in either of these disclosed methods. In particular, the spatial lip shape is not detected or considered in either of these references. Rather, only the movement and whether the lips are opened or closed are discussed.
In U.S. application Ser. No. 11/598,870, filed on Nov. 13, 2006 by the inventor, a method is described for directly comparing images conveyed in the video portion of a signal to characteristics in an associated signal, such as an audio signal. The invention considers the shape and movement of the lips, providing substantially improved accuracy of audio and video synchronization of spoken words by video characters. Furthermore, the invention provides a method for determining different spoken sounds by determining whether teeth are present between the open lips, such as when the letters “v” or “s”, for example, are pronounced. A system configured according to the invention can thus reduce or remove one or more of the effects of different speaker related voice characteristics.
The term Audio and Video MuEv (see U.S. Pat. No. 7,499,104, Publication 20040227856) is introduced by the inventor. MuEv is a contraction of Mutual Event, meaning an event occurring in an image, signal or data which is unique enough that it may be accompanied by another MuEv in an associated signal. Two such MuEvs are, for example, an Audio MuEv and a Video MuEv, where a certain video quality (or sequence) corresponds to a unique and matching audio event. One simple example of a pair of audio and video MuEvs is the crack of a baseball bat hitting a ball (the audio MuEv) and the instant change of direction of the ball (the video MuEv). Because both happen at the same instant, they can be utilized to determine any subsequent mistiming of audio and video.
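As a minimal sketch of how paired MuEvs could expose a mistiming, the following assumes the audio and video MuEvs have already been detected and are available as lists of timestamps in milliseconds. The function name `estimate_av_offset_ms`, the nearest-event pairing rule and the `max_gap_ms` limit are illustrative assumptions, not the method of the referenced patent.

```python
from statistics import median


def estimate_av_offset_ms(audio_ms, video_ms, max_gap_ms=500):
    """Estimate the audio-minus-video timing offset from paired events.

    audio_ms, video_ms: timestamps (milliseconds) of detected audio and video
    MuEvs (assumed inputs; detection itself is outside this sketch).
    Each audio event is paired with the nearest video event; pairs further
    apart than max_gap_ms are treated as unrelated and ignored.
    Returns the median difference, or None if no pair qualifies.
    A positive result means the audio events occur later than the video events.
    """
    diffs = []
    for a in audio_ms:
        nearest = min(video_ms, key=lambda v: abs(a - v), default=None)
        if nearest is not None and abs(a - nearest) <= max_gap_ms:
            diffs.append(a - nearest)
    return median(diffs) if diffs else None
```

The median is used here so that a single spurious pairing does not dominate the estimate; a mean or a histogram peak would be equally plausible choices.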
This may be done for faces and speech by first acquiring Audio and Video MuEvs from input audio-video signals and using them to calibrate an audio video synchronization system. The MuEv acquisition and calibration phase is followed by analysis of the audio information and analysis of the video information. From these analyses, Audio MuEvs and Video MuEvs are calculated, and the audio and video information is classified into sound classes including, but not limited to: the vowel sounds AA, EE, OO (capital double letters signifying the sounds of vowels a, e and o respectively); the letters “s”, “v”, “z” and “f”, i.e. closed mouth shapes when teeth are present; the letters “p”, “b”, “m”, i.e. closed mouth shapes where teeth are not present; silence; and other unclassified phonemes. This information is used to determine and associate a dominant audio class with one or more corresponding video frames. Matching locations are determined, and the offset of video and audio is determined. A simply explained example is that the sound EE (an audio MuEv) may be identified as occurring in the audio information and matched to a corresponding image characteristic, such as lips forming a shape associated with speaking the vowel EE (a video MuEv), with the relative timing thereof being measured or otherwise utilized to determine or correct a lip sync error.
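The matching step can be sketched as follows, assuming each video frame has already been assigned one class label from the video analysis and one dominant class label from the audio analysis. The function name `best_frame_offset`, the label strings and the exhaustive shift search are illustrative assumptions rather than the disclosed implementation.

```python
def best_frame_offset(audio_classes, video_classes, max_shift=5,
                      ignore=("silence", "other")):
    """Find the frame shift that best aligns audio classes with video classes.

    audio_classes, video_classes: one class label per video frame, e.g. "AA",
    "EE", "OO", "silence" (hypothetical labels for this sketch).
    For each candidate shift s, counts frames i where video_classes[i] equals
    audio_classes[i + s], skipping labels listed in ignore.
    Returns (shift, score); a positive shift means the audio lags the video.
    Ties are resolved in favor of the smallest absolute shift.
    """
    best_shift, best_score = 0, -1
    # Try small shifts first so that ties keep the smallest correction.
    for shift in sorted(range(-max_shift, max_shift + 1), key=abs):
        score = 0
        for i, label in enumerate(video_classes):
            j = i + shift
            if 0 <= j < len(audio_classes) and label not in ignore:
                if audio_classes[j] == label:
                    score += 1
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift, best_score
```

In this sketch, silence and unclassified frames are excluded from the score because they match at almost any shift and would blur the peak. The returned shift is in frames and would be multiplied by the frame period to obtain a time offset.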