The invention relates to the creation, manipulation, transmission, storage, etc., and especially the synchronization, of multi-media entertainment, educational and other programming having at least video and associated information.
The creation, manipulation, transmission, storage, etc. of multi-media entertainment, educational and other programming having at least video and associated information requires synchronization. Typical examples of such programming are television and movie programs. Often these programs include a visual or video portion and an audible or audio portion, and may also include one or more data type portions. Typical data type portions include closed captioning, narrative descriptions for the blind, additional program information data such as web sites and further information directives, and various metadata included in compressed (for example, MPEG and JPEG) systems.
Often the video and associated signal programs are produced, operated on, stored or conveyed in a manner such that the synchronization of various ones of the aforementioned audio, video and/or data is affected. For example, the synchronization of audio and video, commonly known as lip sync, may be askew when the program is produced. If the program is produced with correct lip sync, that timing may be upset by subsequent operations, for example processing, storing or transmission of the program. It is important to recognize that a television program which is produced with lip sync intact may have the lip sync subsequently upset. That upset may be corrected by analyzing the audio and video signal processing delay differential which causes such subsequent upset. If the television program is initially produced with lip sync in error, the subsequent correction of that error is much more difficult, but it can be achieved with the invention. Both these problems and their solutions via the invention will be appreciated from the teachings herein.
One aspect of multi-media programming is maintaining audio and video synchronization in audio-visual presentations, such as television programs, for example to prevent annoyances to the viewers, to facilitate further operations with the program or to facilitate analysis of the program. Various approaches to this challenge are described in issued patents, including U.S. Pat. Nos. 4,313,135; 4,665,431; 4,703,355; U.S. Pat. No. Re. 33,535; and U.S. Pat. Nos. 5,202,761; 5,530,483; 5,550,594; 5,572,261; 5,675,388; 5,751,368; 5,920,842; 5,946,049; 6,098,046; 6,141,057; 6,330,033; 6,351,281; 6,392,707; 6,421,636; 6,469,741 and 6,989,869. Generally these patents deal with detecting, maintaining and correcting lip sync and other types of video and related signal synchronization.
U.S. Pat. No. 5,572,261 describes the use of actual mouth images in the video signal to predict what syllables are being spoken and compare that information to sounds in the associated audio signal to measure the relative synchronization. Unfortunately when there are no images of the mouth, there is no ability to determine which syllables are being spoken.
As another example, in systems that need the ability to measure the relation between audio and video portions of programs, an audio signal may correspond to one or more of a plurality of video signals, and it is desired to determine which. For example, in a television studio where each of three speakers wears a microphone and each speaker has a corresponding camera which takes images of that speaker, it is desirable to correlate the audio programming to the video signals from the cameras. One use of such correlation is to automatically select (for transmission or recording) the camera which televises the speaker who is currently speaking. As another example, when a particular camera is selected it is useful to select the audio corresponding to that video signal. In yet another example, it is useful to inspect an output video signal and determine which of a group of video signals it corresponds to, thereby facilitating automatic selection or timing of the corresponding audio. These types of systems are described in commonly assigned U.S. Pat. Nos. 5,530,483 and 5,751,368.
The above patents are incorporated in their entirety herein by reference in respect to the prior art teachings they contain.
Generally, with the exception of U.S. Pat. Nos. 5,572,261, 5,530,483 and 5,751,368, the above patents describe operations without any inspection of or response to the video signal images. Consequently the applicability of the descriptions of the patents is limited to particular systems where various video timing information, etc. is utilized. U.S. Pat. Nos. 5,530,483 and 5,751,368 deal with measuring video delays and identifying a video signal by inspection of the images carried in the video signal, but do not make any comparison or other inspection of video and audio signals. U.S. Pat. No. 5,572,261 teaches the use of actual mouth images in the video signal and sounds in the associated audio signal to measure the relative synchronization. U.S. Pat. No. 5,572,261 describes a mode of operation of detecting the occurrence of mouth sounds in both the lips and the audio. For example, when the lips take on a position used to make a sound like an E and an E is present in the audio, the time relation between the occurrences of these two events is used as a measure of the relative delay therebetween. U.S. Pat. No. 5,572,261 thus describes the use of a common attribute, for example particular sounds made by the lips, which can be detected in both the audio and video signals. The detection and correlation of the visual positioning of the lips corresponding to certain sounds and the audible presence of the corresponding sound is computationally intensive, leading to high cost and complexity.
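The time-relation measurement summarized above may be illustrated by a minimal sketch. It is not taken from U.S. Pat. No. 5,572,261; it assumes that some upstream detector (not shown) has already produced time stamps, in seconds, at which a given mouth position appears in the video and at which the corresponding sound appears in the audio. The function name estimate_av_delay, the nearest-event pairing rule and the max_offset tolerance are hypothetical choices made only for illustration.

```python
from statistics import median


def estimate_av_delay(video_event_times, audio_event_times, max_offset=0.5):
    """Estimate the delay of audio relative to video, in seconds.

    Hypothetical illustration: each video event (e.g. lips in an "E"
    position) is paired with the nearest audio event (e.g. an "E" sound).
    Pairs further apart than max_offset seconds are treated as unrelated
    and discarded. A positive result means the audio lags the video.
    Returns None when no usable pair is found.
    """
    offsets = []
    if not audio_event_times:
        return None
    for tv in video_event_times:
        # Nearest audio event to this video event.
        ta = min(audio_event_times, key=lambda t: abs(t - tv))
        if abs(ta - tv) <= max_offset:
            offsets.append(ta - tv)
    # The median is used so that a few mismatched pairs do not dominate.
    return median(offsets) if offsets else None


if __name__ == "__main__":
    video = [1.0, 2.5, 4.0]
    audio = [1.12, 2.62, 4.12, 9.0]
    print(estimate_av_delay(video, audio))
```

The sketch covers only the final comparison step; the computationally intensive part noted above, namely recognizing the mouth positions and the matching sounds, lies in the detectors that it assumes.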
In a paper by J. Hershey and J. R. Movellan (“Audio-Vision: Locating sounds via audio-visual synchrony,” Advances in Neural Information Processing Systems 12, edited by S. A. Solla, T. K. Leen, K-R Muller, MIT Press, Cambridge, Mass., (c) 2000), it was recognized that sounds could be used to identify corresponding individual pixels in the video image. The correlation between the audio signal and individual ones of the pixels in the image was used to create movies that show the regions of the video that have high correlation with the audio, and from the correlation data the authors estimated the centroid of image activity and used this to find the talking face. Hershey et al. described the ability to identify which of two speakers in a television image was speaking by correlating the sound and different parts of the face to detect synchronization. Hershey et al. noted, in particular, that “[i]t is interesting that the synchrony is shared by some parts, such as the eyes, that do not directly contribute to the sound, but contribute to the communication nonetheless.” More particularly, Hershey et al. noted that these parts of the face, including the lips, contribute to the communication as well. There was no suggestion by Hershey and Movellan that their algorithms could measure synchronization or perform any of the other features of the invention. Again, they specifically said that some of these parts do not directly contribute to the sound. In this reference, the algorithms merely identified who was speaking based on the movement or non-movement of features.
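The pixel-by-pixel correlation and centroid estimate described for Hershey and Movellan may be sketched as follows. This is an illustrative stand-in only, not the authors' implementation: it uses a plain Pearson correlation between a per-frame audio energy value and each pixel's intensity over time, and it assumes small grayscale frames given as nested lists. The names pearson, synchrony_map and synchrony_centroid are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def synchrony_map(frames, audio_energy):
    """Return a height x width map of |correlation| between each pixel and the audio.

    frames: list of T frames, each a list of rows of pixel intensities.
    audio_energy: list of T audio energy values, one per frame (assumed input).
    """
    height = len(frames[0])
    width = len(frames[0][0])
    return [
        [abs(pearson([f[r][c] for f in frames], audio_energy)) for c in range(width)]
        for r in range(height)
    ]


def synchrony_centroid(corr_map):
    """Correlation-weighted centroid (row, column) of the map, or None if the map is all zero."""
    total = sum(sum(row) for row in corr_map)
    if total == 0:
        return None
    row_c = sum(r * v for r, row in enumerate(corr_map) for v in row) / total
    col_c = sum(c * v for row in corr_map for c, v in enumerate(row)) / total
    return (row_c, col_c)


if __name__ == "__main__":
    # Four 2x2 frames; only the pixel at row 1, column 0 varies with the audio.
    audio = [0.0, 1.0, 0.0, 1.0]
    frames = [[[5, 5], [10 * a, 5]] for a in audio]
    print(synchrony_centroid(synchrony_map(frames, audio)))
```

As the surrounding text notes, a map of this kind indicates where in the image the activity that tracks the sound is located; it does not by itself yield a measurement of the audio-to-video delay.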
In another paper, M. Slaney and M. Covell (“FaceSync: A linear operator for measuring synchronization of video facial images and audio tracks,” available at www.slaney.org) described that Eigen Points could be used to identify the lips of a speaker, whereas an algorithm by Yehia, Rubin and Vatikiotis-Bateson could be used to operate on a corresponding audio signal to provide positions of the fiduciary points on the face. The similar lip fiduciary points from the image and the fiduciary points from the Yehia algorithm were then used for a comparison to determine lip sync. Slaney and Covell went on to describe optimizing this comparison in “an optimal linear detector, equivalent to a Wiener filter, which combines the information from all the pixels to measure audio-video synchronization.” Of particular note, “information from all of the pixels was used” in the FaceSync algorithm, thus decreasing the efficiency by taking information from clearly unrelated pixels. Further, the algorithm required training on specific known face images, and was further described as “dependent on both training and testing data sizes.” Additionally, while Slaney and Covell provided a mathematical explanation of their algorithm, they did not reveal any practical manner of implementing or operating the algorithm to accomplish the lip sync measurement. Importantly, the Slaney and Covell approach relied on fiduciary points on the face, such as the corners of the mouth and points on the lips.
Also, in U.S. Pat. No. 5,387,943 of Silver, a method is described that requires that the mouth be identified by an operator and that, like U.S. Pat. No. 5,572,261 discussed above, utilizes video lip movements. In both of these references, only the mere lip movement is focused on. No other characteristic of the lips or other facial features, such as the shape of the lips, is considered in either of these disclosed methods. In particular, the spatial lip shape is not detected or considered in either of these references, just the movement, opened or closed.
Perceptual aspects of the human voice, such as pitch, loudness, timbre and timing (related to tempo and rhythm) are usually considered to be more or less independent of one another and they are considered to be related to the acoustic signal's fundamental frequency f0, amplitude, spectral envelope and time variation, respectively. Unfortunately, when conventional voice recognition techniques and synchronization techniques are attempted, they are greatly affected by individual speaker characteristics, such as low or high voice tones, accents, inflections and other voice characteristics that are difficult to recognize, quantify or otherwise identify.
It will be seen that it is useful to recognize different movements of the lips and teeth of a speaker to better recognize different vowel sounds. Therefore, there exists a need in the art for an improved video and audio synchronization system that accounts for different mouth characteristics, such as lip characteristics, including the inner area between the lips, and teeth characteristics. As will be seen, the invention accomplishes this in an elegant manner.