1. Field of the Invention
The invention relates to the creation, manipulation, transmission, storage, etc. and especially synchronization of multi-media entertainment, educational and other programming having at least video and associated information.
2. Background Art
The creation, manipulation, transmission, storage, etc. of multi-media content, be it entertainment, educational, scientific, business, or other programming having at least video and associated information, requires synchronization. Typical examples of such programming are television and movie programs, motion medical images, and various engineering and scientific content. These are collectively referred to as “programs.”
Often these programs include a visual or video portion, an audible or audio portion, and may also include one or more data type portions. Typical data type portions include closed captioning, narrative descriptions for the blind, additional program information data such as web sites and further information directives, and various metadata included in compressed systems (for example MPEG and JPEG).
Often the video and associated signal programs are produced, operated on, stored or conveyed in a manner such that the synchronization of various ones of the aforementioned audio, video and/or data is affected. For example, the synchronization of audio and video, commonly known as lip sync, may be askew when the program is produced. Even if the program is produced with correct lip sync, that timing may be upset by subsequent operations such as processing, storing or transmission of the program.
One aspect of multi-media programming is maintaining audio and video synchronization in audio-visual presentations, such as television programs, for example to prevent annoyances to the viewers, to facilitate further operations with the program or to facilitate analysis of the program. Various approaches to this challenge are described in commonly assigned, issued patents: U.S. Pat. Nos. 4,313,135; 4,665,431; 4,703,355; U.S. Pat. Re. 33,535; U.S. Pat. Nos. 5,202,761; 5,530,483; 5,550,594; 5,572,261; 5,675,388; 5,751,368; 5,920,842; 5,946,049; 6,098,046; 6,141,057; 6,330,033; 6,351,281; 6,392,707; 6,421,636 and 6,469,741. Generally these patents deal with detecting, maintaining and correcting lip sync and other types of video and related signal synchronization.
U.S. Pat. No. 5,572,261 describes the use of actual mouth images in the video signal to predict what syllables are being spoken and compare that information to sounds in the associated audio signal to measure the relative synchronization. Unfortunately when there are no images of the mouth, there is no ability to determine which syllables are being spoken.
As another example, in systems where the ability to measure the relation between audio and video portions of programs is desired, an audio signal may correspond to one or more of a plurality of video signals, and it is desired to determine which. For example, in a television studio where each of three speakers wears a microphone and each speaker has a corresponding camera which takes images of that speaker, it is desirable to correlate the audio programming to the video signals from the cameras. One use of such correlation is to automatically select (for transmission or recording) the camera which televises the speaker who is currently speaking. As another example, when a particular camera is selected it is useful to select the audio corresponding to that video signal. In yet another example, it is useful to inspect an output video signal and determine which of a group of video signals it corresponds to, thereby facilitating automatic selection or timing of the corresponding audio. These types of systems are described in commonly assigned U.S. Pat. Nos. 5,530,483 and 5,751,368.
The above patents are incorporated in their entirety herein by reference in respect to the prior art teachings they contain.
Generally, with the exception of U.S. Pat. Nos. 5,572,261, 5,530,483 and 5,751,368, the above patents describe operations without any inspection of or response to the video signal images. Consequently the applicability of the descriptions of the patents is limited to particular systems where various video timing information, etc. is utilized. U.S. Pat. Nos. 5,530,483 and 5,751,368 deal with measuring video delays and identifying video signals by inspection of the images carried in the video signal, but do not make any comparison or other inspection of video and audio signals. U.S. Pat. No. 5,572,261 teaches the use of actual mouth images in the video signal and sounds in the associated audio signal to measure the relative synchronization. It describes a mode of operation of detecting the occurrence of mouth sounds in both the lips and audio. For example, when the lips take on a position used to make a sound like an E and an E is present in the audio, the time relation between the occurrence of these two events is used as a measure of the relative delay therebetween. The description in U.S. Pat. No. 5,572,261 relies on a common attribute, such as particular sounds made by the lips, which can be detected in both audio and video signals. The detection and correlation of visual positioning of the lips corresponding to certain sounds and the audible presence of the corresponding sound is computationally intensive, leading to high cost and complexity.
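The event-timing idea described above (the time relation between a sound as seen on the lips and the same sound as heard in the audio) can be illustrated with a minimal Python sketch. This is not the method of U.S. Pat. No. 5,572,261; the function name, the pairing of events by order of occurrence, the use of a median, and the sample timestamps are all assumptions made for illustration only.

```python
from statistics import median


def relative_delay_ms(video_event_ms, audio_event_ms):
    """Estimate audio-minus-video delay from paired event timestamps.

    Assumption: the i-th video event (e.g. lips forming an "E") and the
    i-th audio event (an "E" heard) are the same utterance. A negative
    result means the audio is early relative to the video.
    """
    if not video_event_ms or len(video_event_ms) != len(audio_event_ms):
        raise ValueError("need equally many, non-empty video and audio events")
    # The median is used so that one mis-detected event does not
    # dominate the estimate (a design choice of this sketch).
    return median(a - v for v, a in zip(video_event_ms, audio_event_ms))


# Hypothetical timestamps, in milliseconds from the start of the program.
video_events = [1000, 2500, 4000]
audio_events = [960, 2460, 3965]
delay = relative_delay_ms(video_events, audio_events)
```

The sketch assumes the hard part, detecting the matching lip positions and sounds, has already been done; that detection is the step the passage above identifies as computationally intensive.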
In a paper by J. Hershey and J. R. Movellan (“Audio-Vision: Locating sounds via audio-visual synchrony,” Advances in Neural Information Processing Systems 12, edited by S. A. Solla, T. K. Leen, K-R Muller, MIT Press, Cambridge, Mass., ©2000), it was recognized that sounds could be used to identify corresponding individual pixels in the video image. The correlation between the audio signal and individual ones of the pixels in the image was used to create movies that show the regions of the video that have high correlation with the audio, and from the correlation data the authors estimated the centroid of image activity and used this to find the talking face. Hershey et al. described the ability to identify which of two speakers in a television image was speaking by correlating the sound and different parts of the face to detect synchronization. Hershey et al. noted, in particular, that “[i]t is interesting that the synchrony is shared by some parts, such as the eyes, that do not directly contribute to the sound, but contribute to the communication nonetheless.” There was no suggestion by Hershey and Movellan that their algorithms could measure synchronization or perform any of the other features of the present invention.
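The general idea of correlating an audio signal with individual pixels and then locating the centroid of the correlated region can be sketched as follows. This is a simplified illustration using a plain Pearson correlation per pixel, not a reproduction of the Hershey and Movellan algorithm; the function names, the squared-correlation weighting and the tiny test frames are assumptions of this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_map(frames, audio):
    """Correlate each pixel's intensity over time with one audio value per frame.

    frames: list of T frames, each a list of rows of pixel intensities.
    audio:  list of T audio feature values (e.g. energy), one per frame.
    """
    height, width = len(frames[0]), len(frames[0][0])
    return [[pearson([f[r][c] for f in frames], audio) for c in range(width)]
            for r in range(height)]


def activity_centroid(corr):
    """Centroid (row, col) of the map, weighted by squared correlation."""
    total = row_acc = col_acc = 0.0
    for r, row in enumerate(corr):
        for c, value in enumerate(row):
            weight = value * value
            total += weight
            row_acc += weight * r
            col_acc += weight * c
    if total == 0:
        return None  # no pixel varies with the audio
    return (row_acc / total, col_acc / total)


# Hypothetical 2x2 video, 4 frames: only pixel (row 1, col 0) follows the audio.
audio = [0, 1, 0, 1]
frames = [[[5, 5], [10 + 10 * a, 5]] for a in audio]
corr = correlation_map(frames, audio)
centroid = activity_centroid(corr)
```

In this toy input the centroid falls on the one pixel that moves with the sound, which is the sense in which such a map can point to the talking face. Note that the sketch only locates correlated regions; it does not by itself measure an audio-to-video timing offset.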
In another paper, M. Slaney and M. Covell (“FaceSync: A linear operator for measuring synchronization of video facial images and audio tracks,” available at www.slaney.org) described that Eigen Points could be used to identify lips of a speaker, whereas an algorithm by Yehia, Rubin and Vatikiotis-Bateson could be used to operate on a corresponding audio signal to provide positions of the fiduciary points on the face. The similar lip fiduciary points from the image and fiduciary points from the Yehia algorithm were then used for a comparison to determine lip sync. Slaney and Covell went on to describe optimizing this comparison in “an optimal linear detector, equivalent to a Wiener filter, which combines the information from all the pixels to measure audio-video synchronization.” Of particular note, “information from all of the pixels was used” in the FaceSync algorithm, thus decreasing the efficiency by taking information from clearly unrelated pixels. Further, the algorithm required the use of training to specific known face images, and was further described as “dependent on both training and testing data sizes.” Additionally, while Slaney and Covell provided a mathematical explanation of their algorithm, they did not reveal any practical manner to implement or operate the algorithm to accomplish the lip sync measurement. Importantly, the Slaney and Covell approach relied on fiduciary points on the face, such as corners of the mouth and points on the lips.
The video and audio signals in a television system are increasingly being subjected to more and more steps of digital processing. Each step has the potential to add a different amount of delay to the video and audio, thereby introducing a lip sync error. Incorrect lip sync is a major concern to newscasters, advertisers, politicians and others who are trying to convey a sense of trust, accuracy and sincerity to their audience. Studies have demonstrated that when lip sync errors are present, viewers perceive a message as less interesting, more unpleasant, less influential and less successful than the same message with proper lip sync.
Because light travels faster than sound, we are used to seeing events before we hear them—lightning before thunder, a puff of smoke before a cannon shot and so on. Therefore, to some extent, we can tolerate “late” audio. Unfortunately, as shown in FIG. 1, even in a simple television system, the video is almost always delayed more than the audio, creating the unnatural situation of “early” audio. Any one contributor to the lip sync error may or may not be noticeable. But the cumulative error from the original acquisition point to the viewer can easily become both noticeable and objectionable. The potential for lip sync errors increases even further when MPEG compressed links are added to one or more stages of the overall system—however, that's a topic for another day.
From CCD cameras, to frame synchronizers, production switchers, digital video effects, noise reducers, MPEG encoders and decoders, TVs with digital processing and the like, the video is delayed more than the audio. Worse yet, the amount of video delay frequently jumps by a frame or more as the operating mode changes, or as frames of video are dropped or repeated. So, using a fixed audio delay to “mop up” the errors is rarely a satisfactory solution.
Standards committees in various countries have studied the lip sync problem and have set guidelines for the maximum allowable errors. For the most part, these studies have determined that lip sync errors become noticeable if the audio is early by more than 25-35 milliseconds (about 1 NTSC frame) or late by more than 80-90 milliseconds (2.5-3.0 NTSC frames). In June of 2003, the Advanced Television Systems Committee (ATSC) issued a finding that stated “ . . . at the inputs to the DTV encoding device . . . the sound program should never lead the video program by more than 15 milliseconds, and should never lag the video program by more than 45 milliseconds.” The finding continued “Pending [a finding on tolerances for system design], designers should strive for zero differential offset throughout the system.” In other words, it is important to eliminate or minimize the errors at each stage where they occur, instead of allowing them to accumulate.
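The tolerance figures quoted above can be made concrete with a short Python sketch that converts milliseconds to NTSC frames and checks an offset against the ATSC window of 15 milliseconds lead and 45 milliseconds lag. The function names, the sign convention (negative means audio early) and the per-stage offsets are assumptions of this sketch, not part of any standard.

```python
# One NTSC frame lasts 1000 ms / (30000/1001 frames per second), about 33.37 ms.
NTSC_FRAME_MS = 1001 / 30

# Limits from the ATSC finding quoted above, at the DTV encoder input.
ATSC_MAX_AUDIO_LEAD_MS = 15.0
ATSC_MAX_AUDIO_LAG_MS = 45.0


def ms_to_ntsc_frames(ms):
    """Express a duration in milliseconds as a number of NTSC frames."""
    return ms / NTSC_FRAME_MS


def within_atsc_window(audio_offset_ms):
    """True if the offset is inside the quoted window.

    Sign convention (assumed here): negative = audio early (leads video),
    positive = audio late (lags video).
    """
    return -ATSC_MAX_AUDIO_LEAD_MS <= audio_offset_ms <= ATSC_MAX_AUDIO_LAG_MS


def accumulated_offset_ms(stage_offsets_ms):
    """Total offset after several processing stages, assuming offsets simply add."""
    return sum(stage_offsets_ms)


# Hypothetical chain: three stages, each making the audio 10 ms early.
stages = [-10.0, -10.0, -10.0]
each_stage_ok = all(within_atsc_window(s) for s in stages)
total = accumulated_offset_ms(stages)
chain_ok = within_atsc_window(total)
```

With these illustrative numbers every individual stage is inside the window while the accumulated offset of 30 milliseconds early audio is not, which is the reason for correcting errors at each stage rather than letting them accumulate.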
Fortunately, the “worst case” condition in FIG. 3 is now less likely to present itself than was the case a few years ago. Firstly, it is now quite common to install audio tracking delays, exemplified by the Pixel Instruments AD-3000, alongside each video frame synchronizer, thereby eliminating at least one common source of variable lip sync errors.
Secondly, newer master control switchers have an internal DVE for squeezeback operation rather than an external DVE. This allows the use of a constant insertion delay of 1 frame for both the video and the audio paths in all modes of operation.
Since the 1970s, digital video effects processors (DVEs or transform engines) have been used to produce “over the shoulder,” “double box” and other multiple source composited effects. The video being transformed is delayed (usually by one or more frames) relative to the background video in the switcher. So, any time one or more DVE processors are on-air, the associated video sources will be delayed, resulting in a lip sync error. In the past, when the DVE processor was external to the switcher, a tally signal from the switcher could be used to trigger the insertion of a compensating audio delay when the DVE is on-air. However, today's production switchers are usually equipped with internal DVEs and a tally output is no longer available.
Thus, a need exists for a lip synchronization method providing direct comparison of the video images conveyed in the video portion of a signal to one or more characteristics in an associated signal, such as an audio signal.