The term “video” is typically used to refer to a combination of video media content (e.g., a time sequence of images) and its associated audio media content. Such a combination may be employed in television broadcasts and streaming video, among others. During the preparation and/or transmission of such video, the video media content and the audio media content may, at times, need to be separated so that processing operations specific to each type of media content can be performed. For example, in television broadcasts, such processing operations can include frame synchronization, digital video effects processing, video noise reduction, format conversion, MPEG pre-processing, etc. With regard to streaming video, such processing operations can include transforming the video media content and the audio media content to conform to one or more different protocol standards, changing the bandwidth used for the respective media content, etc. While such processing operations are being performed, the video media content and the audio media content may pass through separate media channels and through different processing elements, which may subject the respective media content to different amounts of delay. The result is a relative delay (also referred to herein as a “temporal offset”) between the video media content and the audio media content. For example, in a television broadcast of a talking person, a viewer may perceive a temporal offset between the movement of the talking person's lips in the time sequence of images and the sound generated from the associated audio media content.
The temporal relationship between video media content and its associated audio media content is referred to herein as the “A/V sync” or “lip sync”. When the two are not properly aligned, the video is said to contain A/V sync errors or lip sync errors. Although perception varies from person to person, it is generally understood that a human viewer would not perceive a temporal offset if the audio media content leads the video media content by less than a threshold of about 0.015 seconds (15 milliseconds), or if the audio media content lags the video media content by less than a threshold of about 0.045 seconds (45 milliseconds). If either threshold is exceeded, then it may be desirable to attempt to remove or reduce the temporal offset. One known technique for removing such a temporal offset is to apply a corresponding amount of delay to whichever of the audio media content and the video media content is leading. Such a temporal offset can be a source of great discontent not only for viewers of an affected video, but also for those responsible for the creation and/or dissemination of the video, as they are often not immediately aware that the problem has occurred, and thus might not be in a good position to remedy the existing problem or to prevent it from recurring in the future.
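By way of illustration only, the perceptibility thresholds and the delay-based correction discussed above can be sketched in Python as follows. The function names, the constant names, and the sign convention (a positive offset meaning that the audio media content lags the video media content) are assumptions made for this sketch and are not taken from any particular system. The handling of an offset exactly equal to a threshold is likewise an assumption.

```python
from typing import Tuple

# Approximate thresholds quoted in the text, in seconds. They vary by viewer.
AUDIO_LEAD_THRESHOLD_S = 0.015  # audio ahead of video
AUDIO_LAG_THRESHOLD_S = 0.045   # audio behind video


def is_offset_perceptible(offset_s: float) -> bool:
    """Return True if the offset is likely to be noticed by a viewer.

    Assumed sign convention: offset_s > 0 means audio lags video,
    offset_s < 0 means audio leads video.
    """
    if offset_s >= 0:
        # Audio lags: tolerated while strictly below the lag threshold.
        return offset_s >= AUDIO_LAG_THRESHOLD_S
    # Audio leads: tolerated while strictly below the lead threshold.
    return -offset_s >= AUDIO_LEAD_THRESHOLD_S


def compensating_delay(offset_s: float) -> Tuple[str, float]:
    """Return which component to delay, and by how many seconds,
    so that the temporal offset is removed."""
    if offset_s > 0:
        return ("video", offset_s)    # audio is late, so hold the video back
    if offset_s < 0:
        return ("audio", -offset_s)   # audio is early, so hold the audio back
    return ("none", 0.0)
```

Under these assumptions, an audio lag of 0.050 seconds would be flagged as perceptible and would be corrected by delaying the video media content by 0.050 seconds, whereas an audio lead of 0.010 seconds would fall within the tolerated range.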
It would therefore be desirable to have improved systems and methods of measuring the temporal offset that a media channel introduces between video media content and audio media content.