Closed captioning is a term describing several systems developed to display text on a television or video screen to provide additional or interpretive information to viewers who wish to access it. Closed captions typically display a transcript of the audio portion of a program as it occurs (either verbatim or in edited form), sometimes including non-speech elements. Most commonly, closed captions are used by deaf or hard of hearing individuals to assist comprehension. However, audio transcripts associated with video are also an important tool used for creating indexes or underlying metadata associated with video that can be used for many different purposes.
When indexing videos and associating metadata with them, it is very important that the video and audio be closely and correctly aligned so that the underlying metadata of each frame or scene of the video matches up correctly. Unfortunately, audio transcripts obtained from known sources, such as human-generated closed captions or automated speech recognition (ASR) software, almost always introduce time lags. It has been observed in the industry, with production-level data, that closed caption text, while often accurate in content, can be shifted by 30 seconds or more with respect to the corresponding visuals. Such time lags introduce errors in time-based indexing and can create errors in the underlying metadata associated with a video, especially if the timeline of the audio transcript is relied upon and assumed to sync correctly with the timeline of the actual audio.
On the other hand, ASR software, when used alone to generate an audio transcript of a corresponding video, usually captures the correct timeline and time location for each word or sound associated with the video, but it still produces a number of transcription errors and tends to miss some text, especially when there is a lot of background noise.
For these and many other reasons, there is a need for systems and methods that correctly and accurately calibrate the timeline of audio transcripts against the underlying audio and video. Such calibration not only improves the underlying metadata created by video indexing systems, but also provides an improved and automated way of syncing closed-captioned text with the actual audio and corresponding video for later playback and use.
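By way of illustration only, the calibration problem described above can be pictured with a minimal Python sketch. It assumes, hypothetically, that both the closed captions and an ASR transcript are available as lists of (word, start time in seconds) pairs, and it estimates a single constant lag by matching words the two transcripts share. The function names `estimate_caption_lag` and `retime_captions`, the data layout, and the constant-lag assumption are all introduced here for illustration and are not taken from the methods disclosed herein.

```python
from difflib import SequenceMatcher
from statistics import median


def estimate_caption_lag(caption_words, asr_words):
    """Estimate how many seconds the captions trail the audio.

    Both arguments are lists of (word, start_time_seconds) pairs.
    Returns None when the two transcripts share no words.
    """
    cap_tokens = [word.lower() for word, _ in caption_words]
    asr_tokens = [word.lower() for word, _ in asr_words]

    # Find runs of words that appear, in order, in both transcripts.
    matcher = SequenceMatcher(None, cap_tokens, asr_tokens, autojunk=False)

    lags = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            caption_time = caption_words[block.a + k][1]
            asr_time = asr_words[block.b + k][1]
            lags.append(caption_time - asr_time)

    # The median is less sensitive than the mean to a few bad matches.
    return median(lags) if lags else None


def retime_captions(caption_words, lag):
    """Shift every caption word earlier by `lag` seconds (floored at 0)."""
    return [(word, max(0.0, time - lag)) for word, time in caption_words]


# Hypothetical data: captions carry the right text but arrive 10 s late;
# the ASR output has the right times but mishears "brown" as "crown".
captions = [("the", 12.0), ("quick", 12.5), ("brown", 13.0), ("fox", 13.5)]
asr = [("the", 2.0), ("quick", 2.5), ("crown", 3.0), ("fox", 3.5)]

lag = estimate_caption_lag(captions, asr)
aligned = retime_captions(captions, lag)
```

In this sketch the caption text is kept as the authoritative wording while the ASR timestamps supply the timeline, reflecting the complementary strengths noted above. A real transcript may drift rather than lag by a fixed amount, in which case a per-segment estimate would be needed.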
It will be understood that the present methods may also include and encompass computer-readable media having computer-executable instructions for performing steps or functions of the methods described herein and that the systems described herein may include computer networks and other systems capable of implementing such methods.
The above features as well as additional features and aspects of the present invention are disclosed herein and will become apparent from the following description of preferred embodiments of the present invention.