The invention generally relates to computer systems and computer-executed methods for aligning video clips to closed caption files.
In general, video clips are short clips of video, usually part of a longer recording. If video clips originate from broadcast video content (e.g., over-the-air, cable, satellite, and so forth), there is frequently closed captioning associated with the broadcast. In general, closed captioning is the process of displaying text on a television, video screen or other visual display to provide additional or interpretive information to individuals who wish to access it. Closed captions typically show a transcription of the audio portion of a program as it occurs (either verbatim or in edited form), sometimes including non-speech elements.
Matching video clips delivered on the web, on smart phones, and so forth to the relevant closed caption text is less expensive than human transcription and yields better results than purely automated speech-to-text methods, because the closed caption files were generated by a human. However, the closed caption typically does not exactly match the spoken words; it usually differs considerably because the closed captioner focuses on important words, makes mistakes, and so forth.
The closed caption also lags the broadcast video because the closed captioner needs to watch and hear the video and then input the corresponding closed caption. This lag varies. For pre-recorded (as opposed to live) content, there may be no lag at all because the lag was already edited out. If one uses an automated technique, such as speech-to-text, to generate words from the video clip to assist in an alignment process, there will often be recognition errors. The variability in lag, along with the errors in both the closed caption text and the speech-to-text output, makes alignment complicated.
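By way of illustration only, and not as a description of the invention, the difficulty can be seen in a minimal Python sketch that estimates a single lag value from two noisy word sequences. The function name estimate_caption_lag, the (word, time in seconds) input format, and the sample data are all assumptions made for this example. The sketch matches words with the standard library's difflib.SequenceMatcher and takes the median time difference over the words that match.

```python
from difflib import SequenceMatcher
from statistics import median


def estimate_caption_lag(caption_words, stt_words):
    """Estimate how far the captions trail the audio, in seconds.

    Both arguments are lists of (word, time_in_seconds) pairs; this input
    format is an assumption made for the example. Returns the median of
    (caption time - speech time) over matched words, or None if the two
    sequences share no words.
    """
    cap_tokens = [word.lower() for word, _ in caption_words]
    stt_tokens = [word.lower() for word, _ in stt_words]

    # autojunk=False disables the heuristic that ignores very frequent tokens.
    matcher = SequenceMatcher(a=cap_tokens, b=stt_tokens, autojunk=False)

    lags = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            cap_time = caption_words[block.a + k][1]
            stt_time = stt_words[block.b + k][1]
            lags.append(cap_time - stt_time)

    return median(lags) if lags else None


# Hypothetical data: the captioner dropped "the", and the recognizer
# misheard "sharply" as "sharpie".
caption = [("markets", 5.0), ("fell", 5.4), ("sharply", 5.9), ("today", 6.3)]
speech = [("the", 1.8), ("markets", 2.0), ("fell", 2.5),
          ("sharpie", 2.9), ("today", 3.2)]

lag = estimate_caption_lag(caption, speech)
print(f"estimated lag: {lag:.1f} s")
```

Only three of the words match in this small sample, and the sketch reduces them to one global lag. Because the lag varies over the course of a broadcast, a single estimate of this kind would not be sufficient by itself.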
Further, many media broadcasters do not have the closed caption text readily available, so frequently, one needs to capture closed captions from a live broadcast stream.