Generally speaking, closed captions are text messages that appear at the bottom of a display screen during movies, television programs, newscasts and other productions, to help hearing impaired persons understand the content of the production. The closed captions typically appear and disappear at times that are roughly synchronized to words that are spoken in the production. For example, where a television program includes a number of people engaged in conversation, the content of that conversation would appear at the bottom of the screen roughly synchronous to each conversant's dialogue. Further, the closed captions could also indicate the presence of other sounds, such as music playing or a door slamming, to more completely indicate auditory clues that convey to the viewer what is happening.
Closed captions can be generated either on-line or off-line. On-line closed captions are typed into a system that merges them with the production while the action is occurring, such as during a live television news broadcast. Because of the immediacy of on-line generated closed captions, they contain a higher percentage of errors and can be significantly misaligned with the corresponding spoken words. In contrast, off-line closed captions typically include fewer errors because they are generated post production, i.e., from pre-recorded materials such as movies or taped television programs.
Off-line closed captions are sometimes referred to as "pop-on captions" due to the fact that they pop onto the screen as an actor speaks. Accordingly, off-line closed captions can be well placed such that they more closely agree with the actor's speech. However, even off-line closed captions can be misaligned with respect to the related spoken words. To further aid the viewer, the off-line closed captions can be manipulated to appear shortly before the actor speaks and disappear shortly after the actor stops speaking, thereby providing extra reading time.
Regardless of the method used to perform closed captioning, an approximate textual transcript of the program is produced. That textual transcript includes information that indicates when each caption should appear on the screen and when it should disappear. That information includes the text of the caption, the time stamp at which it should be displayed on the screen and the duration of time that it should remain displayed. Once the closed caption data is stored in a computer system memory, along with the related digitized audio and video data, it can be used in conjunction with an internet search engine to index portions of the related program. Accordingly, a search engine such as Compaq Computer Corporation's AltaVista engine can retrieve relevant audio, video or transcript portions of programs in response to a keyword selected by a system user. Once the data is located, the selected portion of the program can be displayed either textually or in a multimedia manner.
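As an illustration only, the caption information described above (text, time stamp and duration) and a keyword lookup over it might be sketched as follows. The names `Caption` and `find_keyword`, and the use of seconds as the time unit, are assumptions made for this sketch; they are not taken from any particular captioning system or search engine.

```python
import re
from dataclasses import dataclass


@dataclass
class Caption:
    text: str        # text of the caption
    start: float     # time stamp (seconds, assumed unit) at which it appears
    duration: float  # how long (seconds) it remains displayed

    @property
    def end(self) -> float:
        # Time at which the caption disappears.
        return self.start + self.duration


def find_keyword(captions, keyword):
    """Return the (start, end) span of every caption containing the keyword."""
    needle = keyword.lower()
    hits = []
    for cap in captions:
        # Tokenize on letters, digits and apostrophes so punctuation is ignored.
        words = re.findall(r"[a-z0-9']+", cap.text.lower())
        if needle in words:
            hits.append((cap.start, cap.end))
    return hits
```

Each returned span identifies the portion of the program that would be played back for the keyword, which is why the accuracy of `start` and `duration` matters for searching.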
While searching multimedia documents is useful for many applications, it requires that the closed caption data be very closely aligned with the related audio and video data. For example, if the closed caption data is closely aligned to the audio and video data, a user will not need to parse through a large amount of unrelated program time to view the desired information. It has generally been determined that in order for such searching to be effective, the alignment of the audio and video data to the closed caption data should be accurate to within a fraction of a second. In other words, there should be a very small discrepancy between the time that a word is spoken on the program and the time stamp value of the corresponding closed caption word.
Prior art approaches have been used to automatically or semi-automatically generate closed caption data from a non-time-stamped transcription of an associated audio data stream. Such approaches typically include a series of steps that are recursively applied until closed captions have been generated for the entire transcript. During a first pass, a vocabulary and language model is generated using the words of the entire transcript. That vocabulary and language model is used by a speech recognizer to generate a hypothesized word list from the audio data. The word list is annotated with time stamps that indicate the relative time within the audio stream that each word or group of words was detected. Since the hypothesized word list is only a best guess at the spoken words, a confidence score is generated that indicates the likelihood that a given word has been correctly recognized.
After the speech recognition operation is complete, the transcript is broken into sections that are delineated by words, referred to as anchors, that have high confidence scores. During a subsequent pass, a new vocabulary and language model is generated using words from a selected section and the speech recognition process is repeated. During each pass, smaller and smaller sections are identified until a majority of the closed caption stream agrees with the transcript and has been aligned with the audio stream. Such a process is effective for generating closed captions but is extremely time consuming due to its repetitive nature. Further, this approach does not take advantage of pre-existing time stamps, such as those recorded during transcription.
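The anchor-based sectioning used by such prior approaches can be pictured with the following minimal sketch. It assumes the hypothesized word list is a list of `(word, time, confidence)` tuples and that a fixed confidence threshold selects the anchors; the function name `split_at_anchors` and the threshold value are hypothetical choices for illustration.

```python
def split_at_anchors(hypothesis, threshold=0.9):
    """Split a hypothesized word list at high-confidence words (anchors).

    hypothesis: list of (word, time, confidence) tuples.
    Returns (anchors, sections): anchors are indices of high-confidence
    words; sections are half-open index ranges (begin, end) of the
    low-confidence words lying between anchors, to be re-recognized
    in a later pass.
    """
    anchors = [i for i, (_, _, conf) in enumerate(hypothesis) if conf >= threshold]
    sections = []
    prev = -1
    # Treat the end of the list as a final boundary so a trailing run is kept.
    for boundary in anchors + [len(hypothesis)]:
        if boundary - prev > 1:
            sections.append((prev + 1, boundary))
        prev = boundary
    return anchors, sections
```

Each pass would shrink the sections further, which illustrates why the repeated recognition passes make the approach slow.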
A method and apparatus are desired for automatically aligning closed caption data with associated audio data, such that temporally precise off-line closed captioning operations are no longer necessary, such that closed caption data can be more precisely indexed to a requested keyword by a search engine, and such that the quality of pre-existing closed captions is improved. Further, with such a structure, closed captions can be made to appear and disappear in direct relation to associated spoken words and phrases. Accordingly, hearing impaired viewers can more easily understand the program that is being displayed.
More specifically, a method and apparatus are provided for aligning roughly aligned closed caption data with associated portions of an audio data stream. The method includes breaking the audio data stream and the roughly aligned closed caption data into a number of sections. The sections are delineated by a selected characteristic of the associated closed caption data, such as a significant time difference between time stamps. The sections are segmented into a number of chunks and the closed captions within each of the chunks are aligned to the associated portion of the audio stream.
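The sectioning and chunking steps above can be sketched as follows, assuming the roughly aligned caption data is a time-ordered list of `(word, time_stamp)` pairs. The gap threshold `max_gap`, the fixed `chunk_size`, and both function names are assumptions for this sketch; the text does not specify how chunk boundaries are chosen, so plain fixed-size slicing is used here as a simplification.

```python
def split_sections(words, max_gap=2.0):
    """Break caption words into sections at significant time-stamp gaps.

    words: list of (word, time_stamp) pairs sorted by time_stamp (seconds).
    A new section starts whenever consecutive time stamps differ by more
    than max_gap seconds.
    """
    sections, current = [], []
    for item in words:
        if current and item[1] - current[-1][1] > max_gap:
            sections.append(current)
            current = []
        current.append(item)
    if current:
        sections.append(current)
    return sections


def split_chunks(section, chunk_size=3):
    """Segment one section into consecutive chunks of at most chunk_size words."""
    return [section[i:i + chunk_size] for i in range(0, len(section), chunk_size)]
```

Each chunk would then be handed to the alignment step together with the corresponding portion of the audio stream.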
The method for aligning roughly aligned closed caption data within a chunk can also include the steps of generating a dictionary, in which each word in the closed caption data is expanded into a sequence of phonemes, and forming a language model. A speech recognition operation is performed on the audio data stream to generate a sequence of recognized words, each located in the audio stream, that are later associated with the words of the original transcription. Subsequently, each word of the closed caption data is matched with a corresponding word in the audio stream. In response, the time stamps associated with the words of the closed caption data contained in the chunk are modified in relation to the time stamps of the associated words from the audio stream. The time stamps associated with the words of the closed caption data can be modified to be the same as the time stamps of the associated words from the audio stream. Alternatively, the time stamps associated with the first and last words of the closed caption data can be modified to be a selected time before and after the time stamps of the associated words.
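The matching and time stamp modification steps can be illustrated with the sketch below. It takes the recognizer output as given (a list of `(word, audio_time)` pairs) and uses Python's `difflib.SequenceMatcher` to pair caption words with recognized words; that matcher is a stand-in chosen for brevity, not the matching procedure of the described method. The function name `realign_chunk` and the `lead` and `lag` parameters, which model the "selected time before and after", are likewise assumptions.

```python
from difflib import SequenceMatcher


def realign_chunk(caption, recognized, lead=0.0, lag=0.0):
    """Copy audio time stamps onto matching caption words.

    caption:    list of (word, rough_time) pairs from the closed captions.
    recognized: list of (word, audio_time) pairs from speech recognition.
    lead, lag:  optional padding (seconds) applied to the first and last
                caption words, only when those words were matched.
    Unmatched caption words keep their rough time stamps.
    """
    cap_words = [w.lower() for w, _ in caption]
    rec_words = [w.lower() for w, _ in recognized]
    aligned = list(caption)
    matched = set()
    matcher = SequenceMatcher(a=cap_words, b=rec_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            i, j = block.a + k, block.b + k
            aligned[i] = (caption[i][0], recognized[j][1])
            matched.add(i)
    last = len(aligned) - 1
    if 0 in matched:
        # Show the first word slightly before it is spoken.
        aligned[0] = (aligned[0][0], aligned[0][1] - lead)
    if last in matched:
        # Keep the last word slightly after it is spoken.
        aligned[last] = (aligned[last][0], aligned[last][1] + lag)
    return aligned
```

With `lead` and `lag` left at zero, matched words simply take the time stamps of their counterparts in the audio stream; nonzero values give the extra reading time described for pop-on captions.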
After the chunk is aligned, a subsequent chunk that is associated with another portion of the roughly aligned closed caption data is selected, using the last word of the previous chunk, i.e., the anchor, as the first word to which the new chunk will be aligned.
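The chaining of chunks through a shared anchor word might look like the following sketch, in which every chunk after the first begins with the last word of the chunk before it. The function name `chained_chunks` and the fixed `chunk_size` are assumptions for illustration.

```python
def chained_chunks(words, chunk_size=4):
    """Split words into chunks that overlap by one word (the anchor).

    The last word of each chunk is reused as the first word of the next,
    so an already aligned word pins the start of the following chunk.
    """
    if chunk_size < 2:
        raise ValueError("chunk_size must be at least 2 to share an anchor")
    if not words:
        return []
    step = chunk_size - 1  # advance by one less than the size to overlap
    stop = max(len(words) - 1, 1)
    return [words[i:i + chunk_size] for i in range(0, stop, step)]
```

For example, seven words with `chunk_size=4` yield two chunks, and the fourth word appears in both.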
The method also determines whether the speech recognition operation has correctly generated the words that correspond to the associated closed caption data. In response to a determination that the speech recognition operation has generated words that do not correspond to the associated closed caption data, a recursive alignment operation is performed. The recursive alignment operation identifies a pair of aligned words that delineates the un-aligned portion of the closed caption data that includes the incorrect words. A language model that includes the words from the un-aligned portion is generated and a second speech recognition operation is performed on the portion of the audio data stream that is associated with the un-aligned portion. Those operations are recursively performed until the roughly aligned closed caption data is aligned with the audio data.
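The recursive alignment operation can be sketched under strong simplifying assumptions. In the code below, `recognize(vocab, span)` is a hypothetical callable standing in for a speech recognizer whose language model is restricted to `vocab`; `fake_recognize` is a toy stand-in that mishears one word when the vocabulary is large, included only so the sketch runs. Matching again uses `difflib.SequenceMatcher` as a convenience, and `max_depth` is an added safeguard not mentioned in the text.

```python
from difflib import SequenceMatcher


def align_recursive(words, span, recognize, offset=0, depth=0, max_depth=3):
    """Return {caption word index: audio time} for the words in span.

    Unaligned runs that lie between a pair of aligned words are
    re-recognized with a vocabulary limited to the words in that run.
    """
    hyp = recognize(set(words), span)
    hyp_words = [w for w, _ in hyp]
    times = {}
    matcher = SequenceMatcher(a=words, b=hyp_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            times[offset + block.a + k] = hyp[block.b + k][1]
    if depth >= max_depth:
        return times
    aligned = sorted(i - offset for i in times)
    for left, right in zip(aligned, aligned[1:]):
        if right - left > 1:  # unaligned words between two aligned words
            gap_words = words[left + 1:right]
            gap_span = (times[offset + left], times[offset + right])
            times.update(align_recursive(gap_words, gap_span, recognize,
                                         offset + left + 1, depth + 1, max_depth))
    return times


# Toy stand-in for a recognizer: knows the true word times, but confuses
# "meet" with "meat" whenever its vocabulary has more than three words.
TRUTH = [("we", 1.0), ("will", 1.3), ("meet", 1.6), ("at", 1.9), ("noon", 2.2)]


def fake_recognize(vocab, span):
    low, high = span
    heard = [(w, t) for w, t in TRUTH if low <= t <= high and w in vocab]
    if len(vocab) > 3:
        heard = [("meat" if w == "meet" else w, t) for w, t in heard]
    return heard
```

In this toy run, the first pass fails on "meet", the words "will" and "at" delineate the unaligned portion, and the second pass with the smaller vocabulary recovers the missing time stamp.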