Many challenges exist in the efficient production of closed captions or, more generally, time-aligned transcripts. Closed captions are textual transcriptions of the audio track of a television program, similar to the subtitles of a movie.
A closed caption (or CC) is typically a triplet of (sentence, time value, duration). The time value is used to decide when to display the closed caption on the screen, and the duration is used to determine when to remove it. Closed captions are produced either off-line or on-line. Off-line closed captions are edited and precisely time-aligned by an operator so that they appear on the screen at the moment the words are spoken. On-line closed captions are generated live, during television newscasts for instance.
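The caption triplet described above can be sketched as a small data structure. This is a minimal illustration only: the names `Caption` and `visible_at`, and the choice of seconds as the time unit, are assumptions made for the example and do not come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caption:
    """Hypothetical representation of the (sentence, time value, duration) triplet."""
    text: str        # the sentence to display
    start: float     # time value, in seconds (assumed unit), when the caption appears
    duration: float  # how long, in seconds, the caption stays on screen

    @property
    def end(self) -> float:
        # The caption is removed once its duration has elapsed.
        return self.start + self.duration


def visible_at(captions, t):
    """Return the captions that should be on screen at time t (seconds)."""
    return [c for c in captions if c.start <= t < c.end]


captions = [
    Caption("Good evening.", start=1.0, duration=2.0),
    Caption("Here is the news.", start=3.0, duration=1.5),
]
on_screen = visible_at(captions, 2.0)
```

Here the time value decides when a caption is shown and the duration decides when it is removed, so a player only needs the current time to select what to display.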
Captions can be displayed on the screen in different styles: pop-on, roll-up or paint-on. Pop-on closed captions appear and disappear at once. Because they require precise timing, they are created in post-production. Roll-up closed captions scroll up within a window of three or four lines. This style is typically used for live broadcasts, such as news. In that case, an operator using a stenotype keyboard enters the caption content live. Paint-on captions have a similar style to pop-on captions, except that they are painted on top of the existing captions, one character at a time.
Captioning a video program is a costly and time-consuming process, at approximately $1,000 per hour. That figure covers the whole service: transcription, time alignment and the text editing needed to make the captions comfortable to read.
The number of closed-captioned programs increased dramatically in the United States because of new federal laws:
The landmark Americans with Disabilities Act (or ADA) of 1990 makes broadcasts accessible to the deaf and hard-of-hearing;
FCC Order #97-279 requires that 95% of all new broadcast programs be closed captioned by 2006;
The TV Decoder Circuitry Act requires all televisions 13 inches or larger offered for sale in the United States to have a built-in closed caption decoder.
In several other countries, legislation also requires television programs to be captioned. In addition, digital video disks (DVDs) come in multi-lingual versions and often require subtitles in more than one language for the same movie. Because of the recent changes in legislation and the new video media, the demand for captioning and subtitling has increased tremendously.
The current systems used to produce closed captions are fairly primitive. They mostly focus on formatting the text into captions, synchronizing them with the video and encoding the final videotape. The text has to be transcribed first, or at best imported from an existing file. Transcription is done in one of several ways: the typist can use a PC with a standard keyboard, or a stenotype keyboard such as those used by court reporters. At this point in the process, the timing information has been lost and must be rebuilt. The closed captions are then made from the transcription by manually splitting the text in a word processor. This segmentation can be based on the punctuation or determined by the operator. The resulting breaks do not reflect how the text was spoken unless the operator listens to the tape while proceeding. The closed captions are then positioned on the screen and their style (italics, colors, uppercase, etc.) is defined. They may appear at different locations depending on what is already on the screen. Next, the captions are synchronized with the audio: the operator plays the video and hits a key as soon as the first word of each caption is spoken. Finally, the captions are encoded on the videotape using a caption encoder.
In summary, the current industry systems work as follows:
Import transcription from word processor or use built-in word processor to input text;
Break lines manually to delimit closed captions;
Position captions on screen and define their style;
Time mark the closed captions manually while the audio track is playing;
Generate the final captioned videotape.
Thus, improvements are desired.
The parent invention provides an efficient system for producing off-line closed captions (i.e., time-aligned transcriptions of a source audio track). Generally, that process includes:
1. classifying the audio and selecting spoken parts only, generating non-spoken captions if required;
2. transcribing the spoken parts of the audio track by using an audio rate control method;
3. adding time marks to the transcription text using time of event keystrokes;
4. precisely re-aligning the transcription with the original audio track; and
5. segmenting transcription text into closed captions.
The present invention is directed to the audio rate control method of step 2, and in particular provides a method and apparatus for controlling the rate of playback of audio data. The rate of speech of the audio data is determined, preferably using speech recognition. The determined rate of speech is compared to a target rate. Based on the comparison, the playback rate is adjusted, i.e., increased or decreased, to match the target rate.
The target rate may be predefined or indicative of rate of transcription by a transcriber.
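The comparison between the measured rate of speech and the target rate can be illustrated with a short sketch. It assumes, for the sake of the example, that a speech recognizer supplies a start and end time for each recognized word, that rates are expressed in words per minute, and that the playback factor is clamped to a range of 0.5 to 2.0; none of these specifics are stated in the text, and the function names are hypothetical.

```python
def speech_rate_wpm(word_times):
    """Estimate the rate of speech in words per minute.

    word_times: list of (start, end) times in seconds, one pair per
    recognized word (assumed to come from a speech recognizer).
    """
    if not word_times:
        return 0.0
    span = word_times[-1][1] - word_times[0][0]
    if span <= 0:
        return 0.0
    return 60.0 * len(word_times) / span


def playback_factor(measured_wpm, target_wpm, lo=0.5, hi=2.0):
    """Return the speed multiplier that brings the audio to the target rate.

    A factor above 1.0 speeds playback up, below 1.0 slows it down.
    The clamping bounds lo and hi are illustrative assumptions.
    """
    if measured_wpm <= 0:
        return 1.0  # no speech detected: leave the playback rate unchanged
    return max(lo, min(hi, target_wpm / measured_wpm))


# Four words spoken over two seconds.
words = [(0.0, 0.4), (0.5, 0.9), (1.0, 1.5), (1.6, 2.0)]
measured = speech_rate_wpm(words)
factor = playback_factor(measured, target_wpm=60.0)
```

When the target rate reflects the transcriber's typing rate, `target_wpm` would be updated from the transcriber's recent output rather than fixed in advance.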
The playback rate is adjusted without changing the pitch of the corresponding speech.
Time domain or frequency domain techniques may be employed to effect adjustment of the playback rate. The time domain techniques may include interval sampling and/or silence removal.
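As an illustration of one time domain technique, the sketch below shortens silences in a sequence of audio samples. It is a simplified example under stated assumptions, not the method of the invention: the frame length, the amplitude threshold, the number of silent frames retained, and the function name `remove_silence` are all hypothetical choices.

```python
def remove_silence(samples, frame_len, threshold, keep_frames=1):
    """Shorten long silences in a list of audio samples.

    The signal is cut into frames of frame_len samples. A frame whose
    mean absolute amplitude is below threshold is treated as silent.
    At most keep_frames consecutive silent frames are kept, so pauses
    are shortened rather than removed entirely.
    """
    out = []
    silent_run = 0
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        level = sum(abs(s) for s in frame) / len(frame)
        if level < threshold:
            silent_run += 1
            if silent_run > keep_frames:
                continue  # drop this frame: the pause is already long enough
        else:
            silent_run = 0
        out.extend(frame)
    return out


# One loud frame, three silent frames, one loud frame (4 samples per frame).
signal = [0.5] * 4 + [0.0] * 12 + [0.5] * 4
shortened = remove_silence(signal, frame_len=4, threshold=0.1)
```

Because the spoken frames are copied unchanged, the speech itself keeps its original pitch; only the pauses between words become shorter, which reduces the total playback time.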