Just as a caption in a book is the text under a picture, captions on video are text located somewhere on the picture. Closed captions are captions that are hidden in the video signal and are invisible without a special decoder. They are carried on line 21 of the vertical blanking interval (VBI). Open captions are captions that have been decoded, so they become an integral part of the television picture, like subtitles in a movie. In other words, open captions cannot be turned off. The term “open captions” is also used to refer to subtitles created with a character generator.
Within the prior art, captions are commonly generated by voice recognition, manual human entry, or a combination of these techniques. Once generated by any of these approaches, the captions have to be edited. In particular, the captions may have to be proofread for correctness and, if the caption-generation process has not already done so, properly keyed to the video itself. For instance, a given caption may have a timestamp, or temporal position, in relation to the video that indicates when the caption is to be displayed on the video. Furthermore, a caption may have a particular location at which it is to be displayed. For example, if two people on the video are speaking with one another, captions corresponding to spoken utterances of the left-most person may be placed on the left part of the video, and captions corresponding to spoken utterances of the right-most person may be placed on the right part of the video.
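The two attributes described above, a temporal position and a display location, can be pictured as fields of a simple caption record. The following is only an illustrative sketch: the names `Caption`, `position_for_speaker`, and `captions_visible_at`, the use of seconds for timestamps, and the "left"/"right" placement labels are all assumptions made for this example and are not drawn from any particular captioning system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Caption:
    """Hypothetical caption record: text plus when and where to show it."""
    text: str
    start_seconds: float       # temporal position at which display begins
    end_seconds: float         # temporal position at which display ends
    position: str = "center"   # assumed placement label on the picture


def position_for_speaker(speaker_x: float, frame_width: float) -> str:
    """Place a caption on the same half of the frame as its speaker."""
    return "left" if speaker_x < frame_width / 2 else "right"


def captions_visible_at(captions: List[Caption], t: float) -> List[Caption]:
    """Return the captions whose display interval contains time t."""
    return [c for c in captions if c.start_seconds <= t < c.end_seconds]
```

For example, under these assumptions, a caption for a speaker standing at horizontal pixel 200 of a 1280-pixel-wide frame would be assigned to the left part of the video, and one for a speaker at pixel 1000 to the right part.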
Within the prior art, there are three general types of conventional caption-editing systems. First, there is an editor-type caption-editing system, in which captions are edited for spoken utterances within video on a group-of-lines basis, without respect to particular lines of the captions and without respect to temporal positioning of the captions in relation to the spoken utterances. Such a caption-editing system may even include multiple-line editing capabilities within computer programs like word processors. In this type of system, there is no timestamping of the captions to the video, since the captions are generated for the video, or sections of the video, as a whole, without regard to temporal positioning. This type of system is also commonly referred to as “summary writing” or “listening dictation.” This type of system is useful where there are many errors in the captions themselves, since editing can be accomplished without regard to which lines of the captions temporally correspond to which parts of the video. However, it does require temporal positioning (i.e., timestamping) to be added later, which is undesirable.
Second, there is a line-based caption-editing system, in which captions are generated for spoken utterances within video on a line-by-line basis with respect to particular lines of the captions and with respect to temporal positioning of the captions in relation to the spoken utterances. Line-based caption-editing systems thus operate on the timestamps of the captions in relation to the video, one caption line at a time. This type of system is very effective for captions that are generated without errors, especially since temporal positioning (i.e., timestamping) is accomplished as part of the captioning process. However, where there are many errors within the captions, correction can become difficult, since the temporal positioning of the lines may become incorrect as a result of modification of the lines themselves. For instance, lines may be deleted, added, or merged in the process of editing, which can render the previous temporal positioning (i.e., timestamping) incorrect, which is undesirable as well.
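The way that merging lines can invalidate previously assigned timestamps may be illustrated with a small sketch. The source does not specify how a line-based system stores or updates timestamps, so everything here is assumed for illustration: the `CaptionLine` record, the two merge functions, and the `uncovered_gaps` check are hypothetical. The naive merge keeps only the first line's timing, which leaves part of the video without a caption, whereas the retimed merge extends the interval to cover both original lines.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CaptionLine:
    """Hypothetical timestamped caption line (times in seconds)."""
    start: float
    end: float
    text: str


def merge_naive(lines: List[CaptionLine], i: int) -> List[CaptionLine]:
    """Merge line i+1 into line i, keeping only line i's timestamps."""
    a, b = lines[i], lines[i + 1]
    merged = CaptionLine(a.start, a.end, a.text + " " + b.text)
    return lines[:i] + [merged] + lines[i + 2:]


def merge_retimed(lines: List[CaptionLine], i: int) -> List[CaptionLine]:
    """Merge line i+1 into line i, spanning both original intervals."""
    a, b = lines[i], lines[i + 1]
    merged = CaptionLine(a.start, b.end, a.text + " " + b.text)
    return lines[:i] + [merged] + lines[i + 2:]


def uncovered_gaps(lines: List[CaptionLine]) -> List[Tuple[float, float]]:
    """Return (end, next_start) pairs where no caption line is displayed."""
    return [(p.end, n.start) for p, n in zip(lines, lines[1:]) if n.start > p.end]
```

In this sketch, merging the first two of three back-to-back lines with `merge_naive` leaves the interval formerly covered by the second line without any caption, while `merge_retimed` leaves no gap. A real editing session would also have to handle deleted and added lines, which this example does not attempt.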
A third type of caption-editing system is a respeaking caption-editing system. In respeaking, a specialist with a proven high voice-recognition rate respeaks the voices of various speakers on video, in order to convert them into voices with a higher voice-recognition rate. This approach is disadvantageous, however, because it is very labor-intensive and requires highly skilled labor, in that only people who have proven high voice-recognition rates should respeak the voices of the speakers on the video. Thus, of the three types of caption-editing systems within the prior art, the editor-type system is useful where voice recognition results in many errors, the line-based system is useful where voice recognition results in few errors, and the respeaking system is relatively expensive.
In a given video, however, there may be sections in which voice recognition achieves a high degree of accuracy on the spoken utterances in question, and other sections in which it does not. Using an editor-type caption-editing system thus achieves good results for the latter sections but not for the former sections. By comparison, using a line-based caption-editing system achieves good results for the former sections but not for the latter sections. Therefore, there is a need for achieving good caption results for all sections of video, regardless of whether the voice recognition yields accurate results or not. For this and other reasons, there is a need for the present invention.