As the performance of information technology devices has improved dramatically in recent years and more computer networks such as broadband networks have come into service, distribution of digital content such as video content has become popular. Video content is easier for users to comprehend and more convincing than static content such as text. As cable and communication satellite television broadcasting comes into widespread use, more television channels are becoming available. It is expected that video will be widely used in various application domains.
To appropriately provide the information in video images to more users, it is necessary to display captions representing the content of speech. A study group in Japan has set a goal of captioning 100 percent of telecast videos by 2007. Accordingly, there is a social demand for advances in technology for applying proper captions to video.
The following documents are considered:
[Patent document 1] Published Unexamined Patent Application No. 10-254478.
[Patent document 2] Published Unexamined Patent Application No. 2000-89786.
[Patent document 3] Published Unexamined Patent Application No. 10-136260.
[Non-patent document 1] Seigo Tanimura et al., “Automatic Alignment of a Sound Track to a Script in a TV Drama” (Natural Language Processing, 26-4, May 28, 1999).
Methods have been proposed for generating captions by using speech recognition technology, in which speech is recognized and character strings representing the content of the speech are generated. However, speech recognition can make recognition errors and consequently produce incorrect character strings. Furthermore, it cannot appropriately output punctuation marks and symbols because they are not spoken. Therefore, speech recognition cannot be applied directly to caption generation, and the results of speech recognition are modified to generate captions (see patent document 2).
Another method has been proposed in which the script of the speech in a video is divided into character strings of appropriate lengths, which are then displayed at the proper times. However, even with the aid of sophisticated video editing software, it is difficult to determine the proper timing manually. Therefore, techniques have been proposed in which reproduced speech is compared with a script to determine the time point at which each character string in the script should be displayed (see patent documents 1 and 3). Non-patent document 1 will be described later.
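The division of a script into character strings of appropriate lengths can be illustrated with a minimal sketch. The function name split_script and the limit max_chars are hypothetical, and breaking only at whitespace is an assumption that suits space-delimited languages; none of this is taken from the cited documents. The sketch covers only the division step, not the timing determination, which is the difficult part.

```python
import textwrap
from typing import List


def split_script(script: str, max_chars: int) -> List[str]:
    """Divide a script into caption strings of at most max_chars characters.

    Hypothetical helper: breaks only at whitespace, so it assumes a
    space-delimited language and words no longer than max_chars.
    """
    return textwrap.wrap(script, width=max_chars)


if __name__ == "__main__":
    for caption in split_script("the quick brown fox jumps", 10):
        print(caption)
```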
The techniques described in patent documents 1 and 3 first analyze the speech and treat each period containing no utterance as a break between sentences. Then, the phoneme at the beginning of each sentence found through speech analysis is compared with the phoneme at the beginning of each sentence in the script to establish the correspondence between the speech and the script text. This correspondence indicates that each sentence in the script should be displayed at the time point at which the speech corresponding to that sentence is uttered.
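The two steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation of patent documents 1 and 3: the per-frame energy input, the threshold and pause parameters, the in-order matching rule, and both function names are hypothetical, and the first phoneme of each speech segment is assumed to be supplied by some separate phoneme recognizer.

```python
from typing import List, Optional, Tuple


def find_speech_segments(
    energies: List[float],
    frame_sec: float,
    threshold: float,
    min_pause_sec: float,
) -> List[Tuple[float, float]]:
    """Return (start, end) times of speech runs separated by long pauses.

    A frame counts as voiced when its energy reaches the threshold. A run of
    silent frames lasting at least min_pause_sec is assumed to be a sentence break.
    """
    min_pause_frames = round(min_pause_sec / frame_sec)
    segments = []
    start = None        # index of the first voiced frame of the current segment
    last_voiced = None  # index of the most recent voiced frame
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            last_voiced = i
        elif start is not None and i - last_voiced >= min_pause_frames:
            segments.append((start * frame_sec, (last_voiced + 1) * frame_sec))
            start = None
    if start is not None:
        segments.append((start * frame_sec, (last_voiced + 1) * frame_sec))
    return segments


def align_by_first_phoneme(
    segment_phonemes: List[str],
    script_phonemes: List[str],
) -> List[Optional[int]]:
    """Map each speech segment to a script sentence index, or None.

    Each segment is matched, in order, to the next unused script sentence
    whose first phoneme equals the segment's first phoneme (an assumed rule).
    """
    result = []
    next_sentence = 0
    for phoneme in segment_phonemes:
        match = None
        for j in range(next_sentence, len(script_phonemes)):
            if script_phonemes[j] == phoneme:
                match = j
                break
        result.append(match)
        if match is not None:
            next_sentence = match + 1
    return result
```

In this sketch, a segment that begins after a pause in the middle of a sentence either finds no matching script sentence or is matched to the wrong one, because the pause is treated as a sentence break.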
However, a period during which no utterance appears is not necessarily a break between sentences. For instance, a speaker may pause when hesitating, being puzzled, breathing, or momentarily thinking, for emphasis, or in various other situations. Therefore, it is difficult to properly identify breaks between sentences and to find the speech corresponding to each sentence in a script by using the above technologies. If the speech and the display of captions do not coincide, problems arise: for example, no caption may be displayed when a speaker has started speaking, or the answer to a quiz may be displayed before the speaker starts to speak.
Moreover, because these technologies display each sentence in a script directly as a caption without modification, sentences cannot be divided or combined in consideration of readability for users or the size of the display screen. Furthermore, the technologies generate similar captions regardless of the accuracy of speech recognition. Therefore, they will be unable to improve the accuracy of captioning even if the accuracy of speech recognition increases in the future.