The present invention relates to a caption correction device. Particularly, the present invention relates to a device, a method, a program and the like for correcting, in real time, a caption for a speech recognition result of a presentation or the like.
In recent years, provision of captions for information transmitted through speech has been actively encouraged in order to secure accessibility for people with hearing disabilities, seniors and the like. It is also conceivable that there is a substantial need to provide captions for speech in presentations or the like. As conventional methods of providing such captions, the following two typical methods can be cited.
Respeak
Respeak is a method in which an intermediary speaker (a respeaker) listens to the speech made by the actual speaker and repeats it to a speech recognition system. Since the respeaker is specially trained, he/she can respeak the speech at a high recognition rate even in such a difficult situation.
Stenography
Stenography is a method in which, generally, a few people take turns to input the contents provided by a speaker while summarizing them.
However, it is conceivable that such manual provision of captions is unlikely to spread due to its high cost per unit time. For this reason, many methods have been proposed of creating captions in real time by using a speech recognition technique. For example, Japanese Patent Laid-Open Official Gazette No. Hei 6 (1994)-141240 discloses a technique for creating captions by speech recognition using a method of deciding optimum assumptions in production of TV programs, and the like. Moreover, Japanese Patent Laid-Open Official Gazette No. 2001-092496 discloses a technique for improving a speech recognition rate by 2-pass processing. On the other hand, techniques have been disclosed for supporting operations of checking and correcting speech recognition results, which are manually performed by a checker (judge), without relying solely on the speech recognition (for example, Japanese Patent Laid-Open Official Gazette Nos. 2003-316384, 2004-151614 and 2005-258198).
Generally, in speech recognition, the desired recognition rate cannot always be obtained in practice. For example, according to information from a certain demonstration experiment field, a recognition rate of at least 85%, preferably 90%, is required for real-time captions. A recognition rate of 85% may be achieved solely by the speech recognition. However, in practice, the recognition rate is heavily dependent on various conditions. For this reason, a sufficient recognition rate cannot be achieved in many cases.
For example, a certain demonstration experiment shows the following results. The average recognition rate is 81.8% (range: 73.4% to 89.2%). In addition, the probability that the recognition rate exceeds 85% is 27%, and the probability that the recognition rate exceeds 90% is 0%.
Furthermore, apart from the problem associated with the recognition rate, there are also many problematic cases such as the following. Words included in speech made by a speaker are erroneously converted by the speech recognition into discriminatory expressions, provocative expressions and the like, which are not intended by the speaker. For example, “JI-TTAI”, which means an entity, is erroneously converted into “JI-I-TAI”, which means one's own dead body. Those expressions are then displayed as captions without being corrected, causing a problem.
Moreover, for the speech recognition, handling of proper names is very important. For this reason, many systems have a dictionary registration function. However, there are cases where several registered words have the same sound but are written in different Chinese characters from one another. In such cases, it is often hard to judge which one of the words is intended, and incorrect conversion is carried out. For example, for the name “Yasuko”, it is not at all uncommon that a plurality of candidates are registered as different proper names that have the same sound, as is the case with “Brown” and “Browne”, which have the same sound but different spellings. Similarly, such systems are also provided with functions for registration and setting of forms such as numerical values. However, the registration is performed in a single uniform way. Accordingly, there is no way of judging, word by word, which one of the forms is intended by the speaker in the case of free speech.
The methods as described in Japanese Patent Laid-Open Official Gazette Nos. Hei 6 (1994)-141240 and 2001-092496 depend solely on the speech recognition result, and do not include a method of checking by humans, a method of correcting incorrect recognition or the like. Accordingly, it is conceivable that the methods are less effective in handling the provocative expressions and discriminatory expressions not intended by the speaker.
Moreover, Japanese Patent Laid-Open Official Gazette No. 2003-316384 discloses the following method. Specifically, when speech is made by a speaker, the speech is converted into a text. A checker judges whether or not each word included in the converted text is incorrect. Thereafter, when a word is judged to be incorrect, the judgment is presented to the speaker. The speaker is then urged to repeat the speech over and over again until the speech is correctly transcribed. However, this method places a burden on the speaker. Furthermore, from the technical perspective, no matter how many times the incorrectly transcribed words are repeated, it is not necessarily the case that those words are correctly transcribed in the end. For this reason, overhead on the speaker increases, and a problem concerning the real-time characteristic also remains.
Furthermore, in the method as described in Japanese Patent Laid-Open Official Gazette No. 2004-151614, it is conceivable that problems of the real-time characteristic and cost still remain, since checking and correction are all performed manually.
Meanwhile, the method of Japanese Patent Laid-Open Official Gazette No. 2005-258198 discloses a device for setting the timing of displaying predetermined contents of speech in synchronization with reproduction of the speech. However, a method of achieving real-time caption display is not disclosed.
As described above, there are many problems in production and correction of real-time captions. Problems to be solved by the present invention are as follows.
Specifically, the first problem to be solved by the present invention concerns the real-time characteristic. As the solution to this problem, the present invention provides a caption display system which can display captions generated by converting speech into characters in real time (in other words, within a maximum allowable delay time). In addition, the second problem to be solved by the present invention concerns the cost. As the solution to this problem, the present invention provides a caption display system which uses a method less expensive than conventional methods such as the respeak and the stenography. Moreover, the third problem to be solved by the present invention concerns the speech recognition. As the solution to this problem, in the present invention, keyword matching is performed so as to further improve understanding compared with the case of performing the simple speech recognition. By performing the keyword matching, incorrect conversions of discriminatory and provocative expressions not intended by a speaker, and incorrect conversions of proper names and forms, are avoided as much as possible compared with the case of performing the simple speech recognition.
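As an illustration only, the keyword matching mentioned above may be sketched as follows. This is a minimal sketch under stated assumptions and not the implementation of the present invention. The names BLOCKED, AMBIGUOUS, Token and match_keywords are hypothetical, as are the contents of the keyword lists. The sketch marks each recognized word that matches a registered keyword, so that such a word can be held for checking instead of being displayed as a caption unchanged.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical keyword lists; the source does not specify their contents.
# Expressions that should not be displayed without being checked.
BLOCKED = {"JI-I-TAI"}
# Words that sound the same but have several registered spellings.
AMBIGUOUS: Dict[str, List[str]] = {"Brown": ["Brown", "Browne"]}


@dataclass
class Token:
    text: str              # word as output by the speech recognition
    needs_check: bool      # True if the word matched a registered keyword
    candidates: List[str]  # alternative spellings, if any are registered


def match_keywords(words: List[str]) -> List[Token]:
    """Mark each recognized word that matches a registered keyword."""
    result = []
    for word in words:
        if word in BLOCKED:
            # Possibly unintended expression: hold it for checking.
            result.append(Token(word, True, []))
        elif word in AMBIGUOUS:
            # Same sound, several spellings: offer the candidates.
            result.append(Token(word, True, list(AMBIGUOUS[word])))
        else:
            # No keyword matched: the word can be displayed as is.
            result.append(Token(word, False, []))
    return result


if __name__ == "__main__":
    for token in match_keywords(["the", "JI-I-TAI", "of", "Brown"]):
        print(token)
```

In this sketch, only the words that match a keyword are marked for checking, and the remaining words pass through without manual work. This is one conceivable way of limiting the manual effort and the delay. How the marked words are actually checked and corrected is not covered by the sketch.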