1. Field of the Invention
The present invention relates to a support device, a program and a support method. Specifically, the present invention relates to a support device, a program and a support method for supporting generation of text from speech data.
2. Description of Related Art
Recently, speech-to-text conversion has been used to improve accessibility for hearing-impaired people and elderly people. Such text is generated by use of a speech recognition device. For example, see Tatsuya Akagawa, Koji Iwano, and Sadaoki Furui, “Model construction for spoken language text-to-speech using HMM, and the influence on the synthesized speech” (“HMM wo mochiita hanashikotoba onseigousei ni okeru moderu no kouchiku to sono gouseionsei eno eikyou”), The Journal of The Acoustic Society of Japan, 2007 March, p. 201-202; Yoshiyuki Yamada, Miyajima Chiyomi, Itou Katsunobu, and Takeda Kazuya, “A spontaneous speech recognition method by adjusting phoneme lengths” (“Onsochou shinshuku ni yoru taiwaonseininshikiseinou no koujyoushuhou”), Information Processing Society of Japan, IPSJ SIG Notes Vol. 2005, No. 103(20051021), p. 1-6; and Akira Baba, “Evaluation Method of Acoustic Models for the Elderly in Speech Recognition” (“Onseininshiki no tameno koureishamuke onkyoumoderu no hyoukahou”), Technical report of Matsushita Electric Works, Ltd., Special Issue on “Analysis and Evaluation Technology for Creating Customer Value” (“kokyakukachi wo soushutsu suru kaisekihyoukagijyutsu”), 2002 November, p. 20-26.
With speech recognition devices in their current state, it is difficult to generate 100% reliable text from speech data. In other words, text generated from speech data by a current speech recognition device includes unconfirmed parts having a relatively low reliability. As a result, an operator has to correct the text by manually inputting character strings. However, such correction takes a long time.
In the process of generating text from speech, the speech recognition device carries out processing for segmenting the speech, creating multiple candidate character strings for each segmented part, and selecting a character string from among the multiple candidates. Accordingly, the operator may correct the unconfirmed part having a relatively low reliability by causing the multiple candidate character strings to be displayed and by manually selecting an appropriate character string from among these candidates. However, since the speech recognition device creates an enormous number of candidate character strings, selection of a single character string from among the candidates also requires long working hours.
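The candidate selection described above can be illustrated with a minimal sketch. It assumes that each segmented part carries a list of candidate character strings with a confidence score, and that a part is treated as unconfirmed when its best score falls below a threshold. The `Candidate` type, the `select_candidates` function, the score range, and the threshold value are all hypothetical and are not taken from any particular speech recognition device.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    confidence: float  # hypothetical reliability score in [0, 1]


def select_candidates(segments, threshold=0.8):
    """Pick the highest-confidence candidate for each segmented part.

    Returns a list of (text, confirmed) pairs, where confirmed is False
    when the chosen candidate scores below the (assumed) threshold.
    """
    results = []
    for candidates in segments:
        best = max(candidates, key=lambda c: c.confidence)
        results.append((best.text, best.confidence >= threshold))
    return results


# Two segmented parts, each with its own candidate character strings.
segments = [
    [Candidate("speech", 0.95), Candidate("speed", 0.40)],
    [Candidate("recognition", 0.55), Candidate("wreck ignition", 0.50)],
]
selected = select_candidates(segments)
```

In this sketch the second part is flagged as unconfirmed, which is the kind of part the operator would have to correct by hand or by choosing among the candidates.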
Moreover, the operator corrects the unconfirmed parts, for example, sequentially from the beginning in certain segmentation units (for example, every several characters). In this case, a support device can be employed that automatically specifies the range of speech data corresponding to the character string whose content has been confirmed by the text correction, and then automatically finds the beginning of the next speech data to be subjected to text correction. Such a support device facilitates the operation, since the operator does not need to listen to the speech data to find the beginning of the next speech data to be subjected to text correction.
In order to automatically specify the portion of the speech data where the text has been confirmed, an acoustic analysis needs to be performed on the speech data by use of a computer. However, in the present circumstances, such a method is not sufficiently accurate to specify the portion of the speech data where the text has been confirmed.
Japanese Patent Application Publications Nos. 2000-324395, 2003-46861, and 2006-227319 disclose techniques for specifying a time range of speech data. Japanese Patent Application Publication No. 2000-324395 discloses a technique for segmenting a subtitle text on which a subtitle is based, and then assigning timing information to each segmented part according to reference timing information and character information. Here, the character information includes types of characters, the number of characters, and a string of phonetic signs. Japanese Patent Application Publication No. 2003-46861 discloses a technique with which, when a key input is made while a subtitle is displayed on a monitor, the operation timing and the type of key are recorded. Japanese Patent Application Publication No. 2006-227319 discloses a technique for estimating a probability distribution of the duration lengths of components such as phonemes or syllables, and a probability distribution of the utterance rate.
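As a rough illustration of assigning timing information from character information, the following sketch divides a known time range among segmented text parts in proportion to their number of characters. This is only an assumed simplification for explanatory purposes; it is not the method disclosed in any of the publications cited above, and the function name `assign_timing` and its parameters are hypothetical.

```python
def assign_timing(parts, start, end):
    """Split the range [start, end] among text parts by character count.

    Assumption: every character takes the same amount of time, which
    real speech does not satisfy.
    """
    total = sum(len(p) for p in parts)
    if total == 0:
        raise ValueError("parts must contain at least one character")
    timings = []
    elapsed = 0  # characters consumed so far
    for p in parts:
        part_start = start + (end - start) * elapsed / total
        elapsed += len(p)
        part_end = start + (end - start) * elapsed / total
        timings.append((p, part_start, part_end))
    return timings


# 16 characters spread over 8 seconds, i.e. 0.5 seconds per character.
timings = assign_timing(["Hello", "world today"], 0.0, 8.0)
```

Because actual utterances vary in speed and contain pauses, an estimate of this kind can drift from the true boundaries in the speech data.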
However, the portion where the text has been confirmed in the speech data cannot be accurately specified even with the techniques disclosed above. Accordingly, under the present circumstances, an operator needs to listen to speech data in order to specify the portion of speech data corresponding to the character string whose text has been confirmed.