1. Field of the Invention
The present invention relates to an apparatus, a method, and a computer program product for performing a speech recognizing process and the like for input speech and outputting a result of the process.
2. Description of the Related Art
In recent years, human interfaces that use speech input have increasingly been put to practical use. For example, speech operating systems have been developed that allow the user to vocally input a specific, preset command, recognize the command, and automatically execute the corresponding operation, thereby enabling the system to be operated by speech. Systems that enable sentences to be created by speech input, by analyzing arbitrary sentences spoken by the user and converting them into character strings, have also been developed. Spoken dialogue systems that enable interaction between the user and the system in spoken language and the like have also been developed and are already in use.
In the speech recognizing processes used by the respective systems, the contents of the speech produced by the user are usually recognized by the following method. A produced speech signal is captured into a system by a microphone or the like, converted into an electrical signal, and sampled in units of a very short period of time using an analog-to-digital (A/D) converter or the like, to obtain digital data such as a time sequence of waveform amplitudes. The digital data is subjected to a technique such as fast Fourier transform (FFT) analysis to obtain, for example, changes in frequency over time, thereby extracting feature data of the produced speech signal. The feature data extracted by the above process is then compared and matched against standard patterns, for example of phonemes and sequences thereof, that are prepared as a dictionary associated with recognition results, using a hidden Markov model (HMM) method, a dynamic programming (DP) method, or a neural network (NN) method, to generate recognition candidates for the contents of the produced speech. To enhance the recognition accuracy, a statistical language model as typified by the N-gram is applied to the generated recognition candidates to estimate and select the most probable candidate, thereby recognizing the contents of the produced speech.
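As an illustration only, the matching step by the DP method can be sketched as follows. The sketch assumes that the feature data is already available as a sequence of numeric feature vectors, and it uses dynamic time warping as one common form of DP matching. The function names dtw_distance and rank_candidates, the Euclidean local distance, and the one-dimensional toy features are hypothetical choices made for this example and are not taken from any particular system.

```python
from math import inf


def dtw_distance(seq_a, seq_b):
    """Accumulated DP (dynamic time warping) distance between two
    sequences of feature vectors. Smaller means a closer match."""
    n, m = len(seq_a), len(seq_b)
    # cost[i][j]: minimal accumulated cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distance: Euclidean distance between two frames (assumption)
            local = sum((x - y) ** 2
                        for x, y in zip(seq_a[i - 1], seq_b[j - 1])) ** 0.5
            # Extend the cheapest of the three allowed predecessor alignments
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


def rank_candidates(features, templates):
    """Compare the input features with every standard pattern and return
    (distance, label) pairs sorted from best match to worst."""
    scored = [(dtw_distance(features, pattern), label)
              for label, pattern in templates.items()]
    return sorted(scored)


# Toy standard patterns with one-dimensional features (hypothetical data)
templates = {
    "yes": [[1.0], [2.0], [3.0]],
    "no": [[3.0], [2.0], [1.0]],
}
# The same contour as "yes", spoken more slowly at the start
spoken = [[1.0], [1.0], [2.0], [3.0]]
candidates = rank_candidates(spoken, templates)
```

In this sketch the ranked list plays the role of the recognition candidates. A statistical language model would then be applied to such candidates to select the most probable one, which is not shown here.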
In speech recognition, performing recognition entirely without error is quite difficult, and is considered next to impossible, due to the following factors. Segmentation of speech into sections may not be performed properly due to noise and the like in the environment where the speech is input. The waveform of the input speech can be altered by factors that vary between individuals, such as speech quality, volume, speaking rate, speaking style, and dialect, so that checking of the recognition results may not be performed accurately.
There are also cases in which recognition cannot be performed because the user speaks an unknown language that is not prepared in the system, cases in which a word is erroneously recognized as an acoustically similar word, and cases in which a word is erroneously recognized as a wrong word due to an imperfect standard pattern or statistical language model that is prepared.
When the process is continued after the erroneous recognition, an erroneous operation is usually induced. Therefore, operations for elimination of influences of the erroneous operation, restoration, re-input of the same speech, and the like are required, which imposes burdens on the user. Even when the speech is input again, there is no guarantee that the erroneous recognition is always overcome.
Meanwhile, when the recognition result is corrected before continuation of the process to avoid such a problem, keyboard manipulation and the like are usually required. Accordingly, hands-free characteristics of the speech inputting are lost and the operational burdens on the user are increased.
The system mentioned above outputs the most probable candidate as a correct recognition result. Accordingly, even when the speech recognition ends in an erroneous recognition, the system itself has no way of knowing which part of the recognition is wrong and which part is correct. Therefore, to correct the erroneous recognition part, the user must determine the erroneous recognition part and then correct it.
In connection with such a problem, JP-A 2000-242645 (KOKAI) proposes a technology of generating not only one most probable speech recognition candidate but also plural speech recognition candidates having close recognition scores, translating the generated candidates, and presenting summaries of translation histories together with the plural translation results. This allows a conversational partner to recognize the reliability of the process result and easily assume the contents of speech of the speaker, and provides sufficient and smooth communication even when the performance of the recognition process is low.
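As an illustration only, generating plural recognition candidates having close recognition scores can be sketched as follows. The cited publication is not quoted here on how closeness is determined, so the score margin, the upper limit on the number of candidates, and the function name n_best are assumptions made for this example. Scores are assumed to be higher for more probable candidates.

```python
def n_best(candidates, margin, limit=5):
    """Return the most probable candidate together with every other
    candidate whose score lies within `margin` of it, best first.

    candidates: list of (text, score) pairs, higher score = more probable.
    margin, limit: hypothetical tuning parameters for this sketch.
    """
    if not candidates:
        return []
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    top_score = ranked[0][1]
    close = [c for c in ranked if top_score - c[1] <= margin]
    return close[:limit]


# Toy recognition scores (hypothetical data)
scored = [
    ("recognize speech", 0.90),
    ("wreck a nice beach", 0.85),
    ("recognized peach", 0.40),
]
selected = n_best(scored, margin=0.1)
```

With these toy scores, the first two candidates are kept and the third is dropped because its score is far from that of the best candidate.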
However, in the method described in JP-A 2000-242645 (KOKAI), even when the recognition candidate to be selected is included in the presented plural recognition candidates, the process cannot be continued when this recognition candidate includes an erroneously recognized part, and correction or re-input is required. Therefore, as in the conventional technologies, the hands-free characteristics of speech inputting can be lost, or the burden on the user due to the correcting process can be increased.