1. Field of the Invention
The present invention relates to an apparatus, a method, and a computer program product for recognizing speech by converting speech signals into character strings.
2. Description of the Related Art
Recently, human interface technologies based on speech input have been brought into practical use. For example, there is a speech-based operation system that enables a user to operate the system by vocalizing one of a set of predetermined commands. The system recognizes the speech command and performs a corresponding operation. Another example is a system that analyzes any sentence vocalized by the user and converts the sentence into a character string, thereby producing a document from a speech input.
Technologies for speech-based interaction between a robot and a user are also being actively studied and developed. Researchers are trying to enable the user to instruct the robot to perform a certain action, or to access various kinds of information via the robot, based on the speech input.
Such systems use a speech recognition technology that converts speech signals into digital data and compares the data with predetermined patterns.
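As an illustration only, the comparison with predetermined patterns can be sketched as nearest-template matching. The sketch below assumes that the digital data is a one-dimensional sequence of feature values and that dynamic time warping (DTW) is used as the distance measure; neither assumption comes from the systems described above, and the names dtw_distance, recognize, and templates are hypothetical.

```python
from typing import Dict, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping cost between two 1-D feature sequences.

    DTW is an assumed distance measure chosen for illustration; it
    tolerates differences in speaking speed between the two sequences.
    """
    inf = float("inf")
    # cost[i][j] is the best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b stays on the same frame
                cost[i][j - 1],      # b advances, a stays on the same frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[len(a)][len(b)]


def recognize(features: Sequence[float],
              templates: Dict[str, Sequence[float]]) -> str:
    """Return the label of the stored pattern closest to the input."""
    return min(templates,
               key=lambda label: dtw_distance(features, templates[label]))


# Hypothetical stored patterns for two commands.
templates = {"start": [1.0, 3.0, 5.0, 3.0], "stop": [5.0, 4.0, 1.0, 1.0]}
# A slower utterance of "start": some frames are repeated.
spoken = [1.0, 1.0, 3.0, 5.0, 5.0, 3.0]
result = recognize(spoken, templates)
```

In this sketch the slower utterance still matches the "start" pattern because DTW allows one stored frame to align with several input frames. Practical recognizers generally use multidimensional features and statistical models rather than raw templates.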
With speech recognition technologies, speech is liable to be incorrectly recognized because of environmental noise, the quality and volume of the user's voice, the speed of the speech, and the like. It is difficult to correctly recognize dialects unless the spoken word is included in a word dictionary in the system. Furthermore, incorrect recognition can be caused by insufficient speech data and text corpora used to create the features, probabilities, and the like that are included in standard patterns, word networks, language models, and the like. Incorrect recognition can also be caused by the deletion of correct words when the number of candidates is restricted to reduce the computing load, and by incorrect pronunciation or rewording by the user.
Because incorrect recognition can be caused by various factors, the user needs some means of changing the incorrect portions into correct character strings. One of the simplest and most reliable approaches is to use a keyboard, a pen device, or the like; however, use of such devices negates the hands-free operation that is an advantage of speech input. Moreover, if the user is able to use such devices, speech input is not required in the first place.
Another approach is for the user to correct the incorrect portions by vocalizing the sentence again; however, it is difficult to prevent recurrence of the incorrect recognition merely by repeating the same sentence, and it is stressful for the user to repeat a long sentence.
To solve this problem, JP-A H11-338493 (KOKAI) and JP-A 2003-316386 (KOKAI) disclose technologies for correcting an error by vocalizing only the part of the speech that was incorrectly recognized. According to these technologies, the time-series features of a first speech are compared with the time-series features of a second speech that is spoken later for correction, and a portion of the first speech that is similar to the second speech is detected as an incorrect portion. The character string corresponding to the incorrect portion of the first speech is deleted from the candidates for the second speech, and the most probable remaining character string is selected for the second speech, thereby achieving more reliable recognition.
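The flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method of the cited publications: the features are assumed to be one-dimensional values per frame, the similar portion is located with a fixed-length sliding window and squared Euclidean distance (a simplification chosen for brevity), and the names find_similar_span and select_correction, the frame ranges, and the candidate scores are all hypothetical.

```python
from typing import List, Sequence, Tuple


def find_similar_span(first: Sequence[float],
                      second: Sequence[float]) -> Tuple[int, int]:
    """Return (start, end) frames of the window in `first` closest to `second`.

    Assumption: a fixed-length window with squared Euclidean distance.
    Real speech would need an alignment that tolerates tempo changes.
    """
    n, m = len(first), len(second)
    if m == 0 or m > n:
        raise ValueError("second must be non-empty and no longer than first")
    best_start, best_cost = 0, float("inf")
    for start in range(n - m + 1):
        cost = sum((first[start + i] - second[i]) ** 2 for i in range(m))
        if cost < best_cost:
            best_start, best_cost = start, cost
    return best_start, best_start + m


def select_correction(first_words: List[Tuple[str, int, int]],
                      span: Tuple[int, int],
                      second_candidates: List[Tuple[str, float]]) -> str:
    """Pick the most probable candidate for the second speech.

    Words of the first result whose frame range (start, end) overlaps
    the detected span are treated as incorrect and removed from the
    candidates before the best remaining one is chosen.
    """
    start, end = span
    rejected = {w for w, s, e in first_words if s < end and e > start}
    allowed = [(w, p) for w, p in second_candidates if w not in rejected]
    if not allowed:
        raise ValueError("no candidates remain after exclusion")
    return max(allowed, key=lambda c: c[1])[0]


# Hypothetical data: per-frame features and word frame ranges.
first_features = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1]
first_words = [("I", 0, 2), ("sea", 2, 5), ("it", 5, 6)]
second_features = [0.88, 0.82, 0.69]          # user repeats one word
second_candidates = [("sea", 0.6), ("see", 0.3), ("she", 0.1)]

span = find_similar_span(first_features, second_features)
corrected = select_correction(first_words, span, second_candidates)
```

Here the repeated word is matched to frames 2 to 5 of the first speech, so the previously output "sea" is excluded and the next most probable candidate is selected, even though "sea" still has the highest score for the second speech.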
However, the technologies disclosed in JP-A H11-338493 (KOKAI) and JP-A 2003-316386 (KOKAI) are disadvantageous in that the incorrect recognition is likely to recur when there are homophones or similarly pronounced words.
For example, in the Japanese language, a single pronunciation often has many homophones. Furthermore, there are often many words that are pronounced similarly.
When there are many homophones and similarly pronounced words, the speech recognition technologies cannot select the suitable word from among them, and thus the word recognition is not very accurate.
For this reason, with the technologies disclosed in JP-A H11-338493 (KOKAI) and JP-A 2003-316386 (KOKAI), the user needs to repeat the same utterance until the correct result is output, which increases the burden of the correction process.