As is well known in the art, speech recognition application systems can generally be classified into those employing an isolated word speech recognizer and those employing a connected speech recognizer.
Of these, an application system employing an isolated word speech recognizer typically uses a small, simple, command-oriented recognizer. Its range of application is therefore limited, and speech recognition errors appear in a relatively simple form, so there is generally no difficulty in handling them.
In contrast, an application system employing a connected speech recognizer is used in various applications and covers a very wide range of recognition targets, which often leads to dissatisfaction with the performance of the system. This is basically due to limitations in speech recognition technology.
In particular, language modeling technologies applied to connected speech recognition can be divided into rule-based language models, such as a finite state network (FSN) or a context-free grammar (CFG), and N-Gram-based language models. The rule-based language model has the disadvantage that it is applicable only to a relatively limited number of fields, because it is difficult to model the various utterances of a user in detail, whereas the N-Gram-based language model has a wider application range because it is capable of modeling such various utterances.
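By way of illustration only, the difference between the two kinds of language model may be sketched as follows. The sketch assumes a hypothetical three-sentence command corpus, a word-level FSN written by hand, and a bigram (N = 2) model with add-one smoothing; none of these names, data, or choices are taken from any particular recognizer, and an actual system would use a far larger corpus and a more refined smoothing method.

```python
from collections import Counter

# Hypothetical toy corpus; a real system would train on a large domain corpus.
CORPUS = [
    "turn on the light",
    "turn off the light",
    "turn on the radio",
]

# Rule-based model: a hand-written word-level finite state network (FSN).
# Each word maps to the set of words allowed to follow it.
FSN = {
    "<s>": {"turn"},
    "turn": {"on", "off"},
    "on": {"the"},
    "off": {"the"},
    "the": {"light", "radio"},
    "light": {"</s>"},
    "radio": {"</s>"},
}


def fsn_accepts(sentence):
    """Return True only if every word transition is permitted by the FSN."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return all(nxt in FSN.get(cur, set()) for cur, nxt in zip(tokens, tokens[1:]))


def train_bigram(corpus):
    """Count word histories and bigrams over the corpus."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for line in corpus:
        tokens = ["<s>"] + line.split() + ["</s>"]
        vocab.update(tokens)
        unigrams.update(tokens[:-1])          # counts of each word as a history
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams, vocab


def bigram_prob(sentence, unigrams, bigrams, vocab):
    """Sentence probability under a bigram model with add-one smoothing."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    v = len(vocab) + 1                        # one extra slot for unseen words
    prob = 1.0
    for cur, nxt in zip(tokens, tokens[1:]):
        prob *= (bigrams[(cur, nxt)] + 1) / (unigrams[cur] + v)
    return prob


if __name__ == "__main__":
    model = train_bigram(CORPUS)
    for utterance in ("turn on the light", "please turn on the light"):
        print(utterance, fsn_accepts(utterance), bigram_prob(utterance, *model))
```

In this sketch the FSN rejects outright any utterance that departs from its rules, such as one beginning with an unlisted word, whereas the smoothed bigram model still assigns such an utterance a small non-zero probability. This reflects the wider coverage of the N-Gram-based model, although the quality of its estimates depends entirely on the corpus from which the counts are taken.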
However, although a speech recognition result similar to the utterance intended by the user can generally be produced if the corpus used for training is sufficiently large and is optimized for the domain, it is almost impossible in practice to construct such an optimized corpus for each application field. Thus, it is commonly accepted that, in reality, the result of speech recognition employing the N-Gram language model is imperfect.
If an error resulting from this imperfection occurs, the speech recognition application system may have great difficulty in handling it. For instance, if even a single word is misrecognized, an automatic interpretation system, which is one kind of application system, may produce a translation result quite different from the user's intention. In the case of a speech dialogue system, which is another kind of application system, if the dialogue manager accepts the erroneous result as it is, a handling error may occur or the handling itself may become impossible.
When such an error occurs, the user has to repeat the utterance until a correct recognition result is produced, which leads to a significant decrease in user satisfaction. Even worse, even when a correct recognition result is produced, an error may occur in the processing of the system if no advance preparation has been made for that particular utterance; preparations for these situations are therefore required. At present, however, there exists no countermeasure technology that can solve this problem.
As described above, what is important for a speech recognition application system is to produce a corrected speech recognition result reflecting the user's intention, rather than to recognize the user's speech exactly as uttered. Accordingly, the development of a technique for producing a speech recognition result reflecting a user's intention is urgently needed.