1. Field of the Invention
The present invention relates to a speech understanding apparatus.
2. Description of Related Art
Speech is a promising means of accessing information that is growing explosively in quantity and complexity, and speech dialogue systems have been developed and deployed to provide such access. Since a speech dialogue system produces a response based on semantic expressions obtained from user utterances, a speech understanding unit that converts utterances into semantic expressions is vital. Speech understanding includes two processes: speech recognition, which converts speech into word strings, and language understanding, which converts the word strings into semantic expressions. Speech recognition requires a sound model and a language model; however, the sound model does not depend on the task domain of the speech dialogue system. Accordingly, a language model and a language understanding model may be considered necessary for each domain.
When only a single speech understanding scheme, based on a single language model and a single language understanding model, is employed, it is difficult to achieve highly accurate speech understanding across different utterances. This is because the appropriate combination of language model and language understanding model differs from utterance to utterance. For example, if a grammar model is used as the language model for speech recognition, highly accurate recognition is possible for utterances covered by the grammar. However, the grammar model is weak for utterances other than the assumed ones. An N-gram model has an advantage over the grammar-based language model in that it imposes only local constraints and can therefore recover easily when unregistered words or misrecognitions occur. However, since the N-gram model cannot express constraints over an entire sentence, its performance on the assumed utterances is generally lower than that of the grammar-based language model. Similarly, each language understanding model has its own advantages and disadvantages; thus, to increase the number of utterances that can be properly understood, a combination of multiple language models and multiple language understanding models is considered effective.
The use of multiple speech understanding schemes generates multiple understanding results, and thus it is necessary to obtain a final understanding result from them. In many cases, a majority voting method such as the ROVER (Recognizer Output Voting Error Reduction) method has conventionally been used (see, for example, Jonathan G. Fiscus, “A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER),” Proc. ASRU, pp. 347-354, 1997).
The above ROVER method obtains a final result by performing weighted majority voting over multiple speech recognition results or multiple understanding results. However, with such majority voting, if a scheme with high speech understanding capability is mixed with schemes with low speech understanding capability, the result of the more capable scheme may not be sufficiently reflected in the final result. For example, if a majority of the speech understanding results are incorrect and only a minority are correct, the correct speech understanding result is less likely to be obtained.
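By way of illustration only, the limitation described above can be sketched with a simplified weighted vote taken over whole understanding results. This sketch is not the ROVER method itself, which aligns and votes on recognition hypotheses word by word; the function name, the scheme names, the semantic expressions, and the weight value are all hypothetical and are introduced solely for this example.

```python
from collections import defaultdict


def weighted_majority_vote(results, weights=None):
    """Return the understanding result with the largest total weight.

    results: list of (scheme_name, semantic_expression) pairs.
    weights: optional dict mapping scheme_name to a voting weight;
             schemes not listed vote with weight 1.0.
    """
    weights = weights or {}
    totals = defaultdict(float)
    for scheme, expression in results:
        totals[expression] += weights.get(scheme, 1.0)
    # On a tie, the expression that was seen first is returned.
    return max(totals, key=totals.get)


# Hypothetical outputs of three speech understanding schemes for one
# utterance. Assume for this example that only the first is correct.
results = [
    ("grammar", "weather(city=tokyo)"),
    ("ngram_a", "weather(city=kyoto)"),
    ("ngram_b", "weather(city=kyoto)"),
]

# Equal weights: the two incorrect schemes outvote the correct one.
unweighted = weighted_majority_vote(results)

# The correct result is selected only if the capable scheme is given a
# weight that exceeds the combined weight of the others (value assumed).
reweighted = weighted_majority_vote(results, {"grammar": 2.5})

print(unweighted)
print(reweighted)
```

With equal weights the vote selects the result shared by the two less capable schemes, so the correct minority result is discarded. The correct result is recovered only when the weight of the capable scheme happens to outweigh the others, and a suitable weight generally differs from utterance to utterance.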