1. Field of the Invention
The present invention relates to a speech recognition apparatus, a speech recognition method, a conversation control apparatus, a conversation control method, and programs therefor. More specifically, the present invention relates to a speech recognition apparatus, a speech recognition method, a conversation control apparatus, a conversation control method, and programs therefor that are capable of preferentially selecting candidates that conform to or relate to past topics of conversation by utilizing past conversation histories or the like.
2. Description of the Prior Art
As a conventional method for recognizing a specific vocabulary in continuous speech recognition, word spotting, which extracts recognition candidate words set in advance from continuous conversation speech, has been devised. It has been confirmed that, with this method, the words can be extracted efficiently if the number of words to be set is small. However, it is known that the accuracy of the extraction falls as the number of words to be set increases. In addition, since a word other than the set words cannot be recognized with this method, the method cannot be used for an application that requires continuous speech recognition for a vocabulary. Therefore, there is a need for a method of recognizing mainly a large quantity of designated words in a framework of large vocabulary continuous speech recognition.
Speech recognition is a matter of estimating what a speaker has spoken from an observed speech signal. If the speaker has uttered a certain word w and a characteristic parameter x has been obtained by characteristic extraction, then, on the basis of the theory of pattern recognition, it suffices to find the w that maximizes the posterior probability p(w|x). Usually, since it is difficult to find the posterior probability p(w|x) directly, the w maximizing p(x|w)p(w) is calculated instead on the basis of Bayes' theorem (p(w|x)=p(x|w)p(w)/p(x)); the two maximizations are equivalent because p(x) does not depend on w. The term p(x|w) is calculated according to an acoustic model with a phoneme or the like as a unit, as a probability of occurrence of the characteristic parameter that is obtained in advance by learning from data. The term p(w) is calculated according to a language model with a word or the like as a unit.
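The decision rule described above can be sketched as follows. This is a minimal illustration only: the candidate words, the acoustic log-likelihood values, and the prior values are hypothetical, and the function name `decode` is an assumption introduced for this example. Logarithms are used so that the product p(x|w)p(w) becomes a sum.

```python
import math


def decode(acoustic_loglik, prior):
    """Return the word w that maximizes log p(x|w) + log p(w).

    acoustic_loglik: dict mapping each candidate word to log p(x|w)
                     (hypothetical acoustic-model scores).
    prior:           dict mapping each candidate word to p(w)
                     (hypothetical language-model probabilities).
    """
    return max(
        acoustic_loglik,
        key=lambda w: acoustic_loglik[w] + math.log(prior[w]),
    )


# Hypothetical scores for a single observed characteristic parameter x.
acoustic_loglik = {"right": -4.0, "write": -3.9, "rite": -4.1}
prior = {"right": 0.6, "write": 0.3, "rite": 0.1}

# Using the acoustic score alone versus the full product p(x|w)p(w).
acoustic_only = max(acoustic_loglik, key=acoustic_loglik.get)
best = decode(acoustic_loglik, prior)
```

In this sketch the acoustic score alone slightly favors one candidate, while the language-model term p(w) shifts the decision to another, which is the role the language model plays in the formulation above.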
As a framework for large vocabulary continuous speech recognition, it has been confirmed that a method of calculating and comparing likelihoods for an inputted speech signal using a phoneme Hidden Markov Model and a statistical language model is effective. As the statistical language model, it is general practice to find a chain probability between two words or among three words from a large quantity of text data prepared in advance, and to use the chain probability at the time of speech recognition.
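A chain probability between two words can be estimated, for example, by relative frequency counts over the text data. The following is a minimal sketch under stated assumptions: the three-sentence corpus is hypothetical, the function names and the sentence boundary markers `<s>` and `</s>` are conventions introduced for this example, and no smoothing is applied, so an unseen word pair receives probability zero.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and word pairs over whitespace-tokenized sentences."""
    context_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        context_counts.update(tokens[:-1])            # every token that has a successor
        pair_counts.update(zip(tokens[:-1], tokens[1:]))
    return context_counts, pair_counts


def bigram_prob(context_counts, pair_counts, prev, word):
    """Relative-frequency estimate of p(word | prev); 0.0 if prev is unseen."""
    if context_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, word)] / context_counts[prev]


def sentence_prob(context_counts, pair_counts, sentence):
    """Probability of a sentence as a product of chain probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        prob *= bigram_prob(context_counts, pair_counts, prev, word)
    return prob


# Hypothetical text data prepared in advance.
corpus = [
    "the director shot the film",
    "the director cut the film",
    "the actor saw the film",
]
context_counts, pair_counts = train_bigram(corpus)
p_director = bigram_prob(context_counts, pair_counts, "the", "director")
p_sentence = sentence_prob(context_counts, pair_counts, "the director shot the film")
```

A chain probability among three words would be estimated in the same way, with a pair of preceding words as the context. Practical systems typically add smoothing so that unseen word sequences are not assigned zero probability.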
In general, in the speech recognition system as described above, in order to narrow down a large number of utterance candidates that are generated because speech cannot be recognized definitely, a “language model obtained by modeling an association among words” is used to extract an utterance candidate with a high recognition rate as an optimal utterance candidate. As such a language model, a statistical language model, which is established utilizing a corpus (language/speech database), is disclosed in Japanese Patent Application Laid-Open No. 2002-366190, and a language model, which takes into account grammatical restrictions such as word pair restrictions, is disclosed in Japanese Patent Application Laid-Open No. 11-85180.
Such language models are referred to as “conventional language models”. A language model that associates words utilizing a “conversation history” for the narrowing-down has not been proposed.
However, the speech recognition system using the conventional language models has a problem in that the recognition rate falls when short speech such as “chat” is inputted repeatedly or when an abbreviated sentence is used.
For example, consider a case in which speech uttered by a speaker on the topic of film shooting is subjected to speech recognition. When a user utters “kantoku” (director), the speech recognition system outputs plural utterance candidates, namely, (1) “kantaku”, (2) “kataku”, and (3) “kantoku”, from a speech signal generated by this utterance, and selects (1) “kantaku”, which has the highest recognition rate. Thus, even though a candidate identical to the speech of the user (in this case, “kantoku”) is included among the utterance candidates, the speech recognition system cannot select that word as the optimal utterance candidate.
Therefore, it is considered necessary to establish a mechanism (a speech recognition system taking into account a conversation history) that utilizes a past conversation history or the like to select a candidate as an appropriate word even if the recognition rate of the candidate is judged low, and that thereby increases the speech recognition rate.
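The problem and the kind of mechanism called for can be illustrated with a small sketch. Everything numerical here is hypothetical: the recognition rates assigned to the three candidates, the set `topic_words` standing in for words related to past topics of conversation, and the additive `bonus` are assumptions made for illustration only and do not represent any particular implementation.

```python
def select_baseline(candidates):
    """Select the candidate with the highest recognition rate."""
    return max(candidates, key=candidates.get)


def select_with_history(candidates, topic_words, bonus=0.3):
    """Select a candidate after favoring words related to past topics.

    topic_words: set of words assumed to relate to the conversation history.
    bonus:       hypothetical amount added to the score of such words.
    """
    def score(word):
        return candidates[word] + (bonus if word in topic_words else 0.0)

    return max(candidates, key=score)


# Hypothetical recognition rates for the utterance "kantoku".
candidates = {"kantaku": 0.5, "kataku": 0.2, "kantoku": 0.4}

# Hypothetical words related to the past topic of film shooting.
topic_words = {"eiga", "satsuei", "kantoku"}

baseline_choice = select_baseline(candidates)
history_choice = select_with_history(candidates, topic_words)
```

With these assumed values, selection by recognition rate alone returns “kantaku”, whereas favoring words related to the past topic returns “kantoku”, even though its recognition rate was judged lower.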