In conventional voice recognition apparatuses, recognition of a voice uttered by a user is performed by referring to a dictionary in which words to be recognized are registered.
Therefore, in the voice recognition apparatus, only words which are registered in the dictionary (hereinafter, such words will be referred to simply as registered words) can be recognized, and words which are not registered in the dictionary cannot be recognized. Herein, words which are not registered in the dictionary are referred to as unregistered words. In the conventional voice recognition apparatus, if an utterance made by a user includes an unregistered word, the unregistered word is recognized as one of words (registered words) registered in the dictionary, and thus the result of recognition of the unregistered word becomes wrong. If an unregistered word is recognized incorrectly, the incorrect recognition can influence recognition of a word prior to or subsequent to the unregistered word, that is, can cause such a word to be recognized incorrectly.
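The forced-match behavior described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that an utterance has already been reduced to a phoneme sequence and that the recognizer selects the registered word with the smallest edit distance. The dictionary contents and the function names `edit_distance` and `recognize` are hypothetical and do not come from any of the apparatuses discussed herein.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


# Hypothetical dictionary of registered words and their phoneme sequences.
DICTIONARY = {
    "red": ["r", "e", "d"],
    "bed": ["b", "e", "d"],
    "ball": ["b", "o", "l"],
}


def recognize(phonemes):
    """Return the closest registered word; an unregistered word is never reported."""
    return min(DICTIONARY, key=lambda w: edit_distance(phonemes, DICTIONARY[w]))


# "rod" is not registered, so it is reported as a registered word instead.
result = recognize(["r", "o", "d"])
```

Because `recognize` can only return keys of `DICTIONARY`, any unregistered input is necessarily mapped to some registered word, which is the source of the recognition error described above.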
Therefore, it is required to properly deal with unregistered words so as to avoid the above problem. To this end, various techniques have been proposed.
For example, Japanese Unexamined Patent Application Publication No. 9-81181 discloses a voice recognition apparatus in which a garbage model for detecting an unregistered word and an HMM (Hidden Markov Model) associated with phonemes such as vowels are simultaneously used so as to limit phoneme sequences associated with the unregistered word thereby making it possible to detect the unregistered word without needing complicated calculations.
As another example, Japanese Patent Application No. 11-245461 discloses an information processing apparatus in which when a word set including an unregistered word is given, the similarity between the unregistered word which is not included in a database and a word included in the database is calculated on the basis of the concepts of words, and a sequence of properly arranged words is produced and output.
As still another example, “Dictionary Learning: Performance Through Consistency” (Tilo Sloboda, Proceedings of ICASSP 95, vol. 1, pp. 453–456, 1995) discloses a technique in which phoneme sequences corresponding to voice periods of words are detected and phoneme sequences which are acoustically similar to each other are deleted using a confusion matrix thereby effectively constructing a dictionary including variants.
As still another example, “Estimation of Transcription of Unknown Word from Speech Samples in Word Recognition” (Katsunobu Ito, et al., The Transactions of the Institute of Electronics, Information, and Communication Engineers, Vol. J83-D-II, No. 11, pp. 2152–2159, November, 2000) discloses a technique of improving estimation accuracy of a phoneme sequence when the phoneme sequence is estimated from a plurality of speech samples and an unknown (unregistered) word is registered in a dictionary.
One typical method for dealing with an unregistered word is, when an unregistered word is detected in an input voice, to register the unregistered word into a dictionary and treat it as a registered word thereafter.
In order to register an unregistered word into a dictionary, it is required to first detect a voice period of that unregistered word and then recognize the phoneme sequence of the voice in the voice period. The recognition of the phoneme sequence of a voice can be accomplished, for example, by a method known as a phoneme typewriter. In the phoneme typewriter, a phoneme sequence corresponding to an input voice is basically output using a garbage model which accepts any phonemic change.
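As a rough illustration of the phoneme typewriter idea, the following sketch assumes that per-frame phoneme scores are already available and that every phoneme-to-phoneme transition, including remaining in the same phoneme, is equally likely. Under that assumption the best path reduces to picking the best-scoring phoneme in each frame and merging consecutive repeats. An actual phoneme typewriter would typically decode with HMMs and would not be this simple. The function name `phoneme_typewriter` and the score format are hypothetical.

```python
def phoneme_typewriter(frame_scores):
    """Convert per-frame phoneme scores into a phoneme sequence.

    frame_scores: list of dicts mapping phoneme -> score, one dict per frame.
    Assumes all transitions are equally likely (a simplification), so the
    best path is the per-frame best phoneme with consecutive repeats merged.
    """
    best_per_frame = [max(scores, key=scores.get) for scores in frame_scores]
    sequence = []
    for phoneme in best_per_frame:
        if not sequence or sequence[-1] != phoneme:
            sequence.append(phoneme)
    return sequence


# Hypothetical scores for three frames of an input voice.
frames = [
    {"a": 0.9, "k": 0.1},
    {"a": 0.8, "k": 0.2},
    {"a": 0.1, "k": 0.9},
]
sequence = phoneme_typewriter(frames)
```

Since no dictionary constrains the output, any phoneme may follow any other, which is what allows the phoneme sequence of an unregistered word to be obtained.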
When an unregistered word is registered into a dictionary, it is required to cluster the phoneme sequence of the unregistered word. That is, in the dictionary, the phoneme sequence of each word is registered in the form of a cluster corresponding to the word, and thus, to register an unregistered word into the dictionary, it is required to cluster the phoneme sequence of the unregistered word.
One method of clustering the phoneme sequence of an unregistered word is to have the user input an entry (for example, a pronunciation of the unregistered word) indicating the unregistered word and then cluster the phoneme sequence of the unregistered word into the cluster indicated by that entry. However, in this method, the user has to perform the troublesome task of inputting the entry.
Another method is to produce a new cluster each time an unregistered word is detected such that the phoneme sequence of the unregistered word is clustered into the newly produced cluster. However, in this method, an entry corresponding to the new cluster is registered into a dictionary each time an unregistered word is detected, and thus the size of the dictionary increases as unregistered words are registered. As a result, voice recognition performed thereafter requires more time and a greater amount of processing.
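The growth of the dictionary under this second method can be sketched as follows. The class `ClusterDictionary` and its method `register_unregistered` are hypothetical names introduced only for illustration, and the sketch assumes that each detected unregistered word arrives as a phoneme sequence.

```python
class ClusterDictionary:
    """Toy dictionary in which each entry is a cluster of phoneme sequences."""

    def __init__(self):
        self.clusters = {}   # cluster id -> list of phoneme sequences
        self._next_id = 0

    def register_unregistered(self, phonemes):
        """Always create a new cluster for a detected unregistered word."""
        cluster_id = f"cluster_{self._next_id}"
        self._next_id += 1
        self.clusters[cluster_id] = [tuple(phonemes)]
        return cluster_id


dictionary = ClusterDictionary()
# The same unregistered word uttered three times yields three separate entries.
for _ in range(3):
    dictionary.register_unregistered(["r", "o", "d"])
size = len(dictionary.clusters)
```

Every detection adds an entry even when the phoneme sequence matches an earlier one, so the number of entries to be searched in later recognition grows with the number of detections.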