1. Field of the Invention
The present invention relates to a speech recognizing apparatus, a speech recognizing method, and a speech recognizing program suitable for rejecting environmental noise that is not part of the recognizing target vocabulary.
2. Description of the Related Art
In recent years, improvements in the performance of speech recognizing technology have led to the widespread practical use of speech recognizing engines in real environments. Demand for speech recognition is growing in particular where input devices are limited, as in car navigation systems and mobile devices. In such environments, one of the most strongly desired functions for speech recognition is a hands-free function, in which speech is captured continuously and processing shifts to a predetermined operation only when a previously registered vocabulary is input.
For example, in a car navigation system operating in a real environment, various environmental noises, such as noise generated by the running vehicle, klaxons, and noise generated by other running vehicles, are input to the speech recognizing engine while speech is being captured continuously. The speech recognizing engine therefore requires a function for correctly recognizing the user's speech while rejecting non-speech input such as these environmental noises.
A conventional speech recognizing apparatus compares each recognizing target vocabulary, formed based on phoneme models, with the time series of the amount of characteristics (feature values) extracted from the input speech. The degree to which each recognizing target vocabulary coincides with that time series is hereinafter referred to as a likelihood, and the vocabulary having the highest likelihood is output as the speech recognition result. The likelihood obtained when environmental noise is input is relatively low compared with the likelihood obtained when a recognizing target vocabulary is input, so non-speech can be rejected by setting a predetermined threshold. However, when the real environment differs from the environment under which the recognizing target vocabulary was formed, the likelihood of an input recognizing target vocabulary may also become low, and the recognizing target vocabulary itself may be rejected.
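The threshold-based rejection described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of any particular engine: the function name `recognize_with_threshold`, the use of log-likelihood scores, and every numeric value are hypothetical and chosen for the example.

```python
from typing import Dict, Optional


def recognize_with_threshold(scores: Dict[str, float],
                             threshold: float) -> Optional[str]:
    """Return the best-scoring word, or None if the input is rejected.

    `scores` maps each recognizing target vocabulary to its likelihood
    (assumed here to be a log-likelihood, so higher means a better match).
    """
    if not scores:
        return None
    best_word = max(scores, key=scores.get)
    # Reject when even the best match falls below the fixed threshold.
    return best_word if scores[best_word] >= threshold else None


# Illustrative values only (hypothetical, not measured data).
THRESHOLD = -50.0
clean_speech = {"home": -42.0, "map": -55.0}       # word spoken in a matched environment
noise_only = {"home": -80.0, "map": -95.0}         # environmental noise
mismatched_speech = {"home": -58.0, "map": -70.0}  # same word, different environment

accepted = recognize_with_threshold(clean_speech, THRESHOLD)
rejected_noise = recognize_with_threshold(noise_only, THRESHOLD)
rejected_word = recognize_with_threshold(mismatched_speech, THRESHOLD)
```

With these illustrative values the noise input is rejected as intended, but the registered word spoken in the mismatched environment is rejected as well, which is the weakness of a fixed threshold noted above.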
To address this, a method for rejecting the input of a non-registered vocabulary to the speech recognizing engine is sometimes used, as disclosed in “Rejection of unknown speech by correcting likelihood using syllable recognition” published in the Institute of Electronics, Information and Communication Engineers (IEICE) Transactions D-II, Vol. J75-D-II, No. 12, pages 2002–2009 (hereinafter, referred to as a first literature).
According to the method disclosed in the first literature, one likelihood is calculated by comparing the input speech against the recognizing target vocabulary. In addition, an optimal phoneme series is obtained by using all of the phoneme models, which are stored in advance as recognizing units, and a second likelihood is obtained for that series. The likelihood obtained by comparing the input speech against the recognizing target vocabulary differs greatly between a recognizing target vocabulary (registered vocabulary) and a non-registered vocabulary, whereas the likelihood of the optimal phoneme series varies little. Even when the real environment differs from the environment under which the recognizing target vocabulary and the phoneme models were formed, the influence of the environment on the input speech appears in both the likelihood of the optimal phoneme series and the likelihood of the recognizing target vocabulary. Therefore, the value obtained by subtracting the likelihood of the recognizing target vocabulary from the likelihood of the optimal phoneme series is largely unaffected by differences in the environment. Non-registered words are detected based on this difference between the likelihoods, so that rejection is performed accurately.
However, although this method works for unknown input speech, for an input sound that is not covered by the phoneme models, such as a klaxon, both the likelihood of the optimal phoneme series and the likelihood of the recognizing target vocabulary are extremely low. The difference between the two likelihoods can then happen to be relatively small, in which case the input cannot be rejected by a threshold determination.
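The likelihood-difference test of the first literature, together with the klaxon failure case just described, can be sketched as follows. This is a hedged illustration rather than the method as published: the function name `is_rejected_by_difference`, the parameter `max_gap`, the assumption that likelihoods are log-likelihoods, and all numeric values are hypothetical.

```python
def is_rejected_by_difference(vocab_loglik: float,
                              phoneme_loglik: float,
                              max_gap: float) -> bool:
    """Reject when the optimal phoneme series fits much better than the
    best recognizing target vocabulary.

    gap = (likelihood of optimal phoneme series)
          - (likelihood of recognizing target vocabulary)
    A large gap is assumed to indicate a non-registered word.
    """
    gap = phoneme_loglik - vocab_loglik
    return gap > max_gap


MAX_GAP = 10.0  # hypothetical threshold on the likelihood difference

# Illustrative values only (not measured data).
registered = is_rejected_by_difference(-42.0, -40.0, MAX_GAP)        # gap 2
non_registered = is_rejected_by_difference(-90.0, -45.0, MAX_GAP)    # gap 45
# Same registered word in a mismatched environment: both likelihoods drop
# by a similar amount, so the gap is unchanged.
registered_shifted = is_rejected_by_difference(-62.0, -60.0, MAX_GAP)
# Klaxon: both likelihoods are extremely low and the gap happens to be small.
klaxon = is_rejected_by_difference(-200.0, -195.0, MAX_GAP)          # gap 5
```

With these illustrative values the non-registered word is rejected and the registered word is accepted in both environments, but the klaxon is not rejected, because only the difference is examined and not the absolute level of either likelihood.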
Another method for rejecting the input of a non-registered vocabulary to the speech recognizing engine is disclosed in Japanese Unexamined Patent Application Publication No. 11-288295 (hereinafter, referred to as a second literature). In this proposal, words serving as the recognizing target vocabulary are stored in advance, and words that tend to be erroneously recognized when noise is input are also stored, as a recognizing target vocabulary including the environmental noise.
According to the method disclosed in the second literature, the input speech is compared against the stored recognizing target vocabularies. When the word having the maximum likelihood belongs to the recognizing target vocabulary, the recognition result is output. Conversely, when the word having the maximum likelihood belongs to the recognizing target vocabulary including the environmental noise, the input speech is determined to be noise and is rejected.
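The decision rule of the second literature can be sketched as follows. This is a minimal sketch under assumptions, not the implementation in that publication: the function name `recognize_with_noise_words`, the representation of the two vocabularies as a score dictionary plus a set of noise entries, and the example words and values are all hypothetical.

```python
from typing import Dict, Optional, Set


def recognize_with_noise_words(scores: Dict[str, float],
                               noise_words: Set[str]) -> Optional[str]:
    """Return the recognized word, or None if the input is judged to be noise.

    `scores` holds the likelihood of every stored entry, covering both the
    ordinary recognizing target vocabulary and the entries stored as the
    vocabulary including the environmental noise (`noise_words`).
    """
    if not scores:
        return None
    best_word = max(scores, key=scores.get)
    # If the best match is one of the noise entries, reject the input.
    return None if best_word in noise_words else best_word


# Illustrative entries and values only (hypothetical).
NOISE_WORDS = {"noise_a", "noise_b"}
speech_scores = {"home": -42.0, "map": -55.0, "noise_a": -70.0, "noise_b": -75.0}
noise_scores = {"home": -88.0, "map": -90.0, "noise_a": -60.0, "noise_b": -72.0}

recognized = recognize_with_noise_words(speech_scores, NOISE_WORDS)
rejected = recognize_with_noise_words(noise_scores, NOISE_WORDS)
```

The rule depends entirely on which entries are placed in `noise_words`: an input is rejected only if one of those entries scores higher than every ordinary vocabulary entry.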
However, the proposal of the second literature requires that the words that tend to be erroneously recognized when noise is input be stored as the recognizing target vocabulary including the environmental noise. When the environment in which speech recognition is used is unspecified, it is practically impossible to prepare such words for every noise environment.
Thus, the speech recognizing apparatuses according to the first and second literatures have a problem in that sufficient rejecting performance is not obtained for an input that consists only of environmental noise and contains no speech.