1. Field of Invention
The present invention relates to a similar word discrimination method and a similar word discrimination apparatus that discriminate between words having similar pronunciation. In particular, it relates to a similar word discrimination method and apparatus for voice recognition technology that uses the dynamic recurrent neural network (DRNN) word model, which is one of the voice recognition technologies for unspecified speakers.
2. Description of Related Art
Voice recognition technology for unspecified speakers exists that uses the DRNN voice model. Applicants have described voice recognition accomplished using DRNN in Japanese Laid Open Patent Application Nos. 6-4079 and 6-119476, as summarized below.
With the DRNN voice model, the characteristic vector series of a word is input as time series data. In order to obtain an appropriate output for that word, the buildup (connection weight) between the units and the bias of each unit are determined in advance in accordance with a learning precedent (learning procedure). As a result, an output close to the taught output for the word is obtained for voice data spoken by non-specified speakers.
For example, suppose the time series data of the characteristic vector series of the word "ohayo--good morning" spoken by some unspecified speaker is input. In order to obtain an output close to the taught output that would ideally be produced for the word "ohayo--good morning", the data for each dimension of the characteristic vector are applied to the corresponding input units and converted by the buildup and bias established by the learning precedent. This processing is carried out in time series order for each characteristic vector in the series of the input word. Thus, an output close to the taught output for the word is obtained for voice data spoken by a non-specified speaker.
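The frame-by-frame processing described above can be pictured with a minimal sketch. It assumes a small fully recurrent network with sigmoid units and a single output unit; the function name `drnn_forward`, the layer layout, and all parameter names are hypothetical and are not taken from the cited applications.

```python
import math


def sigmoid(x):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def drnn_forward(frames, w_in, w_rec, bias, w_out, b_out):
    """Run an assumed recurrent network over a characteristic vector series.

    frames      : list of characteristic vectors, one per time frame
    w_in[j][i]  : buildup (weight) from input unit i to hidden unit j
    w_rec[j][k] : buildup from hidden unit k at the previous frame to unit j
    bias[j]     : bias of hidden unit j
    w_out[j]    : buildup from hidden unit j to the single output unit
    b_out       : bias of the output unit
    Returns one output value per frame.
    """
    hidden = [0.0] * len(bias)  # unit states before the first frame
    outputs = []
    for x in frames:
        new_hidden = []
        for j in range(len(bias)):
            s = bias[j]
            # each dimension of the vector feeds its corresponding input unit
            s += sum(w_in[j][i] * x[i] for i in range(len(x)))
            # recurrent connections carry the state of the previous frame
            s += sum(w_rec[j][k] * hidden[k] for k in range(len(hidden)))
            new_hidden.append(sigmoid(s))
        hidden = new_hidden
        out = b_out + sum(w_out[j] * hidden[j] for j in range(len(hidden)))
        outputs.append(sigmoid(out))
    return outputs
```

In this sketch the buildup and bias values are simply passed in; in the model described above they would be the values fixed beforehand by learning.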
A DRNN voice model is prepared for each of the words that should be recognized. The learning precedent, which changes the buildup so that an appropriate output is obtained for the respective words, is described on pages 17-24 of the speech technical report of the electronic information communications association, "Technical Report of IEICE sp 92-125 (1993-01)."
A simple explanation of voice recognition that uses a DRNN voice model already trained for certain predetermined words is provided with reference to FIGS. 7(a)-7(c).
With the voice recognition technology accomplished by the DRNN format, certain key words are registered in advance (for example, "ohayo--good morning", "tenki--weather", etc.). These key words are the recognition subject words to be detected within continuous speech (for example, "ohayo, ii tenki desu ne--good morning, it's good weather isn't it?"). For each key word, a value is obtained that shows the level of correctness with which that key word exists in a component of the input voice. Understanding of the continuous speech is accomplished on the basis of these values.
For example, if the speaker making the input says, "ohayo, ii tenki desu ne--Good morning. It's nice weather, isn't it?", then a speech signal such as that shown in FIG. 7(a) is obtained. For this type of speech signal, an output is obtained such as that shown in FIG. 7(b) for the speech signal component "ohayo--good morning". In addition, an output is obtained such as that shown in FIG. 7(c) for the speech signal component "tenki--weather". In FIGS. 7(b) and 7(c), the numerical values 0.9 and 0.8 show the level of correctness (similarity) between the inputted words and the preregistered key words. If the numerical value is as high as 0.9 or 0.8, then the words of the vocal input have a high level of correctness. In other words, as shown in FIG. 7(b), the registered word "ohayo--good morning" exists with a level of correctness of 0.9 as component W1 on the time axis of the input voice signal. The registered word "tenki--weather", as shown in FIG. 7(c), exists with a level of correctness of 0.8 as component W2 on the time axis of the input voice signal.
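As a rough illustration of how a component such as W1 or W2 might be located from a key word's output, the following sketch scans a per-frame output trace for the run of frames that reaches the highest peak above a threshold. The function name `detect_keyword`, the threshold of 0.5, and the two traces are assumptions made up for illustration; only the peak values 0.9 and 0.8 come from the description above.

```python
def detect_keyword(trace, threshold=0.5):
    """Return (peak, first_frame, last_frame) of the best run, or None.

    trace is a list of per-frame output values for one key word model.
    A run is a maximal stretch of frames whose value is >= threshold.
    """
    values = list(trace)
    best = None
    start = None
    # a sentinel below any threshold closes a run that reaches the last frame
    for t, v in enumerate(values + [float("-inf")]):
        if v >= threshold and start is None:
            start = t
        elif v < threshold and start is not None:
            peak = max(values[start:t])
            if best is None or peak > best[0]:
                best = (peak, start, t - 1)
            start = None
    return best


# Hypothetical traces, shaped so that the peaks match the 0.9 and 0.8 above.
ohayo_trace = [0.1, 0.2, 0.7, 0.9, 0.6, 0.1, 0.1, 0.1]
tenki_trace = [0.1, 0.1, 0.1, 0.1, 0.2, 0.6, 0.8, 0.3]

print(detect_keyword(ohayo_trace))  # (0.9, 2, 4): component W1
print(detect_keyword(tenki_trace))  # (0.8, 5, 6): component W2
```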
Thus, recognition of the voice input can be accomplished using the DRNN word model by creating a DRNN word model for each respective word that becomes the subject of recognition.
When a DRNN word model is created for multiple words, learning can be accomplished using voice data in which a recognition subject word and another word are spoken in succession.
For example, as shown in FIG. 8, suppose word 2 is the subject of recognition and word 1 is not. Learning is accomplished so that the output does not rise for the voice data of word 1 but rises for the voice data of word 2 that follows. If the order is reversed, learning is accomplished so that the output rises for the voice data of word 2 and declines for the voice data of word 1 that follows.
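The two-word learning scheme can be sketched as the construction of a taught output for the concatenated voice data. The linear ramps used here are purely an assumption chosen for simplicity, since the text does not specify the shape of the taught output, and the function name `teacher_signal` and its parameters are hypothetical.

```python
def teacher_signal(n_first, n_second, target_is_second):
    """Build an assumed taught output for two words spoken in succession.

    n_first, n_second : frame counts of the first and second spoken word
    target_is_second  : True if the recognition subject word is spoken second
    Returns one taught value per frame of the concatenated data.
    """
    if target_is_second:
        # the output stays low over the other word, then rises over the target
        low = [0.0] * n_first
        rise = [(t + 1) / n_second for t in range(n_second)]
        return low + rise
    # the output rises over the target, then declines over the other word
    rise = [(t + 1) / n_first for t in range(n_first)]
    fall = [1.0 - (t + 1) / n_second for t in range(n_second)]
    return rise + fall


# word 1 (2 frames, not a target) followed by word 2 (4 frames, target)
print(teacher_signal(2, 4, True))   # [0.0, 0.0, 0.25, 0.5, 0.75, 1.0]
# reversed order: word 2 (4 frames, target) followed by word 1 (2 frames)
print(teacher_signal(4, 2, False))  # [0.25, 0.5, 0.75, 1.0, 0.5, 0.0]
```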
Learning is accomplished in this manner by the DRNN word model. However, the problem with the voice recognition process that uses a DRNN word model created by this type of learning is that if words are spoken that are similar to the recognition subject words, a DRNN output having a certain level of correctness will be produced, even if the spoken words are not the recognition subject word.
This problem arises from the way learning is accomplished when the DRNN word model is trained on the continuous two-word voice data described above. Ordinarily, such learning is not carried out between words having similar pronunciation. For example, consider the words "nanji--what time" and "nando--what degree", which have similar sounds (referred to as similar words). At the time of creating the voice model relating to "nanji--what time", if the sound data "nanji--what time" is followed continuously by the sound data "nando--what degree", then the output increases both for the vocal data of "nanji--what time" and for the similar "nando--what degree". Learning to distinguish between these two similar words would create a contradiction in the learning of the component that the two words have in common, namely the shared vocal sound series "nan--what".
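The common component of two similar words can be made concrete with a small sketch that finds the leading sound series they share. Treating each letter of the romanized spelling as one sound is a simplifying assumption, and the function name `shared_leading_sounds` is hypothetical.

```python
def shared_leading_sounds(word_a, word_b):
    """Return the leading run of symbols that two romanized words share."""
    shared = []
    for a, b in zip(word_a, word_b):
        if a != b:
            break  # the words diverge here
        shared.append(a)
    return "".join(shared)


print(shared_leading_sounds("nanji", "nando"))  # prints "nan"
```

Over the frames that correspond to this shared series, the two words present nearly the same input, which is why an output trained to rise for one word also tends to rise for the other.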
In the case where the learned DRNN word model has "nanji--what time is it" as the recognition subject word and the spoken input of a speaker is "nando--what degree is it", there are many situations where, on the basis of the DRNN output, the spoken word "nando--what degree is it" is determined to be "nanji--what time is it".
In addition, there are situations where the user may want to add the capability of recognizing the word "nando--what degree is it" alongside the word "nanji--what time is it", which has already been learned and registered as a recognition subject word. In order to be able to accurately recognize such a similar word, there is a need for a simple word discrimination process.