Among prior art attempts, one speech recognition method detects an end of each voice element. In such an endpoint detection method, a start and an end are determined in part based upon power information, and each detected sound unit is then processed. Referring to FIG. 1, in a step S1, a recognition process is initiated, and in a step S2, endpoint detection is performed on voice data so as to detect voice elements. In the step S2, a start of a voice element is generally determined based upon power information, while an end is determined based upon a presence of a predetermined minimal period of silence. This end determination usually enables distinction between a voice element end and a silence before a certain consonant as well as certain other silence within a word. In a step S3, the detected voice elements are compared to a dictionary, and an element that has the highest degree of similarity is selected. Lastly, the selected item from the dictionary is outputted as a result in a step S4.
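The step S2 described above may be illustrated by the following minimal sketch. It is not taken from any particular prior art system. The per-frame power values, the name `detect_endpoints`, and the parameters `power_threshold` and `min_silence_frames` are assumptions introduced for illustration only.

```python
def detect_endpoints(frame_power, power_threshold, min_silence_frames):
    """Return (start, end) frame indices of voice elements; end is exclusive.

    Hypothetical sketch: a start is declared when the power reaches the
    threshold, and an end is declared only after `min_silence_frames`
    consecutive frames stay below the threshold.
    """
    segments = []
    start = None          # index of the first voiced frame of the current element
    silence_run = 0       # consecutive low-power frames seen inside an element
    for i, power in enumerate(frame_power):
        if power >= power_threshold:
            if start is None:
                start = i
            silence_run = 0   # a short silence inside a word is absorbed
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                # The end is only confirmed now, min_silence_frames later.
                segments.append((start, i - silence_run + 1))
                start = None
                silence_run = 0
    if start is not None:
        # Input stopped before the silence period could be confirmed.
        segments.append((start, len(frame_power) - silence_run))
    return segments


# Assumed toy input: one silent frame inside the first element (index 4).
power = [0, 0, 5, 5, 0, 5, 5, 0, 0, 0, 5, 0, 0, 0]
segments = detect_endpoints(power, power_threshold=1, min_silence_frames=3)
print(segments)
```

In this sketch, the single silent frame inside the first element does not end it, because it is shorter than the required silence period. Each end is confirmed only after the full silence period has been observed.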
The above described endpoint detection requires a certain predetermined silent period. In determining a voice element end, a period ranging from 250 milliseconds to 350 milliseconds is generally used. This method requires that, for each potential ending, silence be confirmed for the above described period. In other words, the processing is not able to output a result during the above described period. Consequently, the method is considered to be rather slow in the prior art. Although the above predetermined silence period was shortened so as to improve performance in a certain prior art system, the erroneous endpoint detection rate increased since certain words include a short silent period. The erroneous endpoint detection problem is also compounded by the presence of certain filler speech which occurs between words and does not necessarily have meaning. For example, when speech such as "ah" and "well" is uttered, words combined with that speech have a reduced degree of similarity.
In order to solve the above described problems of the endpoint detection method, a word spotting method has been used, as shown in FIG. 2. In a step S11, speech recognition is initiated without detecting an end of a speech recognition element. As soon as voice data is available, the inputted voice data is compared to a predetermined standard dictionary in a step S12. In a step S13, the result, namely the degree of similarity, is further compared to a predetermined threshold value. If the result fails to exceed the predetermined threshold value, the above described steps are repeated. On the other hand, if the result exceeds the predetermined threshold value, the input voice data is outputted as recognition data in a step S14. Since the word spotting method generally renders an output without a delay for detecting an end, the method enables fast speech recognition. Furthermore, the word spotting method removes certain unnecessary speech and improves recognition accuracy.
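The steps S12 through S14 may be sketched as follows. This is a simplified illustration under stated assumptions rather than an actual prior art implementation. Characters stand in for acoustic frames, and the toy `similarity` measure, the function `word_spotting`, and the threshold value are all hypothetical.

```python
def similarity(window, template):
    """Toy similarity: fraction of positions at which the symbols agree."""
    matches = sum(a == b for a, b in zip(window, template))
    return matches / len(template)


def word_spotting(stream, dictionary, threshold):
    """Yield (position, word) as soon as a similarity exceeds the threshold.

    No endpoint is detected; every newly arrived symbol triggers a
    comparison against each dictionary entry.
    """
    buffer = []
    for position, symbol in enumerate(stream):
        buffer.append(symbol)                      # voice data becomes available
        for word in dictionary:                    # S12: compare to the dictionary
            if len(buffer) < len(word):
                continue
            window = buffer[-len(word):]
            if similarity(window, word) > threshold:   # S13: threshold test
                yield position, word                   # S14: output the result


# Assumed toy input: the fillers "ah" and "well" surround the words "go" and "roku".
results = list(word_spotting("ahgowellroku", ["go", "roku"], threshold=0.9))
print(results)
```

In this sketch, the filler portions never exceed the threshold and are therefore not outputted, while each word is reported at the position where its last symbol arrives.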
Despite the above described advantages, the word spotting method is generally susceptible to speech recognition errors in detecting repetitive sound elements. For example, referring to FIG. 3, in the Japanese language, the numbers five, six and seven are respectively pronounced "go," "roku" and "nana." When "5677" is pronounced as shown, "go" and "roku" are correctly recognized according to a known word spotting method. However, since "nana" is uttered twice in succession, the continuous comparisons of the word spotting method cause "nana" to be erroneously detected three times. It is desired that the above described erroneous detection of repetitive elements be substantially reduced while the advantages of the word spotting method are preserved.
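One plausible way to picture the erroneous third detection is the following toy sketch. It assumes, for illustration only, that characters stand in for acoustic frames and that a match is an exact comparison at every position. The function `spot_words` is hypothetical, and the sketch is not asserted to reflect how any actual system of FIG. 3 operates.

```python
def spot_words(stream, dictionary):
    """Report every position at which a dictionary word matches the stream."""
    hits = []
    for start in range(len(stream)):       # continuous comparison, no endpoints
        for word in dictionary:
            if stream.startswith(word, start):
                hits.append((start, word))
    return hits


# "5677" pronounced as "go", "roku", "nana", "nana" without pauses.
stream = "go" + "roku" + "nana" + "nana"
hits = spot_words(stream, ["go", "roku", "nana"])
detected = [word for _, word in hits]
print(detected)  # ['go', 'roku', 'nana', 'nana', 'nana']
```

Under these assumptions, the second half of the first "nana" and the first half of the second "nana" together also match the dictionary entry, so the entry is reported three times although it was uttered twice.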