One prior art approach is a word spotting technique as disclosed in Japanese Patent Hei 4-362699. In general, word spotting does not rely upon a particular pair of speech event boundaries. In other words, in a pure word spotting approach, all possible beginnings and endings are implicitly considered in the pattern-matching and recognition-decision process. For example, a continuous dynamic programming (DP) matching technique continuously adjusts input data in the time domain to enhance matching results. In the word spotting approach, the best matching result is defined as the one having the minimal DP value between input voice data and standard voice pattern data. To determine a speech or word portion in the voice data, the matching path is backtracked from the point of the minimal DP value to a beginning point.
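The continuous DP matching described above can be illustrated with a minimal sketch. It is not the method of the cited patent: the one-dimensional features, the absolute-difference local distance, the three allowed path transitions, and the normalization by template length are all assumptions chosen for brevity, and the function name `spot_word` is hypothetical. Instead of backtracking after the fact, the sketch carries the beginning frame forward along each path, which yields the same beginning point.

```python
def spot_word(inputs, template):
    """Return (score, start, end) for the span of `inputs` that best matches
    `template`. Lower score means a better match (assumed convention)."""
    m = len(template)
    inf = float("inf")
    # For the previous input frame: best accumulated distance and the start
    # frame of the path that ends at each template frame.
    prev = [(inf, 0)] * m
    best = (inf, -1, -1)
    for i, x in enumerate(inputs):
        cur = []
        for j, t in enumerate(template):
            d = abs(x - t)  # local distance (assumed: absolute difference)
            if j == 0:
                # Free start: a new path may begin at any input frame.
                cur.append((d, i))
            else:
                # Assumed transitions: repeat template frame, diagonal step,
                # or advance the template within the same input frame.
                cost, start = min(prev[j], prev[j - 1], cur[j - 1])
                cur.append((cost + d, start))
        # Free end: a path may finish at any input frame.
        score = cur[m - 1][0] / m
        if score < best[0]:
            best = (score, cur[m - 1][1], i)
        prev = cur
    return best


print(spot_word([0, 0, 1, 2, 3, 0, 0], [1, 2, 3]))
```

Because every input frame is a candidate beginning and a candidate ending, no explicit speech boundary detection is needed, which is the property of word spotting noted above.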
The word spotting technique in general has a partial matching problem. That is, only a portion of a whole word is matched, and the partial match is erroneously output as the recognition result. For example, if the Japanese word "roku," meaning the number six, is inputted for voice recognition, the word spotting technique finds at least two matches: the entire word "roku" and a partial match "ku," meaning the number nine. According to one experiment disclosed in Japanese Patent Hei 4-362699, "roku" has a DP value of 3.51 while "ku" has a DP value of 3.34. Because the minimal DP value defines the best match, the partial match "ku" is erroneously selected.
In order to correct the above-described partial match artifact, Japanese Patent Hei 4-362699 weights the DP values according to the length of the match. The DP values are multiplied by a weight value which has a smaller value for a shorter match. As a result of the multiplication, the order of the weighted or corrected DP values for the two matches in the above example is reversed, and the entire word "roku" is correctly recognized as the output.
As a second prior art approach, Japanese Patent Hei 5-127696 discloses a corrective method that uses a statistical tendency of similarity as a function of the length of a match. Specifically, the length of input data is determined, and a similarity between the input data and standard data is calculated. These pairs of values constitute an original data set. Based upon the original data set, a statistical tendency between the two parameters is determined, and second comparison standard data is generated. The input data is then compared against the second comparison standard data so as to reduce erroneous partial matching results.
As a third prior art approach, Japanese Patent Application JP95-00379 discloses a technique for reducing erroneous partial matching in word spotting based upon a number of frames. According to this technique, the number of continuous frames for which a similarity between input data and standard data is above a predetermined threshold is determined in a conventional manner. If clusters of continuous frames are independent of, or non-overlapping with, each other, each cluster is used to recognize a voice output, and the recognized standard is outputted as a voice recognition result. On the other hand, if the continuous frame clusters overlap with each other, the lengths of the clusters are compared, and the longest frame cluster is selected as the voice recognition result. In case of a tie in the frame cluster length, the cluster with the higher similarity value is selected.
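The cluster selection rule described above can be sketched as follows. This is only an illustration under stated assumptions: the `Cluster` record, its field names, and the function `select_clusters` are hypothetical, frame indices are taken as inclusive, and clusters that overlap transitively are treated as one competing group, a detail the description above does not specify. The similarity values in the example are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    word: str          # standard pattern that matched (hypothetical field)
    start: int         # first frame above the threshold, inclusive
    end: int           # last frame above the threshold, inclusive
    similarity: float  # higher means more similar

    @property
    def length(self):
        return self.end - self.start + 1


def pick(group):
    # Longest cluster wins; a tie in length is broken by higher similarity.
    return max(group, key=lambda c: (c.length, c.similarity))


def select_clusters(clusters):
    """Return one recognized cluster per group of overlapping clusters."""
    results, group, group_end = [], [], -1
    for c in sorted(clusters, key=lambda c: c.start):
        if group and c.start <= group_end:
            # Overlaps the current group: it competes with its members.
            group.append(c)
            group_end = max(group_end, c.end)
        else:
            # Independent of the current group: close it and start a new one.
            if group:
                results.append(pick(group))
            group, group_end = [c], c.end
    if group:
        results.append(pick(group))
    return results


candidates = [
    Cluster("ku", 20, 29, 0.85),    # partial match inside "roku"
    Cluster("roku", 10, 29, 0.80),
    Cluster("ichi", 40, 55, 0.70),  # independent cluster
]
print([c.word for c in select_clusters(candidates)])
```

In this example the partial match "ku" overlaps the longer "roku" cluster and is discarded even though its similarity is higher, while the non-overlapping "ichi" cluster is recognized independently.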
In view of the above-described prior art approaches, an efficient technique for determining the number of continuous frames for each input data is desired for real-time voice recognition. Conventionally, the number of frames is determined retroactively after a certain predetermined period of silence is confirmed in the input voice data. That is, the path is retraced in the input voice data to count the number of frames. Furthermore, in determining a speech boundary using power or zero-crossing information, a speech ending must first be determined. In either case, the above techniques require a predetermined amount of time and/or a certain amount of processing time. For these and other reasons, it is difficult to implement a fast voice recognition system without a real-time frame counting technique, let alone real-time voice recognition with a substantially reduced rate of erroneous partial matching.