The present invention relates to speech recognition, and, more particularly, to speech recognition wherein spoken word end points are not predetermined.
Recognition of isolated words from a given vocabulary for a known speaker has been known for some time. Words of the vocabulary are prestored as individual templates, each template representing the sound pattern for a word in the vocabulary. When an isolated word is spoken, the system compares the word to each individual template representing the vocabulary. This method is commonly referred to as whole-word template matching. Many successful recognition systems use whole-word template matching with dynamic programming to cope with nonlinear time scale variations between the spoken word and the prestored template.
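By way of illustration only, whole-word template matching with dynamic programming can be sketched as follows. The sketch assumes that each frame of a pattern is reduced to a single number and that the local distance is an absolute difference; a practical recognizer would compare multi-dimensional feature vectors and would usually constrain the warping path. The names `dtw_distance` and `recognize` are hypothetical.

```python
from math import inf

def dtw_distance(pattern, template):
    """Accumulated distance of the best nonlinear time alignment
    between a spoken pattern and a prestored template."""
    n, m = len(pattern), len(template)
    # cost[i][j]: best accumulated distance aligning the first i frames
    # of the pattern with the first j frames of the template.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed local distance: absolute difference of scalar frames.
            local = abs(pattern[i - 1] - template[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # pattern frame repeats a template frame
                                     cost[i][j - 1],      # template frame repeats a pattern frame
                                     cost[i - 1][j - 1])  # both advance together
    return cost[n][m]

def recognize(pattern, templates):
    """Return the vocabulary word whose template is closest to the pattern."""
    return min(templates, key=lambda word: dtw_distance(pattern, templates[word]))

# Illustrative data: a time-stretched utterance of the first template.
templates = {"yes": [1, 3, 5, 3, 1], "no": [5, 5, 1, 1]}
spoken = [1, 1, 3, 3, 5, 5, 3, 1]
best_word = recognize(spoken, templates)
```

The isolated word is assumed to arrive with its beginning and ending already marked, so the whole of `spoken` is compared against each template in turn.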
Although this technique has been effective for isolated word recognition applications, many practical applications require continuous word recognition. In continuous word recognition, the number of words in a phrase can be unlimited and the identity of the earlier words can be determined before the end of the phrase, whereas in isolated word recognition, delimiters are used to identify the beginning and ending of input patterns and recognition occurs one word at a time. Moreover, a continuous speech recognition system must distinguish an input pattern from other recognizable patterns, background noise, and speaker-induced noise such as breathing noise, whereas an isolated word recognition system usually cannot tolerate other recognizable patterns at the beginning or ending of the word.
In "Two level DP Matching--A dynamic programming based pattern matching algorithm for connected word recognition", H. Sakoe, IEEE Trans. Acoustics, Speech and Signal Processing, Vol. ASSP-27, No. 6, pp. 588-595, Dec. 1979, the method of whole-word template matching has been extended to deal with connected word recognition. The paper suggests a two-pass dynamic programming algorithm to find the sequence of word templates which best matches the whole input pattern. In the first pass, scores are generated which indicate the similarity between each template and every possible portion of the input pattern. In the second pass, these scores are used to find the best sequence of templates corresponding to the whole input pattern.
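The two passes described above can be sketched as follows. This is a brute-force illustration under simplifying assumptions (scalar frames, absolute-difference local distance, no path constraints) and is not the implementation given in the cited paper; the names `dtw_distance` and `two_level_match` are hypothetical.

```python
from math import inf

def dtw_distance(pattern, template):
    """Accumulated distance of the best time alignment of two sequences."""
    n, m = len(pattern), len(template)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(pattern[i - 1] - template[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]

def two_level_match(pattern, templates):
    """Return (word sequence, total score) for the whole input pattern."""
    n = len(pattern)
    # First pass: score every template against every portion pattern[s:e]
    # and keep the best (score, word) pair for each portion.
    segment = {}
    for s in range(n):
        for e in range(s + 1, n + 1):
            segment[(s, e)] = min((dtw_distance(pattern[s:e], t), w)
                                  for w, t in templates.items())
    # Second pass: best[e] is the lowest total score of any template
    # sequence that exactly covers pattern[0:e].
    best = [inf] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0.0
    for e in range(1, n + 1):
        for s in range(e):
            score, word = segment[(s, e)]
            if best[s] + score < best[e]:
                best[e] = best[s] + score
                back[e] = (s, word)
    # Trace back from the end of the input to recover the word sequence.
    words, e = [], n
    while e > 0:
        s, word = back[e]
        words.append(word)
        e = s
    return words[::-1], best[n]

# Illustrative data: three connected words with stretched frames.
templates = {"one": [1, 2, 1], "two": [6, 7, 8, 6]}
spoken = [1, 2, 2, 1, 6, 7, 8, 8, 6, 1, 1, 2, 1]
words, total = two_level_match(spoken, templates)
```

In this sketch the first pass alone evaluates every template against every start and end point of the input, and the traceback cannot begin until the final frame has been received.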
This extended method has distinct disadvantages. One disadvantage is the amount of computation time it requires. Depending on the specific design requirements, this limitation may necessitate an expensive high-speed processor.
Another disadvantage of this method is that the endpoints of the input pattern must be predetermined and the whole input pattern must be stored in the system before any accurate template matching can occur. For an input pattern of any significant length, recognition response time would be substantially degraded. Also, errors in endpoint detection will seriously degrade recognizer performance. Further, the memory required to store this information may become excessive.
In "Partial Traceback and Dynamic Programming", P. Brown, J. Spohrer, P. Hochschild, and J. Baker, Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, pp. 1629-1632, May 1982, a technique is described which allows for continuous speech recognition of arbitrarily long input patterns without predetermination of endpoints. This is accomplished using a technique called partial traceback. Partial traceback allows recognized words to be output before the input pattern is complete, without sacrificing recognizer performance. However, the partial traceback technique described appears to be computationally burdensome and cumbersome to implement.
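The idea behind partial traceback can be illustrated with a much simplified sketch. It assumes that every active hypothesis keeps a chain of backpointers to the words it has passed through, and that a word may be output once all active hypotheses share it as a common ancestor. The `Node` class and the `partial_traceback` function are hypothetical and are not taken from the cited paper.

```python
class Node:
    """One recognized word in a hypothesis, linked to the word before it."""
    def __init__(self, word, parent=None):
        self.word = word
        self.parent = parent

def history(node):
    """Nodes from the start of the utterance up to and including this node."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain[::-1]

def partial_traceback(active, emitted):
    """Return the words shared by every active hypothesis that have not
    been output yet; `emitted` counts the words already output."""
    if not active:
        return []
    histories = [history(node) for node in active]
    common = 0
    for column in zip(*histories):
        # A position is settled only if all hypotheses pass through
        # the very same node there.
        if any(node is not column[0] for node in column):
            break
        common += 1
    return [node.word for node in histories[0][emitted:common]]

# Illustrative data: two hypotheses that agree on the first word only.
root = Node("call")
a = Node("home", root)
b = Node("phone", root)
first = partial_traceback([a, b], emitted=0)

# Later, only extensions of hypothesis `a` remain active.
c = Node("now", a)
d = Node("please", a)
second = partial_traceback([c, d], emitted=1)
```

Here "call" can be output while the utterance is still in progress, and "home" follows once the competing hypothesis has been dropped.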
Accordingly, there is a need for a continuous speech recognition system which can easily be implemented, yet can operate efficiently and inexpensively in real time.