Known keyword recognition systems compare incoming speech to templates, which are stored parametric representations of spoken examples (tokens) of the keywords to be detected. Typical recognition algorithms seek the closest match (minimum Euclidean distance) between the incoming speech and the keyword templates. Using a method known as "dynamic time warping" (DTW), the recognition algorithms can warp the stored keyword templates into time alignment with representations of the incoming speech. The warped keyword templates are termed time-elastic because they can be either expanded or compressed in time to account for variations in pronunciation or speaking rate. A keyword is "recognized" by the system when a segment of the incoming speech is found to be sufficiently similar to a keyword template after optimal time alignment.
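By way of illustration only, the time alignment described above can be sketched as a small dynamic-programming routine. The function names `frame_distance` and `dtw_distance`, the representation of each frame as a list of numbers, and the three permitted alignment steps are assumptions made for this sketch and are not taken from any particular known system.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two parameter vectors (frames)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(template, segment):
    """Accumulated distance of the best time alignment of template to segment.

    Both arguments are sequences of frames; the whole template is aligned
    to the whole segment (endpoints assumed known).
    """
    n, m = len(template), len(segment)
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning template[:i] with segment[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(template[i - 1], segment[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # two template frames on one input frame (compression)
                cost[i][j - 1],      # one template frame on two input frames (expansion)
                cost[i - 1][j - 1],  # one-to-one step
            )
    return cost[n][m]


# Hypothetical one-dimensional frames: the same pattern spoken at half speed.
template = [[0.0], [1.0], [2.0]]
slow = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]
print(dtw_distance(template, slow))  # the template is expanded to fit exactly
```

In this sketch the slower utterance aligns to the template with zero accumulated distance, which is the sense in which the template is time-elastic.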
DTW was originally used for recognition of isolated words with known endpoints, or of words in continuous speech whose endpoints were determined by word boundary detection. An example of wordspotting in continuous speech is described in the publication "An Efficient Elastic-Template Method for Detecting Given Words in Running Speech", by J. S. Bridle, published in British Acoustical Society Spring Meeting, April 1973, pages 1-4, which is incorporated herein by reference. Typically, as each frame i of a segment of incoming speech is received, the wordspotting method uses the DTW algorithm to compute the optimal time-alignment path for each keyword template and maintains a "dissimilarity score" based on the computed distance of the template from the segment of input speech. A keyword is declared to have occurred if the dissimilarity score falls below a fixed threshold level.
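A frame-by-frame wordspotting loop of the general kind described above might be sketched as follows. This is a simplified illustration and not the method of the cited publication: the function name `spot_keyword`, the use of single numbers as frames, the free start at every input frame, and the normalization of the score by template length are all assumptions of the sketch.

```python
def spot_keyword(template, frames, threshold):
    """Return the input frame indices at which the keyword is declared.

    template and frames are lists of numbers (hypothetical one-dimensional
    frames). One column of accumulated scores is updated per input frame.
    """
    inf = float("inf")
    n = len(template)
    prev = [inf] * n  # accumulated scores after the previous input frame
    hits = []
    for i, frame in enumerate(frames):
        cur = [inf] * n
        for k in range(n):
            d = abs(template[k] - frame)
            if k == 0:
                best = 0.0  # a match may begin at any input frame at no cost
            else:
                best = min(prev[k], prev[k - 1], cur[k - 1])
            cur[k] = d + best
        # dissimilarity score of the best path ending at this frame
        score = cur[n - 1] / n
        if score < threshold:
            hits.append(i)
        prev = cur
    return hits


# Hypothetical input: the keyword pattern 1, 2, 3 embedded in other speech.
print(spot_keyword([1.0, 2.0, 3.0], [0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 0.0], 0.1))
```

Because the accumulated score at the first template frame is reset for every input frame, no prior knowledge of the word endpoints is needed; the keyword is declared at the frame where the score at the last template frame drops below the threshold.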
Systems employing the known wordspotting methods frequently make two types of errors: keywords are spoken but not detected ("misses"), and keywords are detected when they have not been spoken ("false alarms"). Because a keyword is declared whenever the dissimilarity score falls below the threshold level, raising the threshold to make the system more sensitive and avoid misses also tends to increase the rate of false alarms. Wordspotting is particularly difficult when there is a lack of constraints on the input speech, i.e., when the speaker is assumed to be non-cooperative.
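The trade-off between misses and false alarms can be illustrated with a short count over dissimilarity scores. The function name `count_errors` and the score values below are invented for illustration and are not measured results.

```python
def count_errors(keyword_scores, other_scores, threshold):
    """Count misses and false alarms for a given threshold level.

    keyword_scores: dissimilarity scores of segments that do contain the keyword.
    other_scores: dissimilarity scores of segments that do not.
    A keyword is declared when a score falls below the threshold.
    """
    misses = sum(1 for s in keyword_scores if s >= threshold)
    false_alarms = sum(1 for s in other_scores if s < threshold)
    return misses, false_alarms


# Hypothetical scores, chosen only to show the direction of the trade-off.
keyword_scores = [0.2, 0.4, 0.9]
other_scores = [0.5, 0.8, 1.5]
for threshold in (0.3, 0.6, 1.0):
    print(threshold, count_errors(keyword_scores, other_scores, threshold))
```

With these invented values, the lowest threshold gives two misses and no false alarms, while the highest gives no misses and two false alarms.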
The DTW method has also been extended to connected speech recognition. Connected speech recognition (CSR) methods attempt to go beyond the constraints of conventional wordspotting by finding the optimal sequence of templates matching a longer segment of continuous or connected speech, such as a phrase or sentence. CSR methods are often employed when it is known in advance that the input speech will be made up of a sequence of vocabulary words for which stored templates exist. All of the stored templates are compared to the input speech, and an optimal template sequence is concatenated by tracking the minimum total distance of the template sequence from the input speech. Such known CSR methods are described in "Partial Traceback and Dynamic Programming", by P. F. Brown, et al., Proc. ICASSP 1982, pages 1629-1632, "Experiments in Connected Word Recognition", by H. Ney, Proc. ICASSP 1983, pages 288-291, and "An Algorithm for Connected Word Recognition", by J. Bridle, et al., Proc. ICASSP 1982, pages 899-902, which are also incorporated herein by reference. Such CSR methods employ dynamic time warping algorithms to determine optimal time alignments, and dynamic programming methods for determining the best matching template sequence producing the minimum accumulated distance from the input speech. "Trace back" of the optimal template sequence and time alignment is applied from the end of the utterance when the utterance has been completed, or from a frame within the utterance. In maintaining a minimum distance score, time alignment distortions may be compensated for by adding fixed penalties for specific types of time axis distortions. When the optimal template sequence has been determined, one or more keywords contained in that sequence are deemed to be "recognized".
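The search for an optimal template sequence may be illustrated by the following one-pass dynamic-programming sketch. It is not the method of any of the cited publications. The function name `recognize_sequence`, the parameter `stay_penalty` (a fixed penalty for repeating a template frame), and the single-number frames are assumptions of the sketch. For brevity, each partial path carries its template history forward rather than storing back-pointers for a separate trace back from the end of the utterance.

```python
def recognize_sequence(templates, frames, stay_penalty=0.0):
    """Return (total_distance, template_names) for the best template sequence.

    templates: dict mapping a name to a list of numbers (hypothetical
    one-dimensional frames). frames: list of numbers for the whole utterance.
    """
    inf = float("inf")
    # score[w][k]: best accumulated distance of a path now at frame k of template w
    score = {w: [inf] * len(t) for w, t in templates.items()}
    # words[w][k]: template names along that path, including w
    words = {w: [None] * len(t) for w, t in templates.items()}
    best_end, best_end_words = 0.0, []  # best path ending on a template boundary
    for frame in frames:
        new_score = {w: [inf] * len(t) for w, t in templates.items()}
        new_words = {w: [None] * len(t) for w, t in templates.items()}
        for w, t in templates.items():
            for k in range(len(t)):
                # repeat the same template frame (time expansion, penalized)
                cands = [(score[w][k] + stay_penalty, words[w][k])]
                if k > 0:
                    # advance within the template
                    cands.append((score[w][k - 1], words[w][k - 1]))
                elif best_end < inf:
                    # start a new template after the best completed sequence
                    cands.append((best_end, best_end_words + [w]))
                cost, history = min(cands, key=lambda c: c[0])
                if cost < inf:
                    new_score[w][k] = cost + abs(t[k] - frame)
                    new_words[w][k] = history
        score, words = new_score, new_words
        # best path that completes some template exactly at this input frame
        end_w = min(templates, key=lambda w: score[w][-1])
        best_end, best_end_words = score[end_w][-1], words[end_w][-1]
    return best_end, best_end_words


# Hypothetical two-word vocabulary and an utterance "one two one",
# with the first "one" spoken slowly.
templates = {"one": [1.0, 2.0], "two": [5.0, 4.0, 3.0]}
utterance = [1.0, 2.0, 2.0, 5.0, 4.0, 3.0, 1.0, 2.0]
print(recognize_sequence(templates, utterance))
```

Every template is compared against every input frame, which gives a sense of why the computational load grows with vocabulary size and template length.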
Systems employing the known CSR methods frequently give false alarms when the incoming speech contains acoustic patterns similar to those of keywords, e.g., sequences of short syllables, words, or speech sounds which resemble keywords. That is, an "optimal" template sequence may include a keyword template triggered by a similar acoustic pattern in the input speech. The frequency of false alarms depends upon how distinctive the acoustic patterns of the keywords are within the language. Another shortcoming of systems employing the known CSR methods is that they require a large amount of computational power, which makes it difficult to implement practical systems that operate in real time. All of these shortcomings have restricted the development of keyword recognition systems for practical applications.