Our invention relates to speech analysis and, more particularly, to dynamic time warping arrangements for speech pattern recognition.
Speech recognizers permit direct input to communication, data processing and control systems. A recognizer typically has a reference vocabulary stored digitally as acoustic patterns called templates. An input utterance is converted to digital form and compared to the reference templates. The most similar template is selected as the identity of the input. An overview of automatic recognition may be found in the article by S. E. Levinson and M. Y. Liberman entitled "Speech Recognition by Computer", Scientific American, April 1981, Vol. 244, No. 4, pages 64-76.
In order to compare an input pattern, e.g. a spoken word, with a reference, each word is divided into a sequence of time frames. In each time frame, signals representative of acoustic features of the speech pattern are obtained. For each frame of the input word, a frame of the reference word is selected. Signals representative of the similarity or correspondence between each selected pair of frames are obtained responsive to the acoustic feature signals. The correspondence signals for the sequence of input and reference word frame pairs are used to obtain a signal representative of the global or overall similarity between the input word and a reference word template.
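By way of illustration only, the frame-by-frame comparison described above may be sketched as follows. The sketch assumes that each frame is a list of numeric acoustic features, that the local correspondence signal is a Euclidean distance, and that the reference frame for each input frame is chosen by linear time scaling. The names `local_distance` and `linear_global_distance` are hypothetical and do not appear in any cited reference.

```python
import math


def local_distance(frame_a, frame_b):
    """Local correspondence signal: Euclidean distance between two
    frames' feature vectors (an assumed measure; smaller is more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(frame_a, frame_b)))


def linear_global_distance(test, ref):
    """Global dissimilarity under a linear selection of frame pairs.

    Each input frame i is paired with the reference frame whose relative
    position in time is closest, and the local distances are averaged.
    """
    n, m = len(test), len(ref)
    total = 0.0
    for i in range(n):
        # Map input frame index i onto the reference time axis.
        j = round(i * (m - 1) / (n - 1)) if n > 1 else 0
        total += local_distance(test[i], ref[j])
    return total / n
```

A smaller returned value indicates a closer match between the input word and the reference template under this fixed, linear pairing.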
Since there are many different ways of pronouncing the same word, the displacement in time of the acoustic features comprising the word is variable. Different utterances of the same word, even by the same individual, may be widely out of time alignment. The selection of frame pairs is therefore not necessarily linear. Matching, for example, the fourth, fifth and sixth frames of the input utterance with the fourth, fifth and sixth frames respectively of the reference word may distort the similarity measure and produce unacceptable errors.
Dynamic time warping arrangements have been developed which align the frames of a test and reference pattern in an efficient or optimal manner. The alignment is optimal in that the global similarity measure assumes an extremum. It may be, for example, that the fifth frame of the test word should be paired with the sixth frame of the reference word to obtain the best similarity measure.
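The optimal alignment described above may be illustrated by a minimal dynamic programming sketch. It assumes that the extremum is a minimum of accumulated Euclidean frame distances and that a path may advance by one frame in the test pattern, the reference pattern, or both. These step rules and the name `dtw_distance` are assumptions chosen for illustration, not the arrangement of any particular recognizer.

```python
import math


def dtw_distance(test, ref):
    """Minimum accumulated distance over all monotonic alignments of
    the test frames with the reference frames."""
    n, m = len(test), len(ref)
    inf = float("inf")
    # acc[i][j] holds the best accumulated distance for aligning the
    # first i test frames with the first j reference frames.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(test[i - 1], ref[j - 1])  # local distance
            acc[i][j] = d + min(acc[i - 1][j],      # repeat reference frame
                                acc[i][j - 1],      # repeat test frame
                                acc[i - 1][j - 1])  # advance both
    return acc[n][m]


# Same sound sequence, but the second sound lasts longer in the test word.
test_word = [[0.0], [1.0], [1.0], [2.0]]
ref_word = [[0.0], [1.0], [2.0], [2.0]]
print(dtw_distance(test_word, ref_word))
```

For these two patterns a strictly linear pairing of like-numbered frames accumulates a nonzero distance at the third frame, whereas the warped path pairs the third test frame with the second reference frame and accumulates none.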
In U.S. Pat. No. 3,816,722 (Sakoe et al.), feature signals are obtained for the frames of input and reference words. Local correspondence signals are generated between all possible pairs of input and reference word frames within a given range or window. An extremum path of local correspondence signals is selected by a dynamic programming recursive process. The generation of all local correspondence signals within a given window, however, may be burdensome. In a simple microprocessor system, for example, the Sakoe process is comparatively slow.
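The computational burden of a windowed search may be made concrete with the following sketch, which counts the local correspondence signals generated. It is a simplified illustration and not a reproduction of the patented apparatus: it assumes a window of half-width `r` frames about the diagonal, Euclidean local distances, and a minimum accumulated distance as the extremum. The name `windowed_dtw` and the parameter `r` are hypothetical.

```python
import math


def windowed_dtw(test, ref, r):
    """Dynamic programming alignment restricted to frame pairs (i, j)
    with |i - j| <= r.

    Returns the minimum accumulated distance and the number of local
    distances that were generated. The distance is infinite when the
    window is too narrow to reach the final frame pair.
    """
    n, m = len(test), len(ref)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    evaluations = 0
    for i in range(1, n + 1):
        # Every reference frame inside the window is evaluated.
        for j in range(max(1, i - r), min(m, i + r) + 1):
            d = math.dist(test[i - 1], ref[j - 1])
            evaluations += 1
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m], evaluations
```

For two patterns of N frames each, the number of local distances generated grows roughly as N(2r + 1), since every frame pair within the window is evaluated whether or not it lies on the selected path.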
It is thus an object of the invention to provide an improved arrangement for time alignment of speech patterns which reduces the number of local correspondence signals that must be generated.