Speech recognition is the ability of a computer-controlled system to recognize speech. Wordspotting is an application of speech recognition technology that enables the location of keywords or phrases in the context of fluent speech.
The use of hidden Markov models (HMMs) for wordspotting in speech recognition applications is well known in the art. See, for example:
J. R. Rohlicek, W. Russell, S. Roukos, H. Gish, "Continuous Hidden Markov Modeling for Speaker-Independent Word Spotting". Proc. of the Int. Conf. Acoustics, Speech and Signal Processing, Glasgow, Scotland, May 1989, pp. 627-630;
R. C. Rose, D. B. Paul, "A Hidden Markov Model Based Keyword Recognition System". Proc. of the Int. Conf. on Acoustics, Speech and Signal Processing, April 1990, pp. 129-132;
J. G. Wilpon, L. R. Rabiner, C. H. Lee, E. R. Goldman, "Automatic Recognition of Keywords in Unconstrained Speech Using Hidden Markov Models". IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 38, No. 11, November 1990, pp. 1870-1878;
L. R. Bahl, P. F. Brown, P. V. de Souza, R. L. Mercer, M. A. Picheny, "Acoustic Markov Models Used in the Tangora Speech Recognition System". Proc. of the Int. Conf. on Acoustics, Speech and Signal Processing, New York, N.Y., April 1988, pp. 497-500;
J. G. Wilpon, C. H. Lee and L. R. Rabiner, "Application of Hidden Markov Models for Recognition of a Limited Set of Words in Unconstrained Speech", Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Vol. 1, pp. 254-257, Glasgow, Scotland, May 1989; the contents of all of the preceding papers being incorporated herein by reference.
The system of the invention uses HMMs to model speaker utterances. A hidden Markov model consists of a set of states with associated outputs, where the output of a state is a feature vector describing a sound. Transition probabilities between states allow a sequence of sounds to be modeled. In this invention, the states of the HMMs correspond to clusters of sounds, or acoustic units. A keyword is modeled as a specific sequence of states, or acoustic units. Non-keyword speech is modeled as an arbitrary sequence of these units.
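The distinction between the keyword model and the non-keyword model can be illustrated with a minimal sketch. It is not the implementation of the invention: the function names, the self-loop probability of 0.5, the uniform non-keyword transitions, and the per-frame unit log-likelihoods are all assumptions chosen for illustration. A keyword is a left-to-right chain of acoustic units, non-keyword speech is a fully connected network over the same units, and a standard Viterbi search scores a stretch of frames against each.

```python
import math

NEG_INF = float("-inf")


def _log(p):
    return math.log(p) if p > 0.0 else NEG_INF


def keyword_transitions(num_states, self_loop=0.5):
    """Left-to-right chain: each state may repeat or advance to the next.
    The last state's remaining probability mass is the implicit exit."""
    trans = [[0.0] * num_states for _ in range(num_states)]
    for i in range(num_states):
        trans[i][i] = self_loop
        if i + 1 < num_states:
            trans[i][i + 1] = 1.0 - self_loop
    return trans


def filler_transitions(num_units):
    """Non-keyword model: any acoustic unit may follow any other."""
    p = 1.0 / num_units
    return [[p] * num_units for _ in range(num_units)]


def state_emissions(unit_loglik, unit_sequence):
    """Map per-frame unit log-likelihoods onto the keyword's states."""
    return [[frame[u] for u in unit_sequence] for frame in unit_loglik]


def viterbi(trans, emit, start, final):
    """Best-path log score. emit[t][s] is the log-likelihood of frame t in
    state s; start and final list the allowed first and last states."""
    n = len(trans)
    score = [emit[0][s] if s in start else NEG_INF for s in range(n)]
    for t in range(1, len(emit)):
        score = [
            max(score[i] + _log(trans[i][j]) for i in range(n)) + emit[t][j]
            for j in range(n)
        ]
    return max(score[s] for s in final)


def keyword_vs_filler(unit_loglik, unit_sequence, num_units):
    """Return (keyword score, non-keyword score) for one stretch of frames."""
    kw = viterbi(
        keyword_transitions(len(unit_sequence)),
        state_emissions(unit_loglik, unit_sequence),
        start=[0],
        final=[len(unit_sequence) - 1],
    )
    units = list(range(num_units))
    filler = viterbi(filler_transitions(num_units), unit_loglik, units, units)
    return kw, filler
```

In this sketch a stretch of speech would be hypothesized as a keyword when its keyword score exceeds its non-keyword score; a practical system would normally apply a tuned threshold to the difference rather than a bare comparison.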
Previous speaker-dependent wordspotting systems have been based on template matching using dynamic time warping (DTW), as described in the following papers:
R. W. Christiansen, C. K. Rushforth, "Detecting and Locating Key Words in Continuous Speech Using Linear Predictive Coding". IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-25, No. 5, October 1977, pp. 361-367;
A. L. Higgins, R. E. Wohlford, "Keyword Recognition Using Template Concatenation". Proc. of the Int. Conf. on Acoustics, Speech and Signal Processing, Tampa, Fla., March 1985, pp. 1233-1236;
C. S. Myers, L. R. Rabiner, A. E. Rosenberg, "An Investigation of the Use of Dynamic Time Warping for Word Spotting and Connected Speech Recognition". Proc. of the Int. Conf. on Acoustics, Speech and Signal Processing, Denver, Colo., April 1980, pp. 173-177.
While these techniques are applicable to wordspotting tasks, they are generally inferior to HMMs in modeling the acoustic variability, due to speaking rate, context, etc., associated with multiple repetitions of a keyword. HMMs also provide a more natural means of modeling non-keyword speech than do the filler templates used in the more sophisticated DTW-based systems. See the March 1985 Higgins et al. paper.
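For comparison, the template-matching approach of the earlier systems can be sketched as the textbook form of dynamic time warping. This is an illustrative assumption, not the specific formulation of any of the cited papers, which use their own features, local path constraints, and normalization. The function name and the Euclidean frame distance are likewise chosen only for this example.

```python
import math


def dtw_distance(template, utterance):
    """Accumulated distance of the best time alignment between a keyword
    template and an utterance segment, each a list of feature vectors."""
    n, m = len(template), len(utterance)
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning the first i template
    # frames with the first j utterance frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(template[i - 1], utterance[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # template frame advances alone
                cost[i][j - 1],      # utterance frame advances alone
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]
```

The warping absorbs differences in speaking rate, but the template itself is a single fixed example of the keyword, so the spread across repetitions is not represented statistically. That limitation is what the comparison with HMMs above refers to.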