1. Field of the Invention
The present invention relates to speech recognition in general and, specifically, to word hypothesis in continuous speech decoding using hidden Markov model (HMM) phone models.
2. Description of Related Art
The tremendous complexity of continuous speech decoding is due to the uncertainty of word identities and their locations in an utterance. As vocabulary size and complexity increase, the decoding process usually becomes extremely computationally expensive. A two-pass decoding strategy, which employs a fast acoustic match to prepare a partial or complete word hypothesis lattice over an utterance in the first pass and a detailed decoding guided by the word lattice in the second pass, can significantly reduce the search complexity. There have been a few efforts in this direction. Micro-segments of broad classification units were used in generating a word hypothesis lattice; see L. Fissore et al., "Interaction Between Fast Lexical Access and Word Verification in Large Vocabulary Continuous Speech Recognition," Proc. ICASSP, pp. 279-282, New York, N.Y., 1988. A broad classification-based acoustic match was used to constrain the search path by looking ahead over a context the size of a phoneme or word; see X. L. Aubert et al., "Fast Look-Ahead Pruning Strategies in Continuous Speech Recognition," Proc. ICASSP, pp. 659-662, Glasgow, Scotland, 1989. A statistical grammar-guided backward pass over the entire sentence was used to generate partial path scores and word candidates for a detailed forward N-best decoding; see S. Austin et al., "The Forward-Backward Search Algorithm," Proc. ICASSP, pp. 697-700, Toronto, Canada, 1991.
In spite of its complicated nature, the speech signal exhibits prominent feature regions of high-energy vowels. The vowels represent syllabic nuclei. Using the vowel centers as anchor points of syllables or demisyllables, the task of continuous speech decoding may be accomplished with reduced complexity. An effort in this direction can be found in work done for continuous speech recognition in the German language; see W. Weigel et al., "Continuous Speech Recognition with Vowel-Context-Independent Hidden Markov Models for Demisyllables," Proc. ICSLP, pp. 701-704, Kobe, Japan, Nov. 1990. The structure of the lexicon representation is also a factor in the speed of decoding: a tree structure, in which words that begin with the same phones share a common path, is a more efficient representation than a linear structure, in which every word is stored and matched separately; see L. R. Bahl et al., "A Fast Approximate Acoustic Match for Large Vocabulary Speech Recognition," Proc. EuroSpeech, pp. 156-158, Paris, France, Sep. 1989.
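By way of illustration only, the saving offered by a tree-structured lexicon over a linear one may be sketched as follows. The sketch is not taken from any of the cited works; the four-word lexicon, the phone symbols, and the names LexNode, build_tree and count_arcs are hypothetical and chosen solely for this example. It counts the phone arcs that an acoustic match would have to evaluate under each representation.

```python
from dataclasses import dataclass, field


@dataclass
class LexNode:
    """One node of a hypothetical phone-prefix tree."""
    children: dict = field(default_factory=dict)  # phone symbol -> LexNode
    words: list = field(default_factory=list)     # words ending at this node


def build_tree(lexicon):
    """Merge the pronunciations into a tree so common prefixes are shared."""
    root = LexNode()
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node.children.setdefault(phone, LexNode())
        node.words.append(word)
    return root


def count_arcs(node):
    """Count the phone arcs below a node (one arc per child link)."""
    return sum(1 + count_arcs(child) for child in node.children.values())


# Toy lexicon with assumed phone transcriptions (illustrative only).
TOY_LEXICON = {
    "speech": ["s", "p", "iy", "ch"],
    "speed": ["s", "p", "iy", "d"],
    "spin": ["s", "p", "ih", "n"],
    "sit": ["s", "ih", "t"],
}

# Linear lexicon: every phone of every word is a separate arc.
linear_arcs = sum(len(phones) for phones in TOY_LEXICON.values())

# Tree lexicon: arcs for shared prefixes are counted once.
tree_arcs = count_arcs(build_tree(TOY_LEXICON))

print(f"linear arcs: {linear_arcs}, tree arcs: {tree_arcs}")
```

In this toy case the shared prefixes "s" and "s p" are matched once rather than once per word, so the tree holds fewer arcs than the linear list; the size of the saving in a real vocabulary depends on how much prefix sharing the pronunciations allow.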
In spite of these research efforts, there still exists a need for a word hypothesis module for use with continuous speech recognition systems which will process complex tasks and large vocabularies without a concomitant increase in computational expense.