A. Field of the Invention
The present invention relates to a speech recognition system and method. More particularly, the present invention relates to a speech recognition system that uses a search and rescoring method to recognize the speech.
B. Description of the Related Art
Speech recognition systems typically use a concatenation of phonemes to model the vocabulary of words spoken by a user. Phonemes are the basic units representing the sounds that make up any given word, and their realization depends upon the context in which they arise in the word. Allophones are context-dependent phonemes and are often represented by Hidden Markov Models (HMMs), each comprising a sequence of states, each state having a transition probability. Thus, any word can be represented as a chain of concatenated HMMs, enabling speech to be modelled as a random walk through the HMMs for a word.
To recognize an unknown utterance spoken by the user, the system must then compute the most likely sequence of states through the HMM. The well-known Viterbi method can be used to evaluate the most likely path by opening the HMM up into a trellis. The trellis has the same number of states as there are in the allophone model, and, thus, the total number of operations per frame for each trellis is proportional to the total number of transitions in the corresponding allophone model.
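For illustration only, the trellis evaluation described above can be sketched as follows. This is a minimal, generic Viterbi sketch and not the implementation of any cited system; the function name `viterbi` and the parameters `log_trans`, `log_emit`, and `log_init` are assumptions introduced for this example. Scores are kept as log probabilities, and the inner loops visit every transition once per frame, which is the per-frame cost noted above.

```python
def viterbi(log_trans, log_emit, log_init):
    """Return (best_log_score, best_state_path) for one HMM.

    log_trans[i][j]: log probability of the transition from state i to state j
    log_emit[t][j]:  log likelihood of frame t given state j
    log_init[j]:     log probability of starting in state j
    (All names are hypothetical; they do not come from the source text.)
    """
    n_states = len(log_init)
    n_frames = len(log_emit)

    # Trellis column for the first frame.
    score = [log_init[j] + log_emit[0][j] for j in range(n_states)]
    backptr = []  # backptr[t-1][j] = best predecessor of state j at frame t

    for t in range(1, n_frames):
        new_score = []
        ptr = []
        for j in range(n_states):
            # Evaluate every transition into state j and keep the best one.
            best_i = max(range(n_states),
                         key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j]
                             + log_emit[t][j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)

    # Trace back from the best final state to recover the state sequence.
    last = max(range(n_states), key=lambda j: score[j])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return score[last], path
```

Because the sketch scans all state pairs at every frame, its cost grows with the number of transitions in the model, which illustrates why applying the method to an entire vocabulary network becomes expensive.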
The Viterbi method, however, has two main problems. First, the method is computationally complex because it evaluates every transition at every node of the entire vocabulary network. For speech recognition systems having a medium to large vocabulary, this computation can be very burdensome and greatly increases the cost of the computer hardware. Second, because of its complexity, the Viterbi method permits the computation of only a single recognition result. This precludes the use of more accurate post-processing to refine the recognition result.
While other approaches have been developed which provide more than one choice for the recognition result, they have problems as well. For example, the stack decoding disclosed by P. Kenny et al., "A* Admissible Heuristics for Rapid Lexical Access," Proceedings ICASSP, p. 689-92 (1991), provides alternative choices, but is ineffective when the heuristic partial path likelihoods are inaccurate. The approach disclosed by H. Ney et al., "Data Driven Organization of the Dynamic Programming Beam Search for Continuous Speech Recognition," IEEE Transactions on Signal Processing, Vol. SP-40, No. 2, p. 272-81, February 1992, provides an efficient method to limit the search space. However, this method suffers when the speech contains noise of the kind common in telephone applications.
Also known are methods which can be used for continuous word recognition. See R. Schwartz et al., "The N-Best Algorithm: An Efficient and Exact Procedure for Finding the N Most Likely Sentence Hypotheses," IEEE ICASSP-90, p. 81-84, Albuquerque, April 1990; V. Steinbiss, "Sentence-Hypothesis Generation in a Continuous-Speech Recognition System," Proc. EuroSpeech-89, Vol. 2, p. 51-54, Paris, September 1989; L. Nguyen et al., "Search Algorithm for Software-Only Real-Time Recognition with Very Large Vocabularies," Proceedings of ARPA Human Language Technology Workshop, p. 91-95, Plainsboro, N.J., March 1993. These methods, however, are not useful in recognizing strings of unrelated words, such as strings describing a location, a company name or a person's name.
U.S. Pat. No. 5,195,167 (the '167 patent) describes a fast-match search which reduces the number of computations performed by the Viterbi method. The '167 patent teaches replacing each HMM transition probability at a certain time with the maximal value over the associated allophone. U.S. Pat. No. 5,515,475 discloses a two-pass search method that builds upon the idea of the '167 patent. The first pass identifies the N most likely candidates for the spoken word, that is, the N most likely hypotheses, using a one-state allophone model having a two-frame minimum duration. The second pass then decodes the N hypotheses using the Viterbi algorithm. The memory and processing time requirements of this approach, however, are unsatisfactory for some applications. Therefore, there is a demand for a speech recognition system which can operate at a high speed without requiring an excessive amount of memory or a high-speed processor.
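For illustration only, the general shape of a two-pass search of the kind described above can be sketched as follows. This is a simplified sketch and not the method of either cited patent: `fast_score` and `exact_score` are hypothetical callables standing in for a reduced first-pass model and a full Viterbi decode, respectively, and the function name `two_pass_recognize` is likewise an assumption.

```python
import heapq


def two_pass_recognize(vocabulary, fast_score, exact_score, n_best):
    """Pick one word using a cheap first pass and an exact second pass.

    vocabulary:  iterable of candidate words
    fast_score:  hypothetical cheap scorer (higher is better)
    exact_score: hypothetical expensive scorer (higher is better)
    n_best:      number of hypotheses kept after the first pass
    """
    # First pass: score every word cheaply and keep the N most likely.
    hypotheses = heapq.nlargest(n_best, vocabulary, key=fast_score)
    # Second pass: rescore only the N surviving hypotheses exactly.
    return max(hypotheses, key=exact_score)
```

In this sketch the expensive scorer runs on at most `n_best` entries rather than on the whole vocabulary, which is the source of the computational saving; a correct word that the first pass fails to rank among the N best cannot be recovered by the second pass.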