A speech recognition system compares input speech parameters to word models in the form of state sequences. That is, each word in the system vocabulary is modeled as a sequence of connected states in which the states, the connections, or both are characterized by probability distributions of the speech parameters. During the recognition search, multiple recognition hypotheses are maintained, each hypothesis being predicated on: 1) the input speech having arrived in a given state of a given word model, and 2) a given sequence of words having been spoken before that word. For the speech recognition system to operate at an acceptable speed, the number of active recognition hypotheses must be limited.
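The bookkeeping described above can be sketched in Python. The class, word names, and scores below are hypothetical illustrations, not taken from any particular system: each active hypothesis pairs a position inside a word model with an assumed word history, and the search is kept tractable by retaining only a bounded number of the best-scoring hypotheses.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    word: str        # word model currently being traversed
    state: int       # state index within that word model's state sequence
    history: tuple   # words assumed to have been spoken before this word
    log_prob: float  # accumulated log-probability score

def prune(hyps, max_active):
    """Bound the search by keeping only the best-scoring hypotheses."""
    return sorted(hyps, key=lambda h: h.log_prob, reverse=True)[:max_active]

active = [
    Hypothesis("seven", 2, ("call",), -12.0),
    Hypothesis("eleven", 1, ("call",), -15.5),
    Hypothesis("seven", 3, ("dial",), -20.1),
]
active = prune(active, max_active=2)
```

Capping the list this way trades a small risk of discarding the correct hypothesis for a large reduction in per-frame work.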
Forward-backward search is a well-known technique for efficient speech recognition. A discussion of this subject matter appears in Chapter 12 of Deller, Proakis & Hansen, Discrete-Time Processing of Speech Signals (Prentice Hall, 1987), which is incorporated herein by reference. Forward-backward search employs a two-level approach to search a vast space of possible word sequences in order to assess which word sequence is most likely to have been spoken. In the forward search pass, relatively simple models are used to create a first set of recognition hypotheses for words that could have been spoken, together with their associated occurrence probabilities. The backward search pass, which runs in the reverse direction, uses more complex models that require greater computational resources. The number of possible word sequences considered by the backward search pass is limited by starting from the set of recognition hypotheses produced by the forward search pass. A forward-backward search algorithm, as described in the prior art, performs a forward search on an input utterance until the utterance ends and then searches backwards from the end to the beginning of the utterance. This leads to a system in which the recognized words are presented only after the complete utterance has ended.
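The two-pass idea can be illustrated with a toy sketch; this is an illustration of the general principle, not the cited algorithm, and the scoring functions and frame values are invented for the example. A cheap forward pass keeps a short candidate list per frame, and the backward pass applies an expensive score only to those survivors, walking from the last frame back to the first after the utterance ends.

```python
def forward_pass(frames, vocab, cheap_score, beam):
    """Per frame, keep the `beam` best words under the cheap model."""
    return [sorted(vocab, key=lambda w: cheap_score(w, f))[-beam:]
            for f in frames]

def backward_pass(frames, survivors, detailed_score):
    """Rescore only the forward-pass survivors, from last frame to first."""
    out = []
    for f, cands in zip(reversed(frames), reversed(survivors)):
        out.append(max(cands, key=lambda w: detailed_score(w, f)))
    out.reverse()
    return out

# Hypothetical stand-in scores: higher is better.
cheap = lambda w, f: -abs(len(w) - f)
detailed = lambda w, f: -abs(len(w) - f) - (0 if w[0] in "ft" else 1)

frames = [4, 5]                            # stand-ins for acoustic frames
vocab = ["four", "five", "three", "seven"]
survivors = forward_pass(frames, vocab, cheap, beam=2)
words = backward_pass(frames, survivors, detailed)
```

Note that `backward_pass` cannot start until `frames` is complete, which mirrors the limitation stated above: results are available only after the end of the utterance.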
One approach utilizing a forward-backward search, described by Schwartz et al. in U.S. Pat. No. 5,241,619, which is hereby incorporated herein by reference, uses a forward search employing a relatively simple algorithm, followed by a backward search that performs a more complex word-dependent n-best search. For a given state in a given word, Schwartz requires that separate recognition hypotheses be maintained for different possible word histories. These recognition hypotheses form a monolithic set, which is limited to a certain maximum number of entries. When the best recognition hypothesis in the set has a probability score that falls outside a given offset from the probability score of the overall best recognition hypothesis for that speech frame, all of the recognition hypotheses in the set are removed in a single operation.
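The pruning behavior attributed to Schwartz above can be sketched as follows; the function, data layout, and threshold values are hypothetical illustrations of the described mechanism, not code from the patent. Hypotheses sharing a (word, state) pair form one set, capped at a maximum size, and the whole set is discarded at once when its best score falls outside the offset from the frame's overall best score.

```python
def prune_sets(sets, max_per_set, offset):
    """sets: dict mapping (word, state) -> list of (history, log_prob)."""
    global_best = max(p for hyps in sets.values() for _, p in hyps)
    kept = {}
    for key, hyps in sets.items():
        # Cap the set at its maximum number of entries.
        hyps = sorted(hyps, key=lambda h: h[1], reverse=True)[:max_per_set]
        if hyps[0][1] >= global_best - offset:   # best of set inside beam?
            kept[key] = hyps
        # else: the entire set is removed in a single operation
    return kept

sets = {
    ("seven", 2): [(("call",), -10.0), (("dial",), -11.0), (("ring",), -13.0)],
    ("eleven", 1): [(("call",), -25.0)],
}
kept = prune_sets(sets, max_per_set=2, offset=8.0)
```

Because the test is applied to the set as a whole rather than to individual hypotheses, a single comparison can discard many hypotheses at once.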
Thus, Schwartz describes a system with a two-level state organization, with super-states that contain substates for different previous words. Separate mechanisms limit the number of super-states and the number of substates per super-state. The complexity of this state structure requires considerable computational time and resources.