Speech recognition systems are known which permit a user to interface with a computer system using spoken language. A speech recognition system receives spoken input from the user, interprets the input, and then translates the input into a form that the computer system understands. More particularly, spoken input in the form of an analog waveform is digitally sampled. The digital samples are then processed by the speech recognition system according to a speech recognition algorithm. Speech recognition systems typically recognize and identify words or utterances of the spoken input by comparison to previously obtained templates of words or utterances or by comparison to a previously obtained acoustic model of a person who is speaking. The templates and acoustic model are typically generated based upon samples of speech.
An example of a known speech recognition technique is word-level template matching. During word-level template matching, the spoken input is compared to pre-stored templates which represent various words. A template which most closely matches the spoken input is selected as the output. Another example of a known speech recognition technique is acoustic-phonetic recognition. According to acoustic-phonetic recognition, the spoken input is segmented and identified according to basic units of speech sound (phonemes). The results of segmentation and identification are then compared to a pre-stored vocabulary of words. The word or words which most closely match the spoken input are selected as the output.
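By way of illustration only, word-level template matching can be sketched as follows. The sketch assumes that each word is represented by a short sequence of one-dimensional feature values and that closeness is measured by dynamic time warping; neither assumption is taken from the techniques described above, and the names `dtw_distance`, `match_template` and `TEMPLATES` are hypothetical.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences.

    Assumption: DTW is used here only as one plausible way to compare a
    spoken input with a template of different length.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def match_template(spoken, templates):
    """Return the word whose pre-stored template most closely matches."""
    return min(templates, key=lambda word: dtw_distance(spoken, templates[word]))


# Hypothetical pre-stored templates (toy feature values, not real speech data).
TEMPLATES = {
    "yes": [1.0, 3.0, 4.0, 3.0, 1.0],
    "no": [4.0, 2.0, 1.0, 1.0, 2.0],
}
```

In this sketch, a slower utterance of the first template, such as `[1.0, 1.0, 3.0, 4.0, 4.0, 3.0, 1.0]`, is still selected as "yes" because the warping absorbs the difference in duration.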
Yet another example of a known speech recognition technique is stochastic speech recognition. According to stochastic speech recognition, the spoken input is converted into a series of parameter values which are compared to pre-stored models. For example, the pre-stored models can be Hidden Markov Models (HMMs) that use Gaussian Mixture Models (GMMs) to model short-term acoustic observation probabilities. The GMMs and HMMs are obtained for phonemes by taking samples of spoken words or sentences and then representing the speech as parameter values which take into account statistical variation between different samples of the same phoneme. Probabilistic analysis is utilized to obtain a best match for the spoken input. Known algorithms for probabilistic analysis are the Baum-Welch maximum likelihood algorithm and the Viterbi algorithm.
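For purposes of illustration, the Viterbi algorithm mentioned above can be sketched as follows. The sketch assumes a small HMM with discrete observation symbols in place of GMM-based observation probabilities, and it assumes that all probabilities are non-zero so that logarithms can be taken; the state names, symbols and probability values are hypothetical.

```python
from math import log


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations.

    Assumption: discrete emissions with strictly positive probabilities.
    Log probabilities are summed to avoid numerical underflow.
    """
    # Each column maps state -> (best log probability, predecessor state).
    first = observations[0]
    columns = [{s: (log(start_p[s]) + log(emit_p[s][first]), None) for s in states}]
    for obs in observations[1:]:
        prev = columns[-1]
        column = {}
        for s in states:
            # Best predecessor for state s at this time step.
            score, pred = max((prev[p][0] + log(trans_p[p][s]), p) for p in states)
            column[s] = (score + log(emit_p[s][obs]), pred)
        columns.append(column)

    # Trace back from the best final state.
    state = max(states, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


# Hypothetical two-phoneme model with toy probabilities.
STATES = ("ph_a", "ph_b")
START_P = {"ph_a": 0.8, "ph_b": 0.2}
TRANS_P = {"ph_a": {"ph_a": 0.6, "ph_b": 0.4},
           "ph_b": {"ph_a": 0.1, "ph_b": 0.9}}
EMIT_P = {"ph_a": {"low": 0.9, "high": 0.1},
          "ph_b": {"low": 0.2, "high": 0.8}}
```

With these toy values, the observation sequence `["low", "low", "high", "high"]` is decoded as `["ph_a", "ph_a", "ph_b", "ph_b"]`.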
A typical characteristic of such known speech recognition systems is contention between processing time and recognition accuracy. Thus, a speech recognition system which is pre-configured for an acceptable level of accuracy is often accompanied by unacceptable delay or processing power requirements to recognize speech, whereas a speech recognition system which is pre-configured for an acceptable speed of recognition often exhibits unacceptable error levels.
A contemplated solution to this contention between recognition speed and accuracy has been two-pass speech recognition. A two-pass speech recognition system processes spoken input according to two speech recognition algorithms in succession. FIG. 1 illustrates a flow diagram for a two-pass speech recognition system according to the prior art. Program flow begins in a start state 100. Then program flow moves to a state 102 where spoken input is received. During a first pass in a state 104, spoken input is processed according to a high speed, but relatively low accuracy, speech recognition technique. This first pass produces several alternative matches for the spoken input. During a second pass in a state 106, a low speed, but relatively high accuracy, speech recognition technique is utilized to select one of the alternatives produced by the first pass. The results are outputted in a state 108 and, then, program flow terminates in a state 110. Because the second pass performed in the state 106 operates on a limited number of alternatives, the second pass was not expected to unduly delay or require undue processing power to perform the speech recognition process. In practice, however, for a given accuracy, the total processing time required by such two-pass systems tends to be longer than desired.
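The two-pass flow described above can be sketched, by way of example only, as a fast scoring step that shortlists alternatives followed by a slower scoring step applied to the shortlist alone. The scoring functions `coarse_score` and `detailed_score`, the vocabulary and the feature values are hypothetical stand-ins and do not represent any particular prior art system.

```python
def two_pass_recognize(spoken, templates, fast_score, slow_score, n_best=2):
    """Select a word using a fast first pass and a slower second pass.

    Lower scores mean a closer match for both scoring functions.
    """
    # First pass: rank the whole vocabulary with the cheap scorer and
    # keep only the n_best alternatives.
    ranked = sorted(templates, key=lambda w: fast_score(spoken, templates[w]))
    shortlist = ranked[:n_best]
    # Second pass: apply the expensive scorer to the shortlist only.
    return min(shortlist, key=lambda w: slow_score(spoken, templates[w]))


def coarse_score(a, b):
    """Hypothetical fast, low-accuracy scorer: compares overall energy."""
    return abs(sum(a) - sum(b))


def detailed_score(a, b):
    """Hypothetical slow, high-accuracy scorer: compares frame by frame."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical vocabulary of equal-length toy feature sequences.
VOCABULARY = {
    "yes": [1, 3, 4, 3, 1],
    "no": [4, 2, 1, 1, 2],
    "go": [3, 4, 3, 1, 1],
    "stop": [0, 0, 1, 0, 0],
}
```

For the toy input `[3, 4, 3, 1, 2]`, the coarse scorer cannot distinguish "yes" from "go", so both survive the first pass; the detailed scorer then selects "go". The sketch also shows where the total cost arises: the first pass still scores every entry in the vocabulary before the second pass begins.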
Similarly, U.S. Pat. No. 5,515,475, issued to Gupta et al., describes a two-pass speech recognition method in which a first pass is performed and, then, a second pass is performed. For a given accuracy, the total processing time required by the two passes also tends to be longer than desired.
Therefore, what is needed is a technique for increasing recognition speed while maintaining a high degree of recognition accuracy in a speech recognition system.