There has long been a desire to have machines capable of responding to human speech, such as machines capable of obeying human commands and machines capable of transcribing human dictation. Such machines would greatly increase the speed and ease with which humans communicate with computers and the speed and ease with which they record and organize their own words and thoughts.
Due to recent advances in computer technology and speech recognition algorithms, speech recognition machines have begun to appear in the past several decades, and have become increasingly powerful and less expensive. For example, the assignee of the present application has publicly demonstrated speech recognition software which runs on popular personal computers and which requires little extra hardware except for an inexpensive microphone, an analog-to-digital converter, and a relatively inexpensive microprocessor to perform simple signal processing. This system is capable of providing speaker dependent, discrete word recognition for vocabularies of up to two thousand words at any one time, and many of its features are described in U.S. patent application Ser. No. 797,249, now U.S. Pat. No. 4,783,803, entitled "Speech Recognition Apparatus and Method", which is assigned to the assignee of the present application, and which is incorporated herein by reference.
Most present speech recognition systems operate by matching an acoustic description of words in their vocabulary against an acoustic description of an utterance to be recognized. In many such systems, the acoustic signal generated by the utterance to be recognized is converted by an A/D converter into a digital representation of the successive amplitudes of the audio signal created by the speech. Then that signal is converted into a frequency domain signal which consists of a sequence of frames, each of which gives the amplitude of the speech signal in each of a plurality of frequency bands. Such systems commonly operate by comparing the sequence of frames produced by the utterance to be recognized with a sequence of nodes, or frame models, contained in the acoustic model of each word in their vocabulary.
Originally the performance of such frame matching systems was poor, since the sounds of a given word are rarely, if ever, spoken at exactly the same speed or in exactly the same manner. However, two major techniques have been developed in the prior art which have greatly improved the performance of such systems. The first is probabilistic matching, which determines the likelihood that a given frame of an utterance corresponds to a given node in an acoustic word model. It does this not only as a function of how closely the amplitudes of the frame's individual frequency bands match the expected amplitudes of the given node, but also as a function of how the deviation between the actual and expected amplitudes compares to the expected deviations for such values. Such probabilistic matching provides a much greater ability to deal with the variations which occur in different utterances of the same word, and a much greater ability to deal with the noise commonly present during speech recognition tasks.
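By way of illustration only, probabilistic matching of a single frame against a single node may be sketched as follows. The sketch assumes that each frequency band of a node is modeled by an independent Laplacian distribution having an expected amplitude and an expected deviation, and that the score is a negative log likelihood, so that lower scores indicate better matches. The function name node_score and the numeric values are hypothetical and are not taken from the referenced application.

```python
import math

def node_score(frame, expected_amplitudes, expected_deviations):
    """Return a negative log likelihood for one frame against one node.

    Assumption: each band is an independent Laplacian with the given
    expected amplitude (mu) and expected deviation (sigma > 0).
    Lower scores mean a more likely match.
    """
    score = 0.0
    for amplitude, mu, sigma in zip(frame, expected_amplitudes, expected_deviations):
        # The deviation from the expected amplitude is judged relative to
        # the deviation the node expects for that band.
        score += abs(amplitude - mu) / sigma + math.log(2.0 * sigma)
    return score

# Two hypothetical nodes with the same expected amplitudes but different
# expected deviations in the first band.
tight_node = ([8.0, 4.0], [1.0, 1.0])
loose_node = ([8.0, 4.0], [4.0, 1.0])

exact_frame = [8.0, 4.0]   # matches the expected amplitudes exactly
noisy_frame = [10.0, 4.0]  # first band is off by 2.0

exact_prefers_tight = node_score(exact_frame, *tight_node) < node_score(exact_frame, *loose_node)
noisy_prefers_loose = node_score(noisy_frame, *loose_node) < node_score(noisy_frame, *tight_node)
```

In this sketch the frame that matches exactly favors the node with the smaller expected deviation, while the frame that is off by 2.0 in the first band favors the node that expects larger deviations in that band, which is the behavior the paragraph above describes.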
The second major technique which greatly improves the performance of such frame matching systems is that of dynamic programming. Stated simply, dynamic programming provides a method to find an optimal or near optimal match between the sequence of frames produced by an utterance and the sequence of nodes contained in the model of a word. It does this by effectively expanding and contracting the duration of each node in the acoustic model of a word to compensate for the natural variations in the durations of speech sounds which occur in different utterances of the same word. A more detailed discussion of the application of dynamic programming to speech recognition is contained in the above mentioned application Ser. No. 797,249, and in J. K. Baker's article entitled "Stochastic Modeling for Automatic Speech Recognition" in the book Speech Recognition edited by D. R. Reddy and published by Academic Press, New York, N.Y., in 1975.
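As a minimal sketch of the time alignment just described, and not a reproduction of the method of the referenced application, the following routine finds the lowest total cost of assigning a sequence of frames to a sequence of nodes. It assumes that every frame is assigned to exactly one node, that the nodes are visited in order without skipping, and that any node may absorb one or more consecutive frames. The names align_cost and local_cost and the scalar "frames" are hypothetical simplifications.

```python
def align_cost(frames, nodes, local_cost):
    """Lowest total cost of aligning frames to nodes, in order.

    Assumptions: each frame maps to one node; each node covers at least
    one frame; from one frame to the next the alignment either stays in
    the same node or advances to the next node.
    """
    INF = float("inf")
    if not frames or not nodes:
        return INF
    # prev[j] is the best cost of a path that ends in node j at the prior frame.
    prev = [INF] * len(nodes)
    prev[0] = local_cost(frames[0], nodes[0])
    for frame in frames[1:]:
        cur = [INF] * len(nodes)
        for j, node in enumerate(nodes):
            best_prev = prev[j]                      # stay in node j (stretch it)
            if j > 0 and prev[j - 1] < best_prev:
                best_prev = prev[j - 1]              # advance from node j - 1
            if best_prev < INF:
                cur[j] = best_prev + local_cost(frame, node)
        prev = cur
    return prev[-1]  # the path must end in the final node

def abs_diff(frame, node):
    return abs(frame - node)

word_model = [1.0, 5.0, 2.0]                   # hypothetical three-node word model
fast_utterance = [1.0, 5.0, 2.0]
slow_utterance = [1.0, 1.0, 5.0, 5.0, 5.0, 2.0]
other_model = [5.0, 1.0, 5.0]

fast_cost = align_cost(fast_utterance, word_model, abs_diff)
slow_cost = align_cost(slow_utterance, word_model, abs_diff)
wrong_cost = align_cost(slow_utterance, other_model, abs_diff)
```

Here the fast and slow utterances of the same hypothetical word both align to the word model at zero cost, because the durations of the nodes are stretched to fit, while the alignment against the other model has a positive cost.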
A major problem in speech recognition is that of reducing the tremendous amount of computation it requires, so that recognition can be performed in a reasonable time on relatively inexpensive computer hardware. Since many speech recognition systems operate by comparing a given spoken utterance against each word in their vocabulary, and since each such comparison can require thousands of computer instructions, the amount of computation required to recognize an utterance tends to grow with the size of the vocabulary. Thus the problem of making speech recognition computationally efficient is made even more difficult in systems designed to recognize the large vocabularies necessary to make speech recognition useful for the transcription of normal language.
The prior art has developed a variety of methods for dealing with the excessive computational demands introduced by large vocabulary recognition. One such method is to provide the system with an artificial grammar which limits the vocabulary which the system can recognize at any one time to a subset of the overall vocabulary. As word phrases are recognized, their grammatical classifications in the artificial grammar are determined and used to advance the grammar to another state in which another sub-vocabulary of words can be recognized. Although this technique does an excellent job of reducing the system's computational demands, it prevents users from speaking in a natural manner.
Another prior art technique for making large vocabulary recognition more efficient is that of "pruning". Generally speaking, pruning involves reducing the number of cases which a program considers, by eliminating from further consideration those cases which, for one reason or another, do not appear to warrant further computation. For example, in the system described in the above mentioned application Ser. No. 797,249, the dynamic programming algorithm produces a score for each word in its active vocabulary after each frame of an utterance. This score corresponds to the likelihood that the frames received so far match the given word. After the score for each word in the active vocabulary is updated, it is compared with the best score produced for any word. If the difference between the score for a given word and the best score exceeds a certain threshold, that given word is removed, or pruned, from the active vocabulary and future frames are no longer compared against it. This technique greatly improves the computational efficiency, since it enables poorly scoring words to be removed from consideration long before all of the frames of an utterance have been processed.
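The threshold comparison described above may be sketched as follows. The sketch assumes that scores are negative log likelihoods, so that the best score is the lowest one, and that a word whose score differs from the best by exactly the threshold is retained. The function name prune, the words, and the numeric scores are hypothetical.

```python
def prune(active_scores, threshold):
    """Return the words that survive pruning after one frame.

    active_scores maps each word in the active vocabulary to its current
    score (assumption: lower is better). A word is removed when its score
    exceeds the best score by more than the threshold.
    """
    if not active_scores:
        return {}
    best = min(active_scores.values())
    return {
        word: score
        for word, score in active_scores.items()
        if score - best <= threshold
    }

# Hypothetical scores after some frame of an utterance.
scores = {"cat": 12.0, "cap": 14.5, "dog": 40.0}
survivors = prune(scores, threshold=10.0)
```

With these values "dog" is removed from the active vocabulary, and only "cat" and "cap" would be compared against later frames.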
The system described in the above mentioned application Ser. No. 797,249 further reduces computational demands and the likelihood of confusion by using a language model. Such a language model predicts the relative likelihood of the occurrence of each word in the system's vocabulary, given the word spoken before it. Such language models make use of the fact that in human language the likelihood that a given word will be spoken is greatly influenced by the context of the one or more words which precede it. Language model probabilities are calculated by analyzing a large body of text and determining from it the number of times that each word in the vocabulary is preceded by each other word in the vocabulary.
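The counting procedure described above may be illustrated by the following sketch, which estimates the likelihood of a word given the single word preceding it. The sketch makes simplifying assumptions that are not drawn from the referenced application: the text is already split into words, only one preceding word is used as context, and no smoothing is applied, so unseen word pairs receive a probability of zero. The function names and the sample text are hypothetical.

```python
from collections import Counter, defaultdict

def count_word_pairs(words):
    """Count how many times each word is preceded by each other word."""
    counts = defaultdict(Counter)
    for previous, word in zip(words, words[1:]):
        counts[previous][word] += 1
    return counts

def word_probability(counts, previous, word):
    """Relative frequency of `word` following `previous` (no smoothing)."""
    followers = counts.get(previous)
    if not followers:
        return 0.0  # `previous` was never seen as a preceding word
    return followers[word] / sum(followers.values())

# A tiny hypothetical body of text; a real system would analyze far more.
text = "the cat sat on the mat the cat ran".split()
pair_counts = count_word_pairs(text)

p_cat_after_the = word_probability(pair_counts, "the", "cat")
p_mat_after_the = word_probability(pair_counts, "the", "mat")
p_dog_after_the = word_probability(pair_counts, "the", "dog")
```

In this sample "the" is followed by "cat" twice and by "mat" once, so the sketch rates "cat" as twice as likely as "mat" after "the".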
The system described in the above mentioned application Ser. No. 797,249 further reduces computation by prefiltering its vocabulary words. This prefiltering runs a superficial recognition against a vocabulary to quickly select those of its words which appear similar enough to the utterance to be recognized to warrant a more detailed comparison with that utterance.
Although these and other previously developed methods greatly reduce the computation required for speech recognition, there is still a need to further reduce such computation if present day personal computers are to be capable of recognizing large vocabularies, such as vocabularies of twenty thousand words or more, without the addition of expensive computational hardware.
Another problem encountered with prior art speech recognition systems, particularly those attempting to deal with relatively large vocabularies, is that recognition performance is far from foolproof. For this reason, it is desirable to create methods by which an operator can indicate to the system whether or not its attempted recognition is correct, and if not, by which he can correct the mistake as easily as possible. The above mentioned U.S. patent application Ser. No. 797,249 discloses means for displaying a list of a recognition's best scoring word candidates, in order of their score, and means for enabling the operator to select any of the displayed words by typing a number associated with it, or to select the best scoring word by speaking another word to be recognized. Although this system works well in the real time dictation of discrete words, it does not address the issue of correcting errors in previously dictated speech or in continuous speech. Also, when this system is used with large vocabularies, the amount of time required before the system displays any words for the operator to choose from increases with the size of the recognition vocabulary, and in large vocabulary systems can be annoyingly slow.