There has long been a desire to have machines capable of responding to human speech, such as machines capable of obeying human commands and machines capable of transcribing human dictation. Such machines would greatly increase the speed and ease with which humans could communicate with computers, and greatly speed and ease the ability with which humans could record and organize their own words and thoughts.
Due to recent advances in computer technology, as well as recent advances in the development of algorithms for the recognition of speech, speech recognition machines have begun to appear in the past several decades, and have become steadily more powerful and less expensive. For example, the assignee of the present application has previously marketed speech recognition software which runs on popular personal computers and which requires little extra hardware except an inexpensive microphone, an analog-to-digital converter, and a relatively inexpensive microprocessor to perform simple signal processing. This system is capable of providing speaker-dependent, discrete-word recognition for vocabularies of up to sixty-four words at any one time.
Most present speech recognition systems operate by matching an acoustic description of words in their vocabulary against a representation of the acoustic signal generated by the utterance of the word to be recognized. In many such systems, the acoustic signal generated by the speaking of the word to be recognized is converted by an A/D converter into a digital representation of the successive amplitudes of the audio signal created by the speech. Then that signal is converted into a frequency domain signal which consists of a sequence of frames, each of which gives the amplitude of the speech signal in each of a plurality of frequency bands. Such systems commonly operate by comparing the sequence of frames produced by the utterance to be recognized with a sequence of nodes, or frame models, contained in the acoustic model of each word in their vocabulary.
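The framing step described above can be sketched as follows. This is a minimal illustration, not the computation of any particular system: a hypothetical `frames_from_samples` routine chops the digitized signal into short frames and reduces each frame to the amplitude in a handful of frequency bands (real systems use an FFT, overlapping windows, and many more bands).

```python
import math

def frames_from_samples(samples, frame_size=8, bands=4):
    """Split a digitized signal into frames and reduce each frame to
    per-band amplitudes. A toy stand-in for the spectral frames the
    text describes; frame_size and bands are illustrative."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        window = samples[start:start + frame_size]
        amps = []
        for b in range(bands):
            # crude DFT magnitude at one frequency bin per band
            re = sum(s * math.cos(2 * math.pi * b * n / frame_size)
                     for n, s in enumerate(window))
            im = sum(-s * math.sin(2 * math.pi * b * n / frame_size)
                     for n, s in enumerate(window))
            amps.append(math.hypot(re, im))
        frames.append(amps)
    return frames
```

For a constant (DC) signal, all of the energy appears in the lowest band, which illustrates how the per-band amplitudes characterize the spectral shape of each frame.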
Originally the performance of such frame matching systems was relatively poor, since the individual sounds which make up a given word are seldom, if ever, spoken at exactly the same relative rate or in exactly the same manner in any two utterances of that word. However, two major techniques have been developed in the prior art which have greatly improved the performance of such systems. The first is probabilistic matching, which determines the likelihood that a given frame of an utterance corresponds to a given node in an acoustic word model. It determines this likelihood not only as a function of how closely the amplitudes of the individual frequency bands of the frame match the expected amplitudes contained in the given node, but also as a function of how the deviation between the actual and expected amplitudes compares to the expected deviations for such values. Such probabilistic matching gives a recognition system a much greater ability to deal with the variations in speech sound which occur in different utterances of the same word, and a much greater ability to deal with the noise which is commonly present during speech recognition tasks.
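One common way to realize such a probabilistic match, offered here as an illustrative sketch rather than the exact computation of any system described above, is a per-band Gaussian log-likelihood, in which each node stores an expected amplitude and an expected deviation for every frequency band (the `node_means` and `node_devs` parameter names are hypothetical).

```python
import math

def frame_node_log_likelihood(frame, node_means, node_devs):
    """Score how well a frame's per-band amplitudes match a node,
    weighting each band's error by that band's expected deviation
    (a Gaussian log-likelihood per band; higher is a better match)."""
    ll = 0.0
    for amp, mean, dev in zip(frame, node_means, node_devs):
        ll += -math.log(dev * math.sqrt(2 * math.pi)) \
              - ((amp - mean) ** 2) / (2 * dev ** 2)
    return ll
```

Because each band's error is divided by its expected deviation, a large error in a band whose amplitude normally varies widely is penalized less than the same error in a normally stable band, which is the key property the passage attributes to probabilistic matching.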
The second major technique which greatly improves the performance of such frame matching systems is that of dynamic programming. Stated simply, dynamic programming provides a method to find an optimal or near optimal match between the sequence of frames produced by an utterance and the sequence of nodes contained in the model of a word. It does this by effectively expanding and contracting the duration of each node in the acoustic model of a word to compensate for the natural variations in the durations of speech sounds which occur in different utterances of the same word. A more detailed discussion of the application of dynamic programming to speech recognition is available in J. K. Baker's article entitled "Stochastic Modeling for Automatic Speech Recognition" in the book Speech Recognition edited by D. R. Reddy and published by Academic Press, New York, N.Y., in 1975.
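The expansion and contraction of node durations that dynamic programming performs can be sketched with a standard dynamic-time-warping recurrence. This is a generic illustration, assuming a user-supplied `local_cost` function (lower cost meaning a better frame-to-node match), not the specific algorithm of any system mentioned above.

```python
def dtw_cost(frames, nodes, local_cost):
    """Dynamic-programming alignment cost between an utterance's frames
    and a word model's nodes; each node may absorb several successive
    frames (stretching) or be passed over quickly (contracting)."""
    INF = float("inf")
    n, m = len(frames), len(nodes)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = local_cost(frames[i - 1], nodes[j - 1])
            D[i][j] = c + min(D[i - 1][j],      # node stretched over another frame
                              D[i - 1][j - 1],  # advance to the next node
                              D[i][j - 1])      # node contracted
    return D[n][m]
```

For instance, an utterance in which the first speech sound is held twice as long as in the model can still align with zero cost, since the first node simply absorbs the extra frame.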
The performance of present speech recognition systems is impressive when compared to the similar systems of a short time ago. For example, the above mentioned system of the assignee of the present application, which runs on popular personal computers, is equal in performance to systems costing several tens of thousands of dollars only five years ago. Nevertheless, there is a great need to improve further the performance of speech recognition systems before they will find the large scale use of which they are ultimately capable. In particular there is a need to provide systems capable of recognizing words from much larger vocabularies than those which can be reliably handled by most present systems.
Unfortunately the performance of present speech recognition systems tends to deteriorate considerably as the size of the vocabulary which they are capable of recognizing at any one time increases. This results from two major factors. First, many speech recognition systems operate by comparing a given spoken utterance against each word in their vocabularies. Since each such comparison can require thousands of computer instructions, the amount of computation required to recognize an utterance grows with the size of the vocabulary. This increase in computation has been a major problem in the development of large vocabulary systems.
The second major problem in the development of large vocabulary systems is caused by the fact that, as a vocabulary grows, the number of words that are similar in sound also tends to grow. As a result, there is an increased likelihood that an utterance of a given word from the vocabulary will be misrecognized as corresponding to another similar sounding word from the vocabulary.
The prior art has developed a variety of methods for dealing with the excessive computational demands and the increased likelihood of word confusion introduced by large vocabulary recognition. One such method is that of "pruning", a common computer technique used to reduce computation. Generally speaking, pruning involves reducing the number of cases which a program considers, in part or in full, by eliminating from further consideration those cases which, for one reason or another, do not appear to warrant further computation.
For example, in the above mentioned limited vocabulary recognition system of the assignee of the present application, the dynamic programming algorithm produces a score for each word in its active vocabulary after each frame of an utterance. This score corresponds to the likelihood that the frames received so far match the given word. After the score for each word in the active vocabulary is updated, it is compared with the best score produced for any word. If the difference between the score for a given word and the best score exceeds a certain threshold, that given word is removed, or pruned, from the active vocabulary and future frames are no longer compared against it. This technique greatly improves the computational efficiency, since it enables poorly scoring words to be removed from consideration long before all of the frames of an utterance have been processed. As powerful as this technique is, however, there nevertheless is a need to find even more efficient techniques to enable speech recognition systems to operate with large vocabularies.
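That pruning scheme can be sketched as follows, assuming a hypothetical `frame_score` function that returns a per-frame penalty (lower is better) and an illustrative `beam` threshold; after every frame, words whose accumulated score falls too far behind the best score are dropped from the active vocabulary.

```python
def recognize_with_pruning(frames, word_models, frame_score, beam=6.0):
    """Pruning sketch: after each frame, remove from the active
    vocabulary any word whose running score trails the best score
    by more than `beam`. frame_score(frame, model, t) is assumed
    to return a per-frame penalty (lower = better match)."""
    active = {word: 0.0 for word in word_models}
    for t, frame in enumerate(frames):
        for word in list(active):
            active[word] += frame_score(frame, word_models[word], t)
        best = min(active.values())
        # prune words scoring worse than the threshold below the best
        active = {w: s for w, s in active.items() if s - best <= beam}
    return min(active, key=active.get)
```

Note that the best-scoring word always survives its own comparison, so the active vocabulary can never be emptied; poorly scoring words, however, drop out long before the last frame is processed.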
A technique used to reduce both the computational demands and the likelihood of confusion in large vocabulary systems is that of using a language model. Such language models predict the relative likelihood of the occurrence of each word in the system's vocabulary, given the one or more words which have been spoken previously. Such language models make use of the fact that in human language the likelihood that a given word will be spoken is greatly influenced by the context of the one or more words which precede it. For example, speech recognition systems have been developed which use digram language models. Digrams give, for each word in the vocabulary, the likelihood of the occurrence of that word given the immediately preceding occurrence of any other word in the vocabulary. The value for each digram is usually determined by analyzing a large body of text and determining from that text the number of times that each word in the vocabulary is preceded by each other word in the vocabulary. Some language model systems have even used trigrams, which give for each word in the vocabulary the probability of its occurrence given any combination of two preceding words of the vocabulary.
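Training a digram model from a body of text, as described above, can be sketched as follows. This is a minimal maximum-likelihood estimate with no smoothing; real systems must also handle word pairs which never occur in the training text.

```python
from collections import Counter

def train_digrams(corpus_words):
    """Estimate digram probabilities P(word | previous word) by
    counting adjacent word pairs in a body of text, as the passage
    describes. Returns a lookup function."""
    pair_counts = Counter(zip(corpus_words, corpus_words[1:]))
    prev_counts = Counter(corpus_words[:-1])
    def digram(prev, word):
        if prev_counts[prev] == 0:
            return 0.0   # previous word never seen in a pair
        return pair_counts[(prev, word)] / prev_counts[prev]
    return digram
```

For example, if "the" occurs twice in the training text and is followed once by "cat", the digram probability of "cat" given "the" is one half.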
It has been shown that such digram and trigram language models provide a valuable source of probabilistic information as to the identity of a particular utterance given its context of previous utterances. In one prior art system the probabilistic scores produced by such digram and trigram language models are combined with probabilistic word scores based on the phonemes, or basic speech sounds, which the system has detected by matching phoneme models against the signal produced by utterances. Then words with combined scores better than a certain threshold have their corresponding word model matched against the acoustic frames of the utterance.
Another technique which has been used to reduce the massive computation required in large vocabulary word recognition is that of hypothesis and test, which is, in effect, a type of pruning. Under hypothesis and test, the acoustic signal is observed for the occurrence of certain features or patterns, such as the identity of apparent phonemes, as is described in the paragraph above. When such given features or patterns are observed, they are used to form a hypothesis that the word actually spoken corresponds to a subset of words from the original vocabulary which commonly have the observed features. Then the speech recognition system proceeds by performing a more lengthy match of each word in this sub-vocabulary against the acoustic signal.
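The hypothesis and test approach can be sketched as follows, assuming a hypothetical representation in which each vocabulary word is associated with a set of coarse features (such as apparent phonemes) and a separate `detailed_score` function performs the lengthy acoustic match (lower score meaning a better match).

```python
def hypothesize_and_test(observed_features, vocabulary, detailed_score):
    """Hypothesis-and-test sketch: quickly select the sub-vocabulary
    of words exhibiting the features observed in the signal, then run
    the expensive detailed match only over that subset. `vocabulary`
    maps each word to its set of coarse features (a hypothetical
    structure chosen for illustration)."""
    hypotheses = [w for w, feats in vocabulary.items()
                  if observed_features <= feats]   # subset test
    if not hypotheses:
        hypotheses = list(vocabulary)   # fall back to the full vocabulary
    return min(hypotheses, key=detailed_score)
```

The saving comes from the first, cheap pass: the lengthy match is performed only on the handful of words consistent with the observed features, rather than on the entire vocabulary.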
Another important approach toward dealing with the large computational demands of speech recognition, in both large and small vocabulary systems, is the development of special purpose hardware to increase greatly the speed of such processing. In the past, for example, special-purpose processors have been made to perform the probabilistic frame matching, or likelihood, computation described above. Such special purpose likelihood processors have greatly increased the performance of frame matching systems, but most such special purpose processors developed in the past have been quite complicated because of their inclusion of hardware multiplication circuitry to perform likelihood computations at high speeds.
In addition to the problems of increased computational demand and increased chance of word confusion, there are a host of other problems which have been encountered in the prior art of speech recognition. One such problem is that of determining exactly when an utterance has begun. This problem is made more difficult by the fact that in most speaking environments there is background noise. Thus the system must seek to distinguish between background noise and the commencement of speech sounds. One prior art method for attempting to make this distinction is used in the recognition system presently being sold by the assignee of the present application. This system uses a statistically derived acoustic model of background noise. When it hears sounds corresponding to this model it assumes speech has not yet begun. Although the performance of this system is good compared to many other systems, its performance in the presence of changing background sounds still leaves considerable room for improvement. A somewhat similar problem relates to the fact that even if the system has a good method for distinguishing between normal background sounds and speech sounds, it is nevertheless likely to occasionally be falsely triggered into interpreting non-speech sounds as speech. For example, humans often make sounds immediately preceding intended speech by smacking their lips. In addition, brief background sounds can be similar both in volume and spectral characteristics to speech sounds.
One method of dealing with this problem has been presented in the above mentioned system previously marketed by the assignee of the present application. That method involves producing a startscore representing the probability that the actual utterance of a word has not yet begun. When the system detects what it considers to be the start of an utterance, it starts a dynamic programming process. This process matches each successive frame of acoustic data against each word in its active vocabulary. In doing so, it removes from the active vocabulary, after each frame, those words with match scores worse than a certain threshold. If, after starting such a process, the startscore becomes better than a certain threshold, indicating that the previous sounds were a false alarm and that the utterance of a word has not yet begun, the active vocabulary is reinitialized to contain all the words in the system's vocabulary, and the system recommences dynamic programming against the initial nodes of all such words. Although this method improves performance, it is designed for use on a limited vocabulary system in which it is easy to perform dynamic programming against the initial nodes of the system's entire vocabulary. Unfortunately, such a method would not be practical in a large vocabulary system.
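The restart mechanism described above can be sketched, in greatly simplified form, as follows; the `silence_model` scoring function and the single-threshold logic are hypothetical stand-ins for the statistically derived background model and startscore computation of an actual system.

```python
def detect_with_restart(frames, silence_model, vocabulary, threshold=0.5):
    """Count false-alarm restarts over a sequence of frames.

    silence_model(frame) is assumed to return a score that is high
    when the frame resembles background noise (the higher the score,
    the more likely the utterance has not yet begun)."""
    active = None        # None = utterance not believed to have begun
    restarts = 0
    for frame in frames:
        startscore = silence_model(frame)
        if active is None:
            if startscore < threshold:     # speech appears to begin
                active = set(vocabulary)   # start matching all words
        elif startscore >= threshold:      # false alarm detected
            restarts += 1
            active = None   # simplified: reinitialize and await speech again
    return restarts
```

In the toy run below, a one-frame noise blip followed by a return to silence triggers exactly one restart, after which the genuine utterance proceeds without another reinitialization.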
Another problem encountered with prior art speech recognition systems, particularly those attempting to deal with relatively large vocabularies, is that recognition performance is far from perfect. For this reason, it is desirable to create methods by which an operator can indicate to the system whether or not its attempted recognition is correct, and if not, by which he can correct the mistake as easily as possible. One such method used in the prior art is for the system to have a word in its vocabulary the utterance of which indicates that the last spoken word is incorrectly recognized and is to be deleted from display on the screen. The user is then free to repeat the intended word again, giving the system another chance to recognize it correctly. Although such a system does greatly improve the ease with which a misrecognized word can be corrected, it fails to take advantage of the fact that in most instances in which the system misrecognizes a word, it actually scores the correct word for the utterance as its second, third, or fourth choice.
Other systems in the prior art have had the system display or repeat back to the user its understanding of the word or words which have been spoken. The system then requires the user to confirm that the word or words recognized are correct, either by saying a word, such as "yes", or by pressing a keyboard key. Such a system does provide a means of confirming that the system's recognition is correct, but it places a considerable burden on the operator by requiring him to confirm the system's recognition attempts even when they are correct.