A major problem in speech recognition is that of reducing the tremendous amount of computation which such recognition requires. This is desirable so that such recognition can be performed in a reasonable amount of time by relatively inexpensive computers. Since many speech recognition systems operate by comparing a given spoken utterance against each word in their vocabularies, and since each such comparison can require tens of thousands of computer instructions, the amount of computation required to recognize speech tends to grow in proportion to the vocabulary size. This problem is particularly difficult in systems designed to handle the large vocabularies required to recognize normal speech.
Many speech recognition systems use some form of "dynamic programming", or "DP", algorithm. Typically such systems represent speech as a sequence of frames, each of which represents the speech during a brief period of time, such as a fiftieth or a hundredth of a second. Such systems normally model each vocabulary word with a sequence of node models which represent the sequence of different types of frames associated with that word. At recognition time the DP, in effect, slides forward and backward, and expands and contracts, the node models of each vocabulary word relative to the frames of the speech to find a relatively optimal time alignment between those nodes and those frames. The DP calculates the probability that a given sequence of frames matches a given word model as a function of how well each such frame matches the node model with which it has been time aligned. The word model which has the highest probability score is selected as corresponding to the speech. DP has greatly improved speech recognition. Its ability to obtain a relatively optimal time alignment between the speech to be recognized and the nodes of each word model compensates for the unavoidable differences in speaking rates which occur in different utterances of the same word. In addition, since DP scores words as a function of the fit between word models and the speech over many frames, it usually gives the correct word the best score, even if the word has been slightly misspoken or obscured by background noise. This is extremely important, because humans often mispronounce words, either by deleting or mispronouncing their proper sounds, or by inserting sounds which do not belong in them, and because some form of background noise is unavoidable in most environments in which speech recognition is likely to be used.
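The time alignment described above can be illustrated with a minimal sketch. It is not the implementation of any particular system: it assumes that frames and node models are plain tuples of numbers, that a squared distance stands in for a negative log probability (so a lower cost means a better match), and that the alignment may only stay in a node or advance to the next node from one frame to the next. The names `frame_cost`, `dp_score` and `recognize`, and the toy word models, are hypothetical.

```python
import math

def frame_cost(frame, node):
    """Squared distance between a frame and a node model.
    Assumption: this stands in for a negative log probability."""
    return sum((f - n) ** 2 for f, n in zip(frame, node))

def dp_score(frames, nodes):
    """Lowest total cost of aligning the frames, in order, to the nodes.

    Every frame is assigned to exactly one node. Between successive frames
    the alignment either stays in the same node or advances by one node,
    which lets a node stretch over many frames or shrink to a single frame.
    """
    # best[j] = cheapest alignment so far that puts the current frame in node j
    best = [math.inf] * len(nodes)
    best[0] = frame_cost(frames[0], nodes[0])
    for frame in frames[1:]:
        new = [math.inf] * len(nodes)
        for j, node in enumerate(nodes):
            stay = best[j]
            advance = best[j - 1] if j > 0 else math.inf
            prev = min(stay, advance)
            if prev < math.inf:
                new[j] = prev + frame_cost(frame, node)
        best = new
    # The alignment must finish in the last node of the word model.
    return best[-1]

def recognize(frames, word_models):
    """Pick the word whose model aligns to the frames at the lowest cost."""
    return min(word_models, key=lambda w: dp_score(frames, word_models[w]))

# Toy one-dimensional word models (hypothetical data).
word_models = {"yes": [(0.0,), (1.0,)], "no": [(1.0,), (0.0,)]}
slow = [(0.0,), (0.0,), (0.0,), (1.0,)]   # first sound held for three frames
fast = [(0.0,), (1.0,), (1.0,), (1.0,)]   # first sound held for one frame
print(recognize(slow, word_models), recognize(fast, word_models))
```

Both toy utterances are matched to the same word even though the first sound lasts three frames in one and a single frame in the other, which is the tolerance to speaking rate that the alignment is meant to provide.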
DP has a major drawback, however. It requires a tremendous amount of computation. In order for it to find the optimal time alignment between a sequence of frames and a sequence of node models, it has to compare most frames against a plurality of node models. One method of reducing the amount of computation required for DP is to use pruning. Pruning terminates the DP of a given portion of speech against a given word model if the partial probability score for that comparison drops below a given threshold. This greatly reduces computation, since the DP of a given portion of speech against most words produces poor DP scores rather quickly, enabling most words to be pruned after only a small percentage of their comparison has been performed. Unfortunately, however, even with such pruning, the amount of computation required in a large vocabulary system of the type necessary to transcribe normal dictation is still prohibitively large for present day personal computers.
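Pruning of this kind can be sketched as follows. The sketch works with costs rather than probabilities, so a probability score dropping below a threshold corresponds here to the best partial cost rising above a ceiling. The fixed ceiling `max_cost`, the squared-distance cost, and the function names are assumptions made for illustration only.

```python
import math

def frame_cost(frame, node):
    """Squared distance, standing in for a negative log probability."""
    return sum((f - n) ** 2 for f, n in zip(frame, node))

def dp_with_pruning(frames, nodes, max_cost):
    """Align frames to nodes, abandoning the word once it scores too badly.

    Returns (score, frames_processed). The score is None when the word was
    pruned, in which case frames_processed shows how early the DP stopped.
    """
    best = [math.inf] * len(nodes)
    best[0] = frame_cost(frames[0], nodes[0])
    if best[0] > max_cost:
        return None, 1
    for t, frame in enumerate(frames[1:], start=2):
        new = [math.inf] * len(nodes)
        for j, node in enumerate(nodes):
            prev = min(best[j], best[j - 1] if j > 0 else math.inf)
            if prev < math.inf:
                new[j] = prev + frame_cost(frame, node)
        best = new
        # Prune when even the best partial alignment exceeds the ceiling.
        if min(best) > max_cost:
            return None, t
    return best[-1], len(frames)

frames = [(0.0,), (0.0,), (1.0,), (1.0,)]
good_word = [(0.0,), (1.0,)]   # hypothetical model that fits the speech
poor_word = [(5.0,), (6.0,)]   # hypothetical model that does not
print(dp_with_pruning(frames, good_word, max_cost=4.0))  # (0.0, 4)
print(dp_with_pruning(frames, poor_word, max_cost=4.0))  # (None, 1)
```

In the toy data the poorly matching word is abandoned after a single frame, while the well matching word is scored over all four frames.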
If the speech to be recognized is continuous speech, the computational requirements are even greater. In continuous speech, the type which humans normally speak, words are run together, without pauses or other simple cues to indicate where one word ends and the next begins. Most humans are unaware of this because our minds are so good at speech recognition that we divide continuous speech into its constituent words without consciously thinking of it. But when a mechanical speech recognition system attempts to recognize continuous speech, it initially has no way of knowing which portions of speech correspond to individual words. Thus it initially has no idea of which portions of speech to compare against the start of word models.
One approach to this problem is to treat each successive frame of the speech as the possible beginning of a new word, and to start performing DP at each such frame against the start of each vocabulary word. But this would require a tremendous amount of computation. A more efficient method used in the prior art only starts DP against new words at those frames for which the DP indicates that the speaking of a previous word has just ended. Although this method is a considerable improvement, there is a need to reduce computation even further by reducing the number of words against which DP is started when there is an indication that a prior word has ended.
One such method of reducing the number of vocabulary words against which dynamic programming is started in continuous speech recognition was developed by the inventor of the present invention while formerly employed at IBM. This method associated with each frame of the speech to be recognized a phonetic label which identified which of a plurality of phonetic frame models compared most closely to that frame. Then it divided the speech into segments of successive frames associated with a single phonetic label. For each given segment, it took the sequence of five phonetic labels associated with that segment and the next four segments, and went to a look up table to find the set of vocabulary words which had been previously determined to have a reasonable probability of starting with that sequence of phonetic labels. It then limited the words against which dynamic programming could start in the given segment to the words in that set.
Although this method greatly reduced computation, the look up table it required used too much memory to make the method practical.
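The labeling, segmentation and table look-up steps of this method can be sketched as below. The label names, the one-dimensional frame models, the table contents and the function names are all hypothetical; only the overall flow (label each frame, merge runs of equal labels into segments, look up a five-label key) follows the description above.

```python
from itertools import groupby

def label_frames(frames, label_models):
    """Give each frame the label of the closest phonetic frame model."""
    def dist(frame, model):
        return sum((f - m) ** 2 for f, m in zip(frame, model))
    return [min(label_models, key=lambda lab: dist(frame, label_models[lab]))
            for frame in frames]

def segment_labels(labels):
    """Merge each run of successive identical labels into one segment."""
    return [lab for lab, _run in groupby(labels)]

def candidate_words(segments, start, table, span=5):
    """Look up the words that may start at segment index `start`.

    The key is the label of that segment plus the next span - 1 segments.
    An unknown or incomplete key yields no candidates.
    """
    key = tuple(segments[start:start + span])
    return table.get(key, set())

# Hypothetical phonetic frame models and look-up table.
label_models = {"A": (0.0,), "B": (1.0,), "C": (2.0,), "D": (3.0,), "E": (4.0,)}
table = {("A", "B", "C", "D", "E"): {"word1", "word2"}}

frames = [(0.1,), (0.0,), (1.1,), (2.0,), (2.1,), (2.9,), (4.0,)]
segments = segment_labels(label_frames(frames, label_models))
print(segments)                             # ['A', 'B', 'C', 'D', 'E']
print(candidate_words(segments, 0, table))  # the two hypothetical words
```

The memory problem noted above is visible in the shape of the table: with N distinct labels there are up to N to the fifth power possible keys, each of which may map to a set of words.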
Other schemes have been used for reducing the number of vocabulary words against which dynamic programming is performed in discrete, as opposed to continuous, speech recognition. Such prefiltering schemes generally perform a superficial analysis of the separately spoken word to be recognized, and, from that analysis, select a relatively small subset of the vocabulary words as candidates for DP. One such method is disclosed in U.S. Patent application Ser. No. 797,249, filed by Baker et al. on Nov. 12th, 1985 and entitled "Speech Recognition Apparatus and Method" (hereinafter referred to as Application 797,249). Application 797,249 has been assigned to the assignee of the present application and is incorporated herein by reference. It discloses a method of prefiltering which compares three sets of averaged frame models from the beginning of a separate word to be recognized against a corresponding three sets of averaged frame models from each of the vocabulary words. Based on this comparison, it selects which vocabulary words appear similar enough to the speech to warrant a more detailed comparison.
Although the prefiltering of Application 797,249 significantly improves recognition speed, the embodiment of the prefiltering scheme disclosed in that application is not designed for the recognition of continuous speech. In addition, that prefiltering scheme uses linear time alignment to compare a sequence of models from the speech to be recognized against a sequence of models for each vocabulary word. Unlike dynamic programming, linear time alignment does not stretch or compress one sequence so as to find a relatively optimal match against another. Instead it makes its comparison without any such stretching or compression. Its benefit is that it greatly reduces computation, but its drawback is that its comparisons tend to be much less tolerant of changes in the speaking rate, or of insertions or deletions of speech sounds, than comparisons made by dynamic programming. As a result, prefiltering schemes which use linear time alignment tend to be less accurate than desired.
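The difference between linear time alignment and dynamic programming can be made concrete with a small sketch. It is not the method of Application 797,249; it only assumes that both sequences are lists of numeric tuples and that the k-th model of one sequence is compared with the k-th model of the other, with no stretching or compression. The name `linear_score` and the toy data are hypothetical.

```python
def linear_score(speech_models, word_models):
    """Sum of squared distances between models at the same position.

    Position k of the speech is always compared with position k of the
    word, so the cost is one comparison per position and nothing more.
    """
    return sum((s - w) ** 2
               for s_model, w_model in zip(speech_models, word_models)
               for s, w in zip(s_model, w_model))

word = [(0.0,), (1.0,), (2.0,)]            # hypothetical word model
same_rate = [(0.0,), (1.0,), (2.0,)]       # spoken at the modeled rate
slower = [(0.0,), (0.0,), (1.0,)]          # first sound held twice as long

print(linear_score(same_rate, word))  # 0.0
print(linear_score(slower, word))     # 2.0
```

The slower utterance contains the same sounds in the same order, yet it is penalized because its second and third positions no longer line up with those of the word model. A comparison that could stretch the first node over two frames would not incur that penalty.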
In addition to prefiltering of the type described in Application 797,249, which makes a superficial comparison against each word in the system vocabulary, the prior art has used lexical retrieval to reduce the number of vocabulary words against which an utterance has to be compared. In lexical retrieval, information from the utterance to be recognized generates a group of words against which recognition is to be performed, without making a superficial comparison against each vocabulary word. In this application, the term "prefiltering" will be used to include such lexical retrieval.
The HEARSAY speech recognition program developed at Carnegie-Mellon University in the early 1970's used lexical retrieval. It had acoustic models of most syllables which occur in English. When an utterance to be recognized was received, it was compared against these syllable models, producing a list of syllables considered likely to occur in the utterance to be recognized. Words containing those syllables were then chosen for comparison against the utterance to be recognized.
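Retrieval of words from a list of likely syllables can be pictured as an inverted index from syllables to words. The sketch below is only an illustration of that idea and makes no claim about how HEARSAY was actually implemented; the lexicon, its syllable spellings and the function names are hypothetical, and the acoustic step that produces the likely syllables is assumed to have already run.

```python
from collections import defaultdict

def build_syllable_index(lexicon):
    """Map each syllable to the set of words that contain it."""
    index = defaultdict(set)
    for word, syllables in lexicon.items():
        for syllable in syllables:
            index[syllable].add(word)
    return index

def retrieve(index, likely_syllables):
    """Collect every word containing at least one of the likely syllables."""
    words = set()
    for syllable in likely_syllables:
        words |= index.get(syllable, set())
    return words

# Hypothetical lexicon with informal syllable spellings.
lexicon = {
    "seven": ["sev", "en"],
    "eleven": ["e", "lev", "en"],
    "two": ["two"],
}
index = build_syllable_index(lexicon)
print(sorted(retrieve(index, ["en"])))  # ['eleven', 'seven']
```

Note that no comparison is made against words that share no syllable with the utterance, which is what distinguishes this kind of retrieval from a superficial comparison against every vocabulary word.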
Speech recognition programs written at Bolt, Beranek, and Newman have performed lexical retrieval by mapping all vocabulary words onto a common tree, in which branches correspond to phonemes. The root of the tree is the start of the word. Its first branches represent all the different initial phonemes contained in the vocabulary words. The second level branches connected to a given first level branch represent all the second phonemes in the vocabulary words which follow the first phoneme represented by the given first level branch. This is continued for multiple levels, so that words which start with a similar string of phonemes share a common initial path in the tree. When an utterance to be recognized is received, its successive parts are compared with a set of phoneme models, and the scores resulting from those comparisons are used to select those parts of the tree which probably correspond to the word to be recognized. The vocabulary words associated with those parts of the tree are then compared in greater detail against the word to be recognized.
Another method of lexical retrieval is disclosed in U.S. patent application Ser. No. 919,885, filed by Gillick et al. on Oct. 10th, 1986 and entitled "A Method For Creating And Using Multiple-Word Sound Models in Speech Recognition" (hereinafter referred to as Application 919,885). Application 919,885 has been assigned to the assignee of the present application, and is incorporated herein by reference. It discloses a method of prefiltering which uses linear time alignment to compare a sequence of models from the speech to be recognized against a corresponding sequence of models associated with each of a plurality of word-start cluster models. The word-start clusters are derived by dividing the sequences of acoustic models associated with the start of all of the system's vocabulary words into groups, or clusters, of relatively similar model sequences. Each of the resulting word-start clusters has a sequence of models calculated for it which statistically models the sequences of models placed within it, and this sequence of models forms the word-start cluster model for that cluster. The use of word-start cluster models greatly reduces computation. It does this because it enables the system to determine if a given portion of the speech to be recognized is similar to the start of a whole group of words by comparing that portion of speech against the word-start cluster model for that group, rather than requiring a comparison against a separate model representing each word in that group.
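The saving offered by word-start cluster models can be sketched as follows. This is not the procedure of Application 919,885: the greedy grouping rule, the use of a plain average as the cluster model, the `radius` and `threshold` parameters, and the toy word starts are all assumptions chosen to keep the example short. What the sketch does preserve is that the speech is compared once per cluster rather than once per word.

```python
def seq_dist(a, b):
    """Linear (position-by-position) squared distance between two sequences."""
    return sum((x - y) ** 2 for fa, fb in zip(a, b) for x, y in zip(fa, fb))

def mean_seq(seqs):
    """Average several equal-length sequences position by position."""
    n = len(seqs)
    return [tuple(sum(vals) / n for vals in zip(*frames))
            for frames in zip(*seqs)]

def build_clusters(word_starts, radius):
    """Greedy grouping (an assumption): a word joins the first cluster whose
    model lies within `radius`, otherwise it starts a new cluster."""
    clusters = []
    for word, seq in word_starts.items():
        for cluster in clusters:
            if seq_dist(seq, cluster["model"]) <= radius:
                cluster["words"].append(word)
                members = [word_starts[w] for w in cluster["words"]]
                cluster["model"] = mean_seq(members)
                break
        else:
            clusters.append({"words": [word], "model": list(seq)})
    return clusters

def prefilter(speech_start, clusters, threshold):
    """Return the words of every cluster whose model is close to the speech."""
    selected = []
    for cluster in clusters:
        if seq_dist(speech_start, cluster["model"]) <= threshold:
            selected.extend(cluster["words"])
    return selected

# Hypothetical word-start model sequences.
word_starts = {
    "stop": [(0.0,), (1.0,)],
    "stock": [(0.1,), (1.0,)],
    "go": [(5.0,), (6.0,)],
}
clusters = build_clusters(word_starts, radius=0.5)
print(len(clusters))                                             # 2
print(prefilter([(0.0,), (1.1,)], clusters, threshold=0.5))      # ['stop', 'stock']
```

Here three words require only two comparisons at prefiltering time. With a large vocabulary whose word starts fall into comparatively few clusters, the reduction would be correspondingly larger.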
The prefiltering method described in Application 919,885 provides good prefiltering, but the embodiment of that method shown is designed for separately spoken words, rather than continuous speech.