A major problem in speech recognition is that of reducing the tremendous amount of computation which such recognition requires. This is desirable so that such recognition can be performed in a reasonable amount of time by relatively inexpensive computers. Since many speech recognition systems operate by comparing a given spoken utterance against each word in their vocabularies, and since each such comparison can require tens of thousands of computer instructions, the amount of computation required to recognize speech tends to grow in proportion to the vocabulary size. This problem is particularly difficult in systems designed to handle the large vocabularies required to recognize normal speech.
Many speech recognition systems use some form of "dynamic programming", or "DP", algorithm. Typically such systems represent speech as a sequence of frames, each of which represents the speech during a brief period of time, such as a fiftieth or a hundredth of a second. Such systems normally model each vocabulary word with a sequence of node models which represent the sequence of different types of frames associated with that word. At recognition time the DP, in effect, slides forward and backward, and expands and contracts, the node models of each vocabulary word relative to the frames of the speech to find a relatively optimal time alignment between those nodes and those frames. The DP calculates the probability that a given sequence of frames matches a given word model as a function of how well each such frame matches the node model with which it has been time aligned. The word model which has the highest probability score is selected as corresponding to the speech.
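The time alignment described above can be illustrated with a minimal sketch. It is not taken from any particular prior art system. It assumes, purely for illustration, that each frame and each node model is a single number, that the squared difference between them stands in for a negative log probability (so a lower cost corresponds to a higher probability), and that each frame is aligned either with the same node as the preceding frame or with the next node. The names `dp_align_cost` and `recognize` are hypothetical.

```python
import math

def dp_align_cost(frames, nodes):
    """Lowest total mismatch cost of aligning frames to nodes in order.

    The first frame is aligned with the first node and the last frame with
    the last node. Between frames the node index either stays the same or
    advances by one, which lets a node stretch over several frames.
    """
    n = len(nodes)
    # best[j] = lowest cost so far with the current frame aligned to node j
    best = [math.inf] * n
    best[0] = (frames[0] - nodes[0]) ** 2
    for f in frames[1:]:
        new = [math.inf] * n
        for j in range(n):
            prev = best[j] if j == 0 else min(best[j], best[j - 1])
            new[j] = prev + (f - nodes[j]) ** 2
        best = new
    return best[-1]

def recognize(frames, word_models):
    """Pick the word whose node sequence aligns with the lowest cost."""
    return min(word_models, key=lambda w: dp_align_cost(frames, word_models[w]))

# Hypothetical two-node word models and a slowly spoken utterance.
word_models = {"a": [1.0, 5.0], "b": [5.0, 1.0]}
utterance = [1.0, 1.0, 1.0, 5.0]
print(recognize(utterance, word_models))
```

Because the first node of word "a" is allowed to stretch over three frames, the slowly spoken utterance still aligns with that word at zero cost.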
DP has greatly improved speech recognition. Its ability to obtain a relatively optimal time alignment between the speech to be recognized and the nodes of each word model compensates for the unavoidable differences in speaking rates which occur in different utterances of the same word. In addition, since DP scores words as a function of the fit between word models and the speech over many frames, it usually gives the correct word the best score, even if the word has been slightly misspoken or obscured by background noise. This is extremely important, because humans often mispronounce words, either by deleting or mispronouncing their proper sounds, or by inserting sounds which do not belong in them, and because some form of background noise is unavoidable in most environments in which speech recognition is likely to be used.
DP has a major drawback, however. It requires a tremendous amount of computation. In order for it to find the optimal time alignment between a sequence of frames and a sequence of node models, it has to compare most frames against a plurality of node models. One method of reducing the amount of computation required for DP is to use pruning. Pruning terminates the DP of a given portion of speech against a given word model if the partial probability score for that comparison drops below a given threshold. This greatly reduces computation, since the DP of a given portion of speech against most words produces poor DP scores rather quickly, enabling most words to be pruned after only a small percentage of their comparison has been performed. Unfortunately, however, even with such pruning, the amount of computation required in a large vocabulary system of the type necessary to transcribe normal dictation is still prohibitively large for present day personal computers.
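Pruning of the kind just described can be sketched as follows. The sketch makes the same illustrative assumptions as a simple one-dimensional DP: frames and node models are single numbers, and a squared-difference cost stands in for a negative log probability, so "the partial probability score drops below a threshold" becomes "the best partial cost rises above `max_cost`". The function name and the threshold value are hypothetical.

```python
import math

def dp_with_pruning(frames, nodes, max_cost):
    """Align frames to nodes, abandoning the word once it scores too badly.

    Returns (cost, frames_processed). cost is None if the word was pruned.
    """
    n = len(nodes)
    best = [math.inf] * n
    for t, f in enumerate(frames):
        new = [math.inf] * n
        if t == 0:
            # The first frame can only be aligned with the first node.
            new[0] = (f - nodes[0]) ** 2
        else:
            for j in range(n):
                prev = best[j] if j == 0 else min(best[j], best[j - 1])
                new[j] = prev + (f - nodes[j]) ** 2
        best = new
        # Prune: even the best partial alignment is already too costly.
        if min(best) > max_cost:
            return None, t + 1
    return best[-1], len(frames)

utterance = [1.0, 1.0, 1.0, 5.0]
print(dp_with_pruning(utterance, [1.0, 5.0], max_cost=10.0))  # well-matching word
print(dp_with_pruning(utterance, [5.0, 1.0], max_cost=10.0))  # poorly matching word
```

In this toy case the poorly matching word is abandoned after a single frame, while the well-matching word is carried through all four frames.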
If the speech to be recognized is continuous speech, the computational requirements are even greater. In continuous speech, the type which humans normally speak, words are run together, without pauses or other simple cues to indicate where one word ends and the next begins. Most humans are unaware of this because our minds are so good at speech recognition that we divide continuous speech into its constituent words without consciously thinking of it. But when a mechanical speech recognition system attempts to recognize continuous speech, it initially has no way of knowing which portions of speech correspond to individual words. Thus it initially has no idea of which portions of speech to compare against the start of word models.
One approach to this problem is to treat each successive frame of the speech as the possible beginning of a new word, and to start performing DP at each such frame against the start of each vocabulary word. But this would require a tremendous amount of computation. A more efficient method used in the prior art only starts DP against new words at those frames for which the DP indicates that the speaking of a previous word has just ended. Although this method is a considerable improvement, there is a need to reduce computation even further by reducing the number of words against which DP is started when there is an indication that a prior word has ended.
One such method of reducing the number of vocabulary words against which dynamic programming is started in continuous speech recognition was developed by the inventor of the present invention while formerly employed at IBM. This method associated with each frame of the speech to be recognized a phonetic label which identified which of a plurality of phonetic frame models compared most closely to that frame. Then it divided the speech into segments of successive frames associated with a single phonetic label. For each given segment, it took the sequence of five phonetic labels associated with that segment and the next four segments, went to a look up table, and found the set of vocabulary words which had been previously determined to have a reasonable probability of starting with that sequence of phonetic labels. It then limited the words against which dynamic programming could start in the given segment to words in that set.
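The segmentation and table look up described above can be sketched as follows. The phonetic labels, the table contents, and the names `segment_labels` and `candidates_at` are hypothetical and chosen only for illustration; the sketch says nothing about how the original table was built or stored.

```python
from itertools import groupby

def segment_labels(frame_labels):
    """Collapse runs of identically labeled frames into one label per segment."""
    return [label for label, _run in groupby(frame_labels)]

def candidates_at(segments, i, table, span=5):
    """Words allowed to start at segment i, keyed by the labels of that
    segment and the next span - 1 segments."""
    key = tuple(segments[i:i + span])
    return table.get(key, set())

# Hypothetical per-frame labels and a one-entry look up table.
frame_labels = ["s", "s", "ih", "ih", "ih", "k", "s", "s", "t", "t", "ae"]
table = {("s", "ih", "k", "s", "t"): {"sixty", "sixteen"}}

segments = segment_labels(frame_labels)
print(segments)
print(candidates_at(segments, 0, table))
```

A table keyed on every five-label sequence grows very quickly with the number of distinct labels, which is consistent with the memory problem noted for this method.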
Although this method greatly reduced computation, the look up table it required used too much memory to make the method practical.
Other schemes have been used for reducing the number of vocabulary words against which dynamic programming is performed in discrete, as opposed to continuous, speech recognition. Such prefiltering schemes generally perform a superficial analysis of the separately spoken word to be recognized, and, from that analysis, select a relatively small subset of the vocabulary words as candidates for DP. One such method is disclosed in U.S. patent application Ser. No. 797,249, filed by Baker et al. on Nov. 12, 1985 and entitled "Speech Recognition Apparatus and Method" (hereinafter referred to as Application Ser. No. 797,249). Application Ser. No. 797,249 has been assigned to the assignee of the present application and is incorporated herein by reference. It discloses a method of prefiltering which compares three sets of averaged frame models from the beginning of a separate word to be recognized against a corresponding three sets of averaged frame models from each of the vocabulary words. Based on this comparison, it selects which vocabulary words appear similar enough to the speech to warrant a more detailed comparison.
Although the prefiltering of Application Ser. No. 797,249 significantly improves recognition speed, the embodiment of the prefiltering scheme disclosed in that application is not designed for the recognition of continuous speech. In addition, that prefiltering scheme uses linear time alignment to compare a sequence of models from the speech to be recognized against a sequence of models for each vocabulary word. Unlike dynamic programming, linear time alignment does not stretch or compress one sequence so as to find a relatively optimal match against another. Instead it makes its comparison without any such stretching or compression. Its benefit is that it greatly reduces computation, but its drawback is that its comparisons tend to be much less tolerant of changes in the speaking rate, or of insertions or deletions of speech sounds, than comparisons made by dynamic programming. As a result, prefiltering schemes which use linear time alignment tend to be less accurate than desired.
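The sensitivity of linear time alignment to speaking rate can be seen in a minimal sketch. It assumes, for illustration only, that each model in a sequence is a single number, that both sequences have the same length, and that the comparison is a sum of squared differences between models at the same position. The function name is hypothetical and the sketch is not the comparison used in Application Ser. No. 797,249.

```python
def linear_alignment_cost(speech, word):
    """Compare two equal-length sequences position by position,
    with no stretching or compression of either sequence."""
    return sum((s - w) ** 2 for s, w in zip(speech, word))

word_model = [1.0, 1.0, 5.0, 5.0]
same_rate = [1.0, 1.0, 5.0, 5.0]   # spoken at the rate the model expects
slower = [1.0, 1.0, 1.0, 5.0]      # first sound held one position longer

print(linear_alignment_cost(same_rate, word_model))
print(linear_alignment_cost(slower, word_model))
```

Only one comparison per position is made, which is why the method is cheap, but a single held sound is enough to put mismatched models opposite one another and raise the cost.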
In addition to prefiltering of the type described in Application Ser. No. 797,249, which makes a superficial comparison against each word in the system vocabulary, the prior art has used lexical retrieval to reduce the number of vocabulary words against which an utterance has to be compared. In lexical retrieval, information from the utterance to be recognized generates a group of words against which recognition is to be performed, without making a superficial comparison against each vocabulary word. In this application, the term "prefiltering" will be used to include such lexical retrieval.
The HEARSAY speech recognition program developed at Carnegie-Mellon University in the early 1970's used lexical retrieval. It had acoustic models of most syllables which occur in English. When an utterance to be recognized was received, it was compared against these syllable models, producing a list of syllables considered likely to occur in the utterance to be recognized. Words containing those syllables were then chosen for comparison against the utterance to be recognized.
Speech recognition programs written at Bolt, Beranek, and Newman have performed lexical retrieval by mapping all vocabulary words onto a common tree, in which branches correspond to phonemes. The root of the tree is the start of the word. Its first branches represent all the different initial phonemes contained in the vocabulary words. The second level branches connected to a given first level branch represent all the second phonemes in the vocabulary words which follow the first phoneme represented by the given first level branch. This is continued for multiple levels, so that words which start with a similar string of phonemes share a common initial path in the tree. When an utterance to be recognized is received, its successive parts are compared with a set of phoneme models, and the scores resulting from those comparisons are used to select those parts of the tree which probably correspond to the word to be recognized. The vocabulary words associated with those parts of the tree are then compared in greater detail against the word to be recognized.
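A phoneme tree of this general kind can be sketched as follows. The sketch is a simplification: it follows a single given phoneme prefix rather than using phoneme scores to select several likely parts of the tree, and the lexicon, the phoneme symbols, and the function names are hypothetical.

```python
def build_tree(lexicon):
    """Map every word onto a common tree whose branches are phonemes."""
    root = {"children": {}, "words": []}
    for word, phonemes in lexicon.items():
        node = root
        for p in phonemes:
            node = node["children"].setdefault(p, {"children": {}, "words": []})
        node["words"].append(word)
    return root

def words_under(node):
    """All words that end at this node or anywhere below it."""
    found = list(node["words"])
    for child in node["children"].values():
        found.extend(words_under(child))
    return found

def retrieve(root, prefix):
    """Words whose phoneme strings start with the given prefix."""
    node = root
    for p in prefix:
        node = node["children"].get(p)
        if node is None:
            return []
    return sorted(words_under(node))

# Hypothetical three-word lexicon.
lexicon = {"cat": ["k", "ae", "t"], "cab": ["k", "ae", "b"], "dog": ["d", "ao", "g"]}
tree = build_tree(lexicon)
print(retrieve(tree, ["k", "ae"]))
```

Words sharing initial phonemes share a path, so the two words beginning with "k", "ae" are reached by following one pair of branches rather than by examining each word separately.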
Another method of lexical retrieval is disclosed in U.S. patent application Ser. No. 919,885, filed by Gillick et al. on Oct. 10th, 1986 and entitled "A Method For Creating And Using Multiple-Word Sound Models in Speech Recognition" (hereinafter referred to as Application Ser. No. 919,885). Application Ser. No. 919,885 has been assigned to the assignee of the present application, and is incorporated herein by reference. It discloses a method of prefiltering which uses linear time alignment to compare a sequence of models from the speech to be recognized against corresponding sequences of models which are associated with the beginning of one or more vocabulary words. This method compensates for its use of linear time alignment by combining its prefilter score produced by linear time alignment with another prefilter score which is calculated in a manner that is very forgiving of changes in speaking rate or the improper insertion or deletion of speech sounds. This other prefilter score, which is referred to as the "histogram prefiltering" score in Application Ser. No. 919,885, is calculated by labeling each of a plurality of frames from the utterance to be recognized with the label of the phonetic frame model which compares most closely to it. Then each labeled frame has associated with it, for each vocabulary word, the probability that a given frame from the initial portion of that vocabulary word would be associated with that frame's label. These probabilities are combined over successive frames for each vocabulary word to produce the histogram prefilter score for that word.
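A score of this general form can be sketched as follows. The sketch assumes that the per-frame probabilities are combined by summing their logarithms and that labels never observed for a word receive a small floor probability; both are assumptions made for illustration, since the combination rule is not detailed here. The labels, probabilities, and function name are hypothetical.

```python
import math

def histogram_score(frame_labels, label_probs, floor=1e-3):
    """Sum of log probabilities that frames from the start of a word
    would carry each observed label. The order of the frames is ignored."""
    return sum(math.log(label_probs.get(label, floor)) for label in frame_labels)

# Hypothetical label probabilities for the initial portion of two words.
word_label_probs = {
    "six": {"s": 0.5, "ih": 0.4, "k": 0.1},
    "two": {"t": 0.5, "uw": 0.5},
}
observed = ["s", "s", "ih"]

scores = {w: histogram_score(observed, p) for w, p in word_label_probs.items()}
print(max(scores, key=scores.get))
```

Because the score depends only on which labels occur and how often, and not on their order or timing, it is insensitive to speaking rate, which also means it discards information that an alignment-based score would use.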
The prefiltering method described in Application Ser. No. 919,885 provides good prefiltering, but the embodiment of that method shown is designed for separately spoken words, rather than continuous speech. In addition, although the so-called "histogram prefiltering method" is very computationally efficient and provides a good complement to the prefilter score obtained by linear time alignment, its performance is not optimal.
Several of the prior art prefiltering schemes described above associate phonetic labels with individual speech frames. Although methods for such frame labeling are well known in the art, most of them are not as accurate as desired. For example, random noise is often added to the speech sounds represented by frames, causing those frames to be mislabeled. Also, sampling errors often cause the values of frames to fluctuate. These sampling errors often result because many speech sounds are brief relative to the length of the frames, and the representation of such brief sounds by the frames tends to vary as a function of the relative timing between such sounds and such frames. For example, vowel sounds contain between one hundred and four hundred pulses per second generated by the opening and closing of the vocal cords, with the exact frequency defining the pitch of the speaker's voice. As a result, the frames recorded for a single continuous vowel sound often vary as a function of changes in the relative timing between such pulses and individual frames. The inaccuracies in frame values which result from all these causes not only cause individual frames to be mislabeled, but they also tend to cause any division of the speech into segments based upon such frame labeling to be erroneous.
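The effect of such inaccuracies on frame labeling can be illustrated with a minimal sketch. It assumes, for illustration only, that each frame and each phonetic frame model is a short vector of numbers and that "compares most closely" means smallest squared Euclidean distance. The model values, label names, and function name are hypothetical.

```python
def label_frame(frame, models):
    """Return the name of the phonetic frame model closest to the frame."""
    def distance(name):
        return sum((a - b) ** 2 for a, b in zip(frame, models[name]))
    return min(models, key=distance)

# Two hypothetical phonetic frame models.
models = {"aa": (1.0, 0.0), "iy": (0.0, 1.0)}

# Three frames of one steady vowel, then the same vowel with a disturbed middle frame.
clean = [(0.9, 0.1), (0.9, 0.1), (0.9, 0.1)]
noisy = [(0.9, 0.1), (0.4, 0.6), (0.9, 0.1)]

print([label_frame(f, models) for f in clean])
print([label_frame(f, models) for f in noisy])
```

In the disturbed case a single mislabeled frame splits what should be one segment into three, which is the kind of segmentation error described above.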