While machines which recognize discrete, or isolated, words are well-known in the art, there is on-going research and development in constructing large vocabulary systems for recognizing continuous speech. Examples of discrete speech recognition systems are described in U.S. Pat. No. 4,783,803 (Baker et al., Nov. 8, 1988) and U.S. Pat. No. 4,837,831 (Gillick et al., Jun. 6, 1989), both of which are assigned to the assignee of the present application and are herein incorporated by reference. Generally, most speech recognition systems match an acoustic description of words, or parts of words, in a predetermined vocabulary against a representation of the acoustic signal generated by the utterance of the word to be recognized. One method for establishing the vocabulary is through the incorporation of a training process, by which a user "trains" the computer to identify a certain word having a specific acoustic segment.
A large number of calculations are required to identify a spoken word from a given large vocabulary in a speech recognition system. The number of calculations would effectively prevent real-time identification of spoken words in such a speech recognition system. Pre-filtering is one means of identifying a preliminary set of word models against which an acoustic model may be compared. Pre-filtering enables such a speech recognition system to identify spoken words in real-time.
Present pre-filtering systems used in certain prior art discrete word recognition systems rely upon identification of the beginning of a word. One example, as described in detail in U.S. Pat. No. 4,837,831, involves establishing an anchor for each utterance of each word, which anchor then forms the starting point of calculations. That patent discloses a system in which each vocabulary word is represented by a sequence of statistical node models. Each such node model is a multi-dimensional probability distribution, each dimension of which represents the probability distribution for the values of a given frame parameter if its associated frame belongs to the class of sounds represented by the node model. Each dimension of the probability distribution is represented by two statistics, an estimated expected value, or mu, and an estimated absolute deviation, or sigma. A method for deriving statistical models of a basic type is disclosed in U.S. Pat. No. 4,903,305 (Gillick et al., Feb. 20, 1990), which is assigned to the assignee of the present application and which is herein incorporated by reference.
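By way of illustration only, the scoring of a single frame against a node model of the kind described above may be sketched as follows. The sketch assumes that each dimension is modeled as an independent Laplacian distribution, a form chosen here because it is fully specified by an expected value and an absolute deviation; the cited patents are not quoted for this form, and the function name `node_neg_log_likelihood` is hypothetical.

```python
import math


def node_neg_log_likelihood(frame, mu, sigma):
    """Score one frame against one node model (lower is a better match).

    Assumption: each dimension is an independent Laplacian distribution
    with expected value mu[d] and absolute deviation sigma[d].
    """
    if not (len(frame) == len(mu) == len(sigma)):
        raise ValueError("frame, mu and sigma must have the same length")
    score = 0.0
    for x, m, s in zip(frame, mu, sigma):
        # Laplacian density: exp(-|x - m| / s) / (2 * s)
        score += abs(x - m) / s + math.log(2.0 * s)
    return score


if __name__ == "__main__":
    # A frame lying on the expected values scores better than a distant one.
    print(node_neg_log_likelihood([1.0, 2.0], [1.0, 2.0], [0.5, 0.5]))
    print(node_neg_log_likelihood([3.0, 2.0], [1.0, 2.0], [0.5, 0.5]))
```

The score is a negative log likelihood, so that scores from successive frames may be added rather than multiplied.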
U.S. Pat. No. 4,903,305 discloses dividing the nodes from many words into groups of nodes with similar statistical acoustic models, forming clusters, and calculating a statistical acoustic model for each such cluster. The model for a given cluster is then used in place of the individual node models from different words which have been grouped into that cluster, greatly reducing the number of models which have to be stored. One use of such cluster models is found in U.S. Pat. No. 4,837,831 (Gillick et al., Jun. 6, 1989), cited above. In that patent, the acoustic description of the utterance to be recognized includes a succession of acoustic descriptions, representing a sequence of sounds associated with that utterance. A succession of the acoustic representations from the utterance to be recognized is compared against the succession of acoustic models associated with each cluster model to produce a cluster likelihood score for each such cluster. These cluster models are "wordstart" models, that is, models which normally represent the initial portion of vocabulary words. The likelihood score produced for a given wordstart cluster model is used as an initial prefiltering score for each of its corresponding words. Extra steps are included which compare acoustic models from portions of each such word following that represented by its wordstart model against acoustic descriptions from the utterance to be recognized. Vocabulary words having the worst scoring wordstart models are pruned from further consideration before performing extra prefilter scoring steps. The comparison between the succession of acoustic descriptions associated with the utterance to be recognized and the succession of acoustic models in such a cluster model is performed using linear time alignment.
The acoustic description of the utterance to be recognized comprises a sequence of individual frames, each describing the utterance during a brief period of time, and a series of smoothed frames, each derived from a weighted average of a plurality of individual frames, is used in the comparison against the cluster model.
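The smoothing and linear time alignment described above may be sketched as follows. This is a minimal illustration under stated assumptions: the window weights, the step between windows, the absolute-deviation distance, and the names `smooth_frames` and `linear_alignment_score` are all hypothetical and are not taken from the cited patent.

```python
def smooth_frames(frames, weights, step):
    """Derive smoothed frames, each a weighted average of len(weights)
    consecutive individual frames, advancing `step` frames per window."""
    width = len(weights)
    total = float(sum(weights))
    smoothed = []
    for start in range(0, len(frames) - width + 1, step):
        window = frames[start:start + width]
        dims = len(window[0])
        smoothed.append([
            sum(w * f[d] for w, f in zip(weights, window)) / total
            for d in range(dims)
        ])
    return smoothed


def linear_alignment_score(smoothed, cluster_model):
    """Compare the i-th smoothed frame against the i-th model of the cluster,
    one to one, with no time warping (lower is a better match).

    cluster_model is a list of (mu, sigma) pairs, one pair per smoothed frame.
    """
    if len(smoothed) < len(cluster_model):
        raise ValueError("not enough smoothed frames for this cluster model")
    return sum(
        abs(x - m) / s
        for frame, (mu, sigma) in zip(smoothed, cluster_model)
        for x, m, s in zip(frame, mu, sigma)
    )


if __name__ == "__main__":
    frames = [[0.0], [2.0], [4.0], [6.0]]
    smoothed = smooth_frames(frames, weights=[1, 2, 1], step=1)
    model = [([2.0], [1.0]), ([3.0], [1.0])]
    print(smoothed, linear_alignment_score(smoothed, model))
```

Because each smoothed frame is paired with exactly one model, the comparison is inexpensive, but it does not by itself compensate for changes in speaking rate.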
Other methods for reducing the size of a set against which utterances are to be identified by the system include pruning and lexical retrieval. U.S. Pat. No. 4,837,831, cited above, discloses a method of prefiltering which compares a sequence of models from the speech to be recognized against corresponding sequences of models which are associated with the beginning of one or more vocabulary words. This method compensates for its use of linear time alignment by combining its prefilter score produced by linear time alignment with another prefilter score which is calculated in a manner that is forgiving of changes in speaking rate or improper insertion or deletion of speech sounds.
The statistical method of hidden Markov modeling, as incorporated into a continuous speech recognition system, is described in detail in U.S. Pat. No. 4,803,729 (Baker et al., Feb. 7, 1989), which is assigned to the assignee of this application, and which is herein incorporated by reference. In that patent, use of the hidden Markov model as a technique for determining which phonetic label should be associated with each frame is disclosed. That stochastic model, utilizing the Markov assumption, greatly reduces the amount of computation required to solve complex statistical probability equations such as are necessary for word recognition systems. Although the hidden Markov model increases the speed of such speech recognition systems, the problem remains in applying such a statistical method to continuous word recognition where the beginning of each word is contained in a continuous sequence of utterances.
Many discrete speech recognition systems use some form of a "dynamic programming" algorithm. Dynamic programming is an algorithm for implementing certain calculations to which a hidden Markov Model leads. In the context of speech recognition systems, dynamic programming performs calculations to determine the probabilities that a hidden Markov Model would assign to given data.
Typically, speech recognition systems using dynamic programming represent speech as a sequence of frames, each of which represents the speech during a brief period of time, e.g., a fiftieth or a hundredth of a second. Such systems normally model each vocabulary word with a sequence of node models which represent the sequence of different frames associated with that word. Roughly speaking, the effect of dynamic programming, at the time of recognition, is to slide, or expand and contract, an operating region, or window, relative to the frames of speech so as to align those frames with the node models of each vocabulary word to find a relatively optimal time alignment between those frames and those nodes. The dynamic programming in effect calculates the probability that a given sequence of frames matches a given word model as a function of how well each such frame matches the node model with which it has been time-aligned. The word model which has the highest probability score is selected as corresponding to the speech. Dynamic programming obtains relatively optimal time alignment between the speech to be recognized and the nodes of each word model, which compensates for the unavoidable differences in speaking rates which occur in different utterances of the same word. In addition, since dynamic programming scores words as a function of the fit between word models and the speech over many frames, it usually gives the correct word the best score, even if the word has been slightly misspoken or obscured by background sound. This is important, because humans often mispronounce words either by deleting or mispronouncing proper sounds, or by inserting sounds which do not belong. Even absent any background sound, there is an inherent variability to human speech which must be considered in a speech recognition system.
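The time alignment described above may be illustrated by the following sketch, which is a simplified example and not the implementation of any cited system. It assumes one-dimensional frames, node models reduced to a single expected value, a cost equal to the absolute difference (standing in for a negative log probability, so that the lowest cost corresponds to the highest probability), and a constraint that each node absorbs one or more consecutive frames. The names `align_cost` and `best_word` are hypothetical.

```python
INF = float("inf")


def align_cost(frames, nodes, dist=lambda f, n: abs(f - n)):
    """Lowest total cost of aligning frames, in order, against nodes, in order,
    where every node is matched to one or more consecutive frames."""
    if not nodes or len(frames) < len(nodes):
        return INF
    # best[j]: lowest cost so far with the current frame aligned to node j
    best = [INF] * len(nodes)
    best[0] = dist(frames[0], nodes[0])
    for frame in frames[1:]:
        new = [INF] * len(nodes)
        for j, node in enumerate(nodes):
            stay = best[j]                           # remain in the same node
            advance = best[j - 1] if j > 0 else INF  # move on from the prior node
            prev = min(stay, advance)
            if prev < INF:
                new[j] = prev + dist(frame, node)
        best = new
    return best[-1]


def best_word(frames, word_models):
    """Select the vocabulary word whose node sequence aligns at lowest cost."""
    return min(word_models, key=lambda w: align_cost(frames, word_models[w]))


if __name__ == "__main__":
    models = {"a": [0.0, 5.0], "b": [5.0, 0.0]}
    # The same word spoken at two different rates aligns equally well.
    print(align_cost([0.0, 0.0, 5.0], models["a"]))
    print(align_cost([0.0, 0.0, 0.0, 0.0, 5.0, 5.0], models["a"]))
    print(best_word([0.0, 0.0, 5.0], models))
```

The inner loop compares every frame against every node of every word model, which indicates why the computation grows rapidly with vocabulary size.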
Dynamic programming requires a tremendous amount of computation. In order for it to find the optimal time alignment between a sequence of frames and a sequence of node models, it must compare most frames against a plurality of node models. One method of reducing the amount of computation required for dynamic programming is to use pruning. Pruning terminates the dynamic programming of a given portion of speech against a given word model if the partial probability score for that comparison drops below a given threshold. This greatly reduces computation, since the dynamic programming of a given portion of speech against most words produces poor dynamic programming scores rather quickly, enabling most words to be pruned after only a small percent of their comparison has been performed. Unfortunately, however, even with such pruning, the amount of computation required in large vocabulary systems of the type necessary to transcribe normal dictation remains very large.
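Pruning of the kind described above may be sketched as follows. The example is illustrative only and rests on several assumptions: frames and nodes are single numbers, the score is an accumulated cost (so that a poor probability score corresponds to a high cost), and the threshold is a fixed bound on that cost, whereas a practical system might instead set the threshold relative to the best-scoring word. The name `pruned_align_cost` is hypothetical.

```python
INF = float("inf")


def pruned_align_cost(frames, nodes, threshold):
    """Align frames against a word's nodes, abandoning the word as soon as
    even its best partial cost exceeds `threshold`.

    Returns (cost, frames_examined); cost is None if the word was pruned.
    """
    if not nodes:
        raise ValueError("a word model needs at least one node")
    best = [INF] * len(nodes)
    for t, frame in enumerate(frames):
        new = [INF] * len(nodes)
        for j, node in enumerate(nodes):
            if t == 0:
                prev = 0.0 if j == 0 else INF  # alignment must start in node 0
            else:
                prev = min(best[j], best[j - 1] if j > 0 else INF)
            if prev < INF:
                new[j] = prev + abs(frame - node)
        best = new
        if min(best) > threshold:
            return None, t + 1  # pruned after examining t + 1 frames
    return best[-1], len(frames)


if __name__ == "__main__":
    speech = [0.0, 0.0, 5.0, 5.0]
    print(pruned_align_cost(speech, [0.0, 5.0], threshold=3.0))  # survives
    print(pruned_align_cost(speech, [9.0, 9.0], threshold=3.0))  # pruned early
```

In the second call the poorly matching word is abandoned after a single frame, which is the saving that pruning is intended to provide.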
Continuous speech computational requirements are even greater. In continuous speech, the type which humans normally speak, words are run together, without pauses or other simple cues to indicate where one word ends and the next begins. When a mechanical speech recognition system attempts to recognize continuous speech, it initially has no way of identifying those portions of speech which correspond to individual words. Speakers of English apply a host of duration and coarticulation rules when combining phonemes into words and sentences, employing the same rules in recognizing spoken language. A speaker of English, given a phonemic spelling of an unfamiliar word from a dictionary, can pronounce the word recognizably or recognize the word when it is spoken. On the other hand, it is impossible to put together an "alphabet" of recorded phonemes which, when concatenated, will sound like natural English words. It comes as a surprise to most speakers, for example, to discover that the vowels in "will" and "kick", which are identical according to dictionary pronunciations, are as different in their spectral characteristics as the vowels in "not" and "nut", or that the vowel in "size" has more than twice the duration of the same vowel in "seismograph".
One approach to this problem of recognizing discrete words in continuous speech is to treat each successive frame of the speech as the possible beginning of a new word, and to begin dynamic programming at each such frame against the start of each vocabulary word. However, this approach requires a tremendous amount of computation. A more efficient method used in the prior art begins dynamic programming against new words only at those frames for which the dynamic programming indicates that the speaking of a previous word has just ended. Although this latter method is a considerable improvement, there remains a need to further reduce computation by reducing the number of words against which dynamic programming is started when there is indication that a prior word has ended.
One such method of reducing the number of vocabulary words against which dynamic programming is started in continuous speech recognition associates a phonetic label with each frame of the speech to be recognized. The phonetic label identifies which one of a plurality of phonetic frame models compares most closely to a given frame of speech. The system then divides the speech into segments of successive frames associated with a single phonetic label. For each given segment, the system takes the sequence of five phonetic labels associated with that segment plus the next four segments, and refers to a look-up table to find the set of vocabulary words which previously have been determined to have a reasonable probability of starting with that sequence of phonetic labels. As referred to above, this is known as a "wordstart cluster". The system then limits the words against which dynamic programming could start in the given segment to words in that cluster or set.
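The labelling, segmentation, and table look-up steps described above may be sketched as follows. This is a simplified illustration: the one-dimensional frames, the phonetic models reduced to a single prototype value each, the contents of the look-up table, and the names `label_frames`, `segment_labels`, and `candidate_words` are assumptions made for the example only.

```python
from itertools import groupby


def label_frames(frames, phonetic_models):
    """Give each frame the label of the phonetic frame model closest to it."""
    def closest(frame):
        return min(phonetic_models, key=lambda lab: abs(frame - phonetic_models[lab]))
    return [closest(frame) for frame in frames]


def segment_labels(labels):
    """Collapse each run of successive identical labels into one segment."""
    return [label for label, _run in groupby(labels)]


def candidate_words(segments, index, table, span=5):
    """Look up the words that may start at segment `index`, keyed by the
    labels of that segment and the following span - 1 segments."""
    key = tuple(segments[index:index + span])
    return table.get(key, set())


if __name__ == "__main__":
    models = {"s": 0.0, "ih": 5.0, "k": 10.0}        # hypothetical prototypes
    frames = [0.1, 0.2, 4.9, 5.2, 9.8]
    segments = segment_labels(label_frames(frames, models))
    table = {("s", "ih", "k"): {"sick", "sickle"}}   # hypothetical table
    print(segments, candidate_words(segments, 0, table, span=3))
```

The usage example uses a span of three labels for brevity; the method described above uses five. Dynamic programming would then be started only against the words returned by the look-up.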
A method for handling continuous speech recognition is described in U.S. Pat. No. 4,805,219 (Baker et al., Feb. 14, 1989), which is assigned to the assignee of this application, and which is herein incorporated by reference. In that patent, both the speech to be recognized and a plurality of speech pattern models are time-aligned against a common time-aligning model. The resulting time-aligned speech model is then compared against each of the resulting time-aligned pattern models. The time-alignment against a common time-alignment model causes the comparisons between the speech model and each of the pattern models to compensate for variations in the rate at which the portion of speech is spoken, without requiring each portion of speech to be separately time-aligned against each pattern model.
One method of continuous speech recognition is described in U.S. Pat. No. 4,803,729, cited above. In that patent, once the speech to be recognized is converted into a sequence of acoustic frames, the next step consists of "smooth frame labelling". This smooth frame labelling method associates a phonetic frame label with each frame of the speech to be labelled as a function of: (1) the closeness with which the given frame compares to each of a plurality of the acoustic phonetic frame models; (2) an indication of which one or more of the phonetic frame models most probably correspond with the frames which precede and follow the given frame, and; (3) the transition probability which indicates for the phonetic models associated with those neighboring frames which phonetic models are most likely associated with the given frame.
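The effect of taking neighboring frames and transition probabilities into account may be illustrated with a standard Viterbi search, which is used here as a stand-in and is not asserted to be the procedure of the cited patent. The sketch assumes that costs are negative log probabilities, that every frame supplies a cost for every label, and that a transition cost is given for every ordered pair of labels; the name `viterbi_labels` is hypothetical.

```python
def viterbi_labels(frame_costs, transition_costs):
    """Choose one label per frame so that the sum of frame costs and
    label-to-label transition costs over the whole sequence is lowest.

    frame_costs: list of {label: cost}, one dict per frame.
    transition_costs: {(previous_label, label): cost} for every label pair.
    """
    if not frame_costs:
        return []
    labels = list(frame_costs[0])
    best = {lab: frame_costs[0][lab] for lab in labels}
    back = []  # back[t][lab]: best predecessor of lab at frame t + 1
    for costs in frame_costs[1:]:
        new, pointers = {}, {}
        for cur in labels:
            prev = min(labels, key=lambda p: best[p] + transition_costs[(p, cur)])
            new[cur] = best[prev] + transition_costs[(prev, cur)] + costs[cur]
            pointers[cur] = prev
        best = new
        back.append(pointers)
    # Trace back from the best final label to recover the full sequence.
    path = [min(labels, key=lambda lab: best[lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    stay, switch = 0.0, 2.0
    trans = {("a", "a"): stay, ("b", "b"): stay, ("a", "b"): switch, ("b", "a"): switch}
    # Taken alone, the middle frame slightly prefers "b".
    costs = [{"a": 0.0, "b": 3.0}, {"a": 1.5, "b": 1.0}, {"a": 0.0, "b": 3.0}]
    print(viterbi_labels(costs, trans))
```

In the usage example the middle frame, considered in isolation, would be labelled "b", but the transition costs cause it to be labelled consistently with its neighbors.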
Up to this time, no pre-filtering system has been implemented which provides the desired speed and accuracy in a large vocabulary continuous speech recognition system. Thus, there remains a need for an improved continuous speech recognition system which rapidly and accurately recognizes words contained in a sequence of continuous utterances.
It is thus an object of the present invention to provide a continuous speech pre-filtering system for use in a continuous speech recognition computer system.