It is a constant endeavor to find ways of speeding up the recognition of acoustic signals. A common tactic in large vocabulary automatic speech recognition systems is to quickly provide a short list of about fifty likely candidate words from a vocabulary consisting of several thousand words. Subsequently, detailed models for each of the words in this short list are used to match and score the word against the acoustic signal. The sequence of words with the highest score is chosen. The process of determining the short list, called the fast match or FM, reduces the hypotheses investigated by the detailed acoustic models to a manageable number. It thereby results in a speech decoding system which operates with acceptable speed.
An efficient representation of the vocabulary in terms of context-independent phonemes is a tree structure. FIG. 1 shows a sample vocabulary consisting of just five words with their phoneme constituents. The vocabulary is represented graphically in the form of a tree shown in FIG. 2. FIG. 2 starts with a root node 202 which corresponds to the beginning of an utterance. The tree is formed by tying together in a single node all phonemes shared from the beginning of the word. A word ending is referred to as a leaf 208, 212, 216, 222 and 226. In the example, there are two distinct phonemes, `AE` 204 and `IH` 218, which can start a word. Coincidentally, both of these have the same set of potential successors, `N` 206/220 and `T` 214/224. The sequence `AEN` may be followed by a `T` 210. Traversing the tree from the root 202 to one of the leaves spells out the word in the vocabulary indicated at that leaf. Use of this tree structure for speech recognition provides a representation of the complete vocabulary in accordance with the phoneme sequences that constitute each particular word in that vocabulary.
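The tree construction described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation of the described system: the phoneme paths follow FIG. 2, but the word labels (`an`, `ant`, `at`, `in`, `it`) are hypothetical stand-ins for the five words of FIG. 1, and the names `build_tree`, `count_nodes` and `lookup` are invented for this example.

```python
def new_node():
    # "word" stays None unless a vocabulary word ends at this node (a leaf).
    return {"children": {}, "word": None}


def build_tree(lexicon):
    """Build a phoneme prefix tree from a {word: [phonemes]} mapping."""
    root = new_node()  # corresponds to the beginning of an utterance
    for word, phonemes in lexicon.items():
        node = root
        for phoneme in phonemes:
            # Words sharing a phoneme prefix share the same nodes.
            node = node["children"].setdefault(phoneme, new_node())
        node["word"] = word
    return root


def count_nodes(node):
    """Count phoneme nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node["children"].values())


def lookup(root, phonemes):
    """Follow a phoneme sequence from the root; return the word ending there."""
    node = root
    for phoneme in phonemes:
        node = node["children"].get(phoneme)
        if node is None:
            return None
    return node["word"]


# Hypothetical word labels; only the phoneme paths are taken from FIG. 2.
LEXICON = {
    "an": ["AE", "N"],
    "ant": ["AE", "N", "T"],
    "at": ["AE", "T"],
    "in": ["IH", "N"],
    "it": ["IH", "T"],
}

tree = build_tree(LEXICON)
print(count_nodes(tree))               # 7 shared nodes for 11 phonemes in total
print(lookup(tree, ["AE", "N", "T"]))  # ant
```

Sharing the prefix `AE N` between two words is what lets a search score those phonemes once for both words.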
In one recognition system, each node of the fast match tree is expanded into a set of three states 301-303, as shown in FIG. 3. State three 303 has a self-loop 304. The omission of self-loops on the first and second states 301, 302 greatly reduces the number of possible paths through the model. This results in a faster search than with a model having self-loops on each of the states. Self-loops are generally employed in all states of the detailed match. This topology enforces a minimum stay of three frames in each phoneme, which has been found to be highly desirable.
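The effect of omitting self-loops can be checked by counting state paths. The sketch below assumes a strictly left-to-right model with no skip transitions, one frame consumed per state visit, and paths that must start in the first state and end in the last; the function name `count_paths` is hypothetical.

```python
def count_paths(num_frames, self_loops):
    """Count state sequences that start in state 0, end in the last state,
    and consume exactly `num_frames` frames (one frame per state visit).

    self_loops[i] is True when state i is allowed to repeat.
    """
    n = len(self_loops)
    ways = [0] * n
    ways[0] = 1  # the first frame is always spent in the first state
    for _ in range(num_frames - 1):
        nxt = [0] * n
        for s in range(n):
            if self_loops[s]:
                nxt[s] += ways[s]      # stay in the same state
            if s + 1 < n:
                nxt[s + 1] += ways[s]  # advance to the next state
        ways = nxt
    return ways[-1]


FAST_MATCH = [False, False, True]  # self-loop on the third state only
ALL_LOOPS = [True, True, True]     # self-loops on every state

print(count_paths(2, FAST_MATCH))   # 0: fewer than three frames is impossible
print(count_paths(10, FAST_MATCH))  # 1: a single alignment for any duration
print(count_paths(10, ALL_LOOPS))   # 36: many alignments to search
```

Under these assumptions the fast match topology leaves exactly one alignment per phoneme duration, whereas self-loops on all three states give a number of alignments that grows quadratically with the duration.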
The flow of an existing recognition system is shown in FIG. 4. From the process start 402, the system goes into a wait state 404 and waits for an utterance. In the wait state 404 the system recognizes silence, and forms a first candidate list corresponding to silence. Thus, the first candidate list has only one entry, which is silence. Generally, each word in a candidate list has an associated probability of occurrence. The probability is related to the relative frequency of occurrence of the word in the use of the language. When an utterance is received, a detailed match of each entry in the candidate list is performed, and a probability distribution of possible ending times of that word is computed, 408. In the first iteration the only entry is silence. After each detailed match, a determination is made whether the end of the utterance has been reached, 410. If not, this distribution is used to perform a next fast match computation 412, and a next candidate list of words to follow the current most likely word is computed, 414. Again, a likelihood score for each of these new candidate words is computed by the detailed match, along with its corresponding end-time probability distribution, 408. If the utterance has still not ended, 410, this distribution is used in a next fast match computation 412 to determine a next set of candidate words, 414. This is continued until the end of the utterance is sensed, 410. The resulting decoded utterance is displayed and/or stored, 416, the recognition process for that utterance is completed, and the process stops, 418.
Examination of the method just described reveals that each detailed match results in a subsequent fast match search for the next set of candidates to be evaluated. A fast match search is performed in the time region where the detailed match hypothesizes a word ending. When multiple detailed matches end in the same time region, the method often repeats a fast match computation on the same data. This redundancy wastes computing resources and time. Thus, one aspect of this invention is to modify the recognition method from a process which alternates between a detailed match and a fast match, to a process which computes all fast match candidate lists for an entire utterance and stores them in a table. The table is subsequently accessed for look-up by a contiguous and complete detailed match phase. This detailed match phase is implemented only after all of the fast match computations have been completed.
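The difference between the two flows can be illustrated with a toy count of fast match invocations. Everything in this sketch is an assumption made for illustration: `fast_match` is a hypothetical stand-in that only records the frame at which it was started, the word-ending frames are invented, and the table is filled for every frame of a short utterance. The actual saving depends on how many detailed matches end in the same time region.

```python
from collections import Counter


def fast_match(start_frame, calls):
    """Hypothetical stand-in for a fast match started at `start_frame`."""
    calls[start_frame] += 1
    return ["candidate_after_frame_%d" % start_frame]


def alternating_flow(word_end_frames):
    """One fast match per word ending hypothesized by a detailed match."""
    calls = Counter()
    for end in word_end_frames:
        fast_match(end, calls)
    return calls


def table_flow(num_frames, word_end_frames):
    """All fast matches first, stored in a table; look-ups only afterwards."""
    calls = Counter()
    table = [fast_match(t, calls) for t in range(num_frames)]
    looked_up = [table[end] for end in word_end_frames]
    return calls, looked_up


# Invented example: ten detailed matches ending within frames 3 to 5.
ENDS = [3, 3, 4, 3, 4, 5, 4, 3, 5, 4]

alt_calls = alternating_flow(ENDS)
tab_calls, lists = table_flow(6, ENDS)
print(sum(alt_calls.values()), max(alt_calls.values()))  # 10 4
print(sum(tab_calls.values()), max(tab_calls.values()))  # 6 1
```

In the alternating flow the fast match at frame 3 is computed four times on the same data; in the table flow each frame is processed once and the detailed match phase only reads the stored lists.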
Another aspect of this invention is a simplification of the phoneme topology so as to reduce the number of states in the search procedure. A minimum stay of three frames in each graph node visited is imposed by allowing transitions only every third frame. This constraint enables the simplest possible Markov model for each phoneme.
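A minimal sketch of a search that permits transitions only every third frame is given below. It is an illustration under several assumptions and not the claimed method: each phoneme is a single state, the graph is a linear chain rather than a tree, per-frame log-likelihoods are supplied by the caller, and trailing frames that do not fill a complete triplet are ignored. The name `triplet_viterbi` is hypothetical.

```python
NEG_INF = float("-inf")


def triplet_viterbi(frame_loglik, num_nodes):
    """Best log score per node after the last complete frame triplet.

    frame_loglik[t][k] is the log-likelihood of frame t under node k.
    Nodes form a chain 0 -> 1 -> ... ; a path must start in node 0.
    """
    score = [NEG_INF] * num_nodes
    for j in range(len(frame_loglik) // 3):
        frames = frame_loglik[3 * j:3 * j + 3]
        # Score of spending the whole triplet in node k.
        block = [sum(frame[k] for frame in frames) for k in range(num_nodes)]
        if j == 0:
            new_score = [NEG_INF] * num_nodes
            new_score[0] = block[0]
        else:
            new_score = []
            for k in range(num_nodes):
                stay = score[k]
                enter = score[k - 1] if k > 0 else NEG_INF
                # A transition is considered only at a triplet boundary.
                new_score.append(max(stay, enter) + block[k])
        score = new_score
    return score


# Invented scores: node 0 fits frames 0-2, node 1 fits frames 3-5.
LOGLIK = [[-1, -5]] * 3 + [[-5, -1]] * 3
print(triplet_viterbi(LOGLIK, 2))  # [-18, -6]
```

Because a node can be entered or left only at a triplet boundary, every visited node is occupied for at least three frames without needing three separate states per phoneme.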
For the purposes of this invention, the following definitions apply:
active list: list of nodes whose log score is within a user-defined parameter D of the log of the highest scoring node at the current time.
potentials list: list of nodes potentially active at the next time frame given the set of currently active nodes. The potentials list includes the set of currently active nodes and their successors in the fast match graph.
next active list: set of nodes in the potentials list whose score is sufficiently high to allow the node to be included in the active list at the next time triplet.
contender list: is often called "acoustic fast match list".
candidate list: is often called "final fast match list".
utterance: a string of acoustic signals (words) to be decoded for speech recognition. It is most often a `sentence`, although it need not be a complete sentence.
active node: a node whose score is within a user-defined range parameter `D` of the highest scoring node at that time.
phoneme: an individual speech sound; the building blocks of words. For example, the word "she" is comprised of 2 phonemes: SH and IY. Often, one can think of phonemes as the constituent entries in the pronunciation of a word in the dictionary.
candidate: a vocabulary word, or group of words, resulting from the fast match algorithm, which is/are possibilities for the words forming the acoustic signal. Each candidate is evaluated by the detailed match phase of decoding, in order to find the best scoring string of words which match the acoustics being recognized.
hypothesis: same as candidate.
triangular window: weighting function shaped like a triangle when plotted. This is used to combine lists associated with a range of times into a single list. For example, the weights 0.1 0.6 1.1 0.6 0.1 applied to 5 consecutive lists of words would multiply the scores in the first and fifth list by 0.1, in the second and fourth list by 0.6, and in the third list by 1.1. This would give preference to words in the third list by increasing their scores while decreasing the scores of words in the other lists. Similarly, the second and fourth lists would be given preference over the first and fifth lists.
call: refers to a function call in a computer program. Call to DM means executing the detailed match function. Since the program iterates, the detailed match function is executed many times.
next call: means the next time the function is executed.
range (D): a contender and/or candidate inclusion range parameter, relating the log of the candidate/node score to the log of the score of the node with the maximum score at a particular time. `D` is typically between 10 and 17. A small value results in a faster, but potentially more error-prone, recognition process.
test sentence: refers to a chunk of speech (said by a speaker) upon which the recognition process is being implemented.
beam: refers to the set of nodes which have a sufficiently high score. It is often computed by finding the log of the score of the highest scoring node `Nmax` at each time, and subtracting a constant (range parameter) D from it. Any node with a score with a log greater than `Nmax` minus `D` is said to be in the beam.
beam search: a method wherein evaluations are performed only for those nodes which are possible followers to nodes with a sufficiently high score, rather than evaluating the score for every node in the node versus time matrix.
Viterbi score: scores are either `unnormalized` or `normalized`. Unnormalized scores get smaller as time goes on, so it is not meaningful to compare the scores of different times. Therefore the scores are normalized to enable comparisons of scores at different times. A normalized Viterbi score refers to the scores of the nodes at a given time in a matrix of scores. Normalization of the score of a node `N` at any time may be implemented by taking the difference in unnormalized scores between the best-scoring node `Nmax` (at that time) and the unnormalized score of node `N`. These normalized scores can be meaningfully compared at different times.
time frame: a particular time duration, usually 10 milliseconds. It can also be understood as a `point in time`.
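Three of the definitions above (beam, normalized Viterbi score and triangular window) lend themselves to a short Python sketch. It rests on assumptions not fixed by the definitions: scores are held in dictionaries, the normalized score uses the sign convention best minus node, and a word appearing in several lists keeps its highest weighted score. The function names and the sample scores are hypothetical; the weights are the ones given in the triangular window definition.

```python
def in_beam(log_scores, d):
    """Nodes whose log score is greater than the best log score minus D."""
    best = max(log_scores.values())
    return {node for node, s in log_scores.items() if s > best - d}


def normalize(log_scores):
    """Difference between the best node's score and each node's score.

    Assumed sign convention: the best node gets 0, worse nodes get
    larger values.
    """
    best = max(log_scores.values())
    return {node: best - s for node, s in log_scores.items()}


def combine_lists(word_lists, weights):
    """Merge consecutive word lists after weighting their scores.

    Assumption: a word found in several lists keeps its highest
    weighted score; the definitions do not specify this.
    """
    combined = {}
    for words, weight in zip(word_lists, weights):
        for word, score in words.items():
            weighted = weight * score
            if word not in combined or weighted > combined[word]:
                combined[word] = weighted
    return combined


# Invented log scores for three nodes at one time frame.
SCORES = {"a": -2.0, "b": -10.0, "c": -20.0}
print(sorted(in_beam(SCORES, 12)))  # ['a', 'b']
print(normalize(SCORES))            # {'a': 0.0, 'b': 8.0, 'c': 18.0}

# Weights from the definition; word scores are invented.
WEIGHTS = [0.1, 0.6, 1.1, 0.6, 0.1]
LISTS = [{"a": 1.0}, {"b": 1.0}, {"c": 1.0, "a": 1.0}, {"d": 1.0}, {"e": 1.0}]
merged = combine_lists(LISTS, WEIGHTS)
print(sorted(merged, key=merged.get, reverse=True)[:2])  # ['a', 'c']
```

With `D` set to 12, node `c` falls outside the beam because its log score is 18 below the best; a larger `D` would keep it active at the cost of more computation.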