The present invention relates to computer speech recognition. More particularly, the present invention relates to computer speech recognition performed by conducting a prefix tree search of a silence bracketed lexicon.
The most successful current speech recognition systems employ probabilistic models known as hidden Markov models (HMMs). A hidden Markov model includes a plurality of states, wherein a transition probability is defined for each transition from each state to every state, including transitions to the same state. An observation is probabilistically associated with each unique state. The transition probabilities between states (the probabilities of moving from one state to the next) are not all the same. Therefore, a search technique, such as a Viterbi algorithm, is employed in order to determine a most likely state sequence for which the overall probability is maximum, given the transition probabilities between states and the observation probabilities.
A sequence of state transitions can be represented, in a known manner, as a path through a trellis diagram that represents all of the states of the HMM over a sequence of observation times. Therefore, given an observation sequence, a most likely path through the trellis diagram (i.e., the most likely sequence of states represented by an HMM) can be determined using a Viterbi algorithm.
In current speech recognition systems, speech has been viewed as being generated by a hidden Markov process. Consequently, HMMs have been employed to model observed sequences of speech spectra, where specific spectra are probabilistically associated with a state in an HMM. In other words, for a given observed sequence of speech spectra, there is a most likely sequence of states in a corresponding HMM.
This corresponding HMM is thus associated with the observed sequence. This technique can be extended, such that if each distinct sequence of states in the HMM is associated with a sub-word unit, such as a phoneme, then a most likely sequence of sub-word units can be found. Moreover, by using models of how sub-word units are combined to form words, and language models of how words are combined to form sentences, complete speech recognition can be achieved.
When actually processing an acoustic signal, the signal is typically sampled in sequential time intervals called frames. The frames typically include a plurality of samples and may overlap or be contiguous. Each frame is associated with a unique portion of the speech signal. The portion of the speech signal represented by each frame is analyzed to provide a corresponding acoustic vector. During speech recognition, a search is performed for the state sequence most likely to be associated with the sequence of acoustic vectors.
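By way of illustration only, the division of a sampled signal into overlapping or contiguous frames can be sketched as follows. The function name split_into_frames and the parameters frame_len and hop are assumptions made for this example and are not taken from any particular system. The computation of an acoustic vector from each frame is omitted.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Slice a sampled signal into fixed-length frames.

    hop < frame_len yields overlapping frames; hop == frame_len yields
    contiguous frames. Trailing samples that do not fill a whole frame
    are dropped (an assumption of this sketch).
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += hop
    return frames


# Toy "signal" of ten samples (hypothetical data).
signal = list(range(10))
overlapping = split_into_frames(signal, frame_len=4, hop=2)
contiguous = split_into_frames(signal, frame_len=5, hop=5)
```

In this sketch each frame covers a distinct portion of the signal even when adjacent frames share samples.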
In order to find the most likely sequence of states corresponding to a sequence of acoustic vectors, the Viterbi algorithm is employed. The Viterbi algorithm performs a computation which starts at the first frame and proceeds one frame at a time, in a time-synchronous manner. A probability score is computed for each state in the state sequences (i.e., the HMMs) being considered. Therefore, a cumulative probability score is successively computed for each of the possible state sequences as the Viterbi algorithm analyzes the acoustic signal frame by frame. By the end of an utterance, the state sequence (or HMM or series of HMMs) having the highest probability score computed by the Viterbi algorithm provides the most likely state sequence for the entire utterance. The most likely state sequence is then converted into a corresponding spoken subword unit, word, or word sequence.
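The time-synchronous computation described above can be illustrated by the following minimal sketch of a Viterbi search over a small discrete HMM. The two states, the observation symbols, and all probability values are hypothetical and chosen only for the example. Log probabilities are used so that scores are added rather than multiplied, which is a common implementation choice and an assumption of this sketch.

```python
import math
from typing import Dict, List, Sequence, Tuple

LogTable = Dict[str, Dict[str, float]]


def viterbi(observations: Sequence[str], states: Sequence[str],
            log_start: Dict[str, float], log_trans: LogTable,
            log_emit: LogTable) -> Tuple[List[str], float]:
    """Return the most likely state sequence and its log probability."""
    if not observations:
        raise ValueError("at least one observation is required")
    # scores[s]: best log score of any path that ends in state s at the current frame
    scores = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:          # proceed one frame at a time
        new_scores, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: scores[p] + log_trans[p][s])
            new_scores[s] = scores[best_prev] + log_trans[best_prev][s] + log_emit[s][obs]
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full sequence.
    last = max(states, key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[last]


def logs(table: Dict[str, float]) -> Dict[str, float]:
    """Convert a table of probabilities to log probabilities."""
    return {k: math.log(v) for k, v in table.items()}


# Hypothetical two-state model.
states = ["A", "B"]
log_start = logs({"A": 0.6, "B": 0.4})
log_trans = {"A": logs({"A": 0.7, "B": 0.3}), "B": logs({"A": 0.4, "B": 0.6})}
log_emit = {"A": logs({"x": 0.9, "y": 0.1}), "B": logs({"x": 0.2, "y": 0.8})}

path, log_score = viterbi(["x", "x", "y", "y"], states, log_start, log_trans, log_emit)
```

For this toy model the search returns the state sequence A, A, B, B. In a recognizer the discrete symbols would be replaced by acoustic vectors and the states by the states of the phonetic HMMs.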
The Viterbi algorithm reduces an exponential computation to one that is proportional to the number of states and transitions in the model and the length of the utterance. However, for a large vocabulary, the number of states and transitions becomes large and the computation required to update the probability score at each state in each frame for all possible state sequences takes many times longer than the duration of one frame, which is typically approximately 10 milliseconds.
Thus, a technique called pruning, or beam searching, has been developed to greatly reduce computation needed to determine the most likely state sequence. This type of technique eliminates the need to compute the probability score for state sequences that are very unlikely. This is typically accomplished by comparing, at each frame, the probability score for each remaining state sequence (or potential sequence) under consideration with the largest score associated with that frame. If the probability score of a state for a particular potential sequence is sufficiently low (when compared to the maximum computed probability score for the other potential sequences at that point in time) the pruning algorithm assumes that it will be unlikely that such a low scoring state sequence will be part of the completed, most likely state sequence. The comparison is typically accomplished using a minimum threshold value. Potential state sequences having a score that falls below the minimum threshold value are removed from the searching process. The threshold value can be set at any desired level, based primarily on the desired memory and computational savings and on the increase in error rate that can be tolerated in exchange for those savings.
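The per-frame comparison against a threshold can be sketched as follows. The function name prune, the parameter beam_width, and the example hypotheses and scores are hypothetical. The sketch assumes log-domain scores, so that the minimum threshold is the best score in the frame minus a fixed beam width.

```python
from typing import Dict


def prune(scores: Dict[str, float], beam_width: float) -> Dict[str, float]:
    """Keep only hypotheses whose log score is within beam_width of the best.

    scores maps a hypothesis label to its cumulative log score at the
    current frame. Hypotheses below the threshold are dropped from the search.
    """
    if not scores:
        return {}
    threshold = max(scores.values()) - beam_width   # minimum threshold for this frame
    return {hyp: s for hyp, s in scores.items() if s >= threshold}


# Hypothetical cumulative log scores for four partial hypotheses at one frame.
frame_scores = {"k ae": -12.0, "k aa": -14.5, "t ae": -31.0, "d ao": -40.2}
kept = prune(frame_scores, beam_width=10.0)
```

A wider beam retains more hypotheses and costs more computation and memory, while a narrower beam increases the risk that the correct hypothesis is discarded.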
Another conventional technique for further reducing the magnitude of computation required for speech recognition includes the use of a prefix tree. A prefix tree represents the lexicon of the speech recognition system as a tree structure wherein all of the words likely to be encountered by the system are represented in the tree structure.
In such a prefix tree, each subword unit (such as a phoneme) is typically represented by a branch which is associated with a particular phonetic model (such as an HMM). The phoneme branches are connected, at nodes, to subsequent phoneme branches. All words in the lexicon which share the same first phoneme share the same first branch. All words which have the same first and second phonemes share the same first and second branches. By contrast, words which have a common first phoneme, but which have different second phonemes, share the same first branch in the prefix tree but have second branches which diverge at the first node in the prefix tree, and so on. The tree structure continues in such a fashion such that all words likely to be encountered by the system are represented by the end nodes of the tree (i.e., the leaves on the tree).
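A minimal sketch of such a tree is given below. The class name Node, the function names, and the four-word lexicon with its phoneme spellings are hypothetical and serve only to illustrate how words with common initial phonemes share branches. For simplicity the sketch stores a word identity on the node where its phoneme sequence ends, which may be an internal node if one word is a prefix of another, and it attaches no phonetic model to the branches.

```python
from typing import Dict, List, Optional


class Node:
    """A node of the prefix tree; each outgoing branch is labeled with a phoneme."""

    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}   # phoneme -> node reached via that branch
        self.word: Optional[str] = None         # word ending at this node, if any


def build_prefix_tree(lexicon: Dict[str, List[str]]) -> Node:
    """Insert every pronunciation, reusing branches for shared prefixes."""
    root = Node()
    for word, phonemes in lexicon.items():
        node = root
        for ph in phonemes:
            node = node.children.setdefault(ph, Node())
        node.word = word
    return root


def count_branches(node: Node) -> int:
    """Total number of phoneme branches in the tree below node."""
    return sum(1 + count_branches(child) for child in node.children.values())


# Hypothetical lexicon with illustrative phoneme labels.
lexicon = {
    "cat": ["k", "ae", "t"],
    "cab": ["k", "ae", "b"],
    "cot": ["k", "aa", "t"],
    "dog": ["d", "ao", "g"],
}
root = build_prefix_tree(lexicon)
initial_branches = sorted(root.children)
total_branches = count_branches(root)
```

In this example the four words contain twelve phonemes in total, but the tree has only two initial branches and nine branches overall, because "cat", "cab" and "cot" share their first branch and "cat" and "cab" also share their second.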
It is apparent that, by employing a prefix tree structure, the number of initial branches will be far fewer than the typical number of words in the lexicon or vocabulary of the system. In fact, the number of initial branches cannot exceed the total number of phonemes (approximately 40-50), regardless of the size of the vocabulary or lexicon being searched. If allophonic variations are used, however, the initial number of branches could be large, depending on the allophones used.
This type of structure lends itself to a number of significant advantages. For example, given the small number of initial branches in the tree, it is possible to consider the beginning of all words in the lexicon, even if the vocabulary is very large, by evaluating the probability of each of the possible first phonemes. Further, using pruning, a number of the lower probability phoneme branches can be eliminated very early in the search. Therefore, while the second level of the tree has many more branches than the first level, the number of branches which are actually being considered (i.e., the number of hypotheses) is also reduced relative to the number of possible branches.
Speech recognition systems employing the above techniques can typically be classified in two types. The first type is a continuous speech recognition (CSR) system which is capable of recognizing fluent speech. The second type of system is an isolated speech recognition (ISR) system which is typically employed to recognize only isolated speech (or discrete speech), but which is also typically more accurate and efficient than continuous speech recognition systems because the search space is generally smaller. Also, isolated speech recognition systems have been thought of as a special case of continuous speech recognition, because continuous speech recognition systems generally can accept isolated speech as well. Continuous speech recognition systems simply do not perform as well as isolated speech recognition systems when attempting to recognize isolated speech.
Silence information plays a role in both systems. To date, both types of speech recognition systems have treated silence as a special word in the lexicon. The silence word participates in the normal search process so that it can be inserted between words as it is recognized.
However, it is known that considering word transitions in a speech recognition system is a computationally intensive and costly process. Therefore, in an isolated speech recognition system in which silence is treated as a separate word, the transition from the silence word to all other words in the lexicon must be considered, as well as the transition from all words in the lexicon (or all remaining words at the end of the search) to the silence word.
Further, in continuous speech recognition systems, even if the system has identified that the speaker is speaking discretely, or in an isolated fashion, the CSR system still considers hypotheses which do not have silence between words. This leads to a tendency to improperly break one word into two or more words. Of course, this results in a higher error rate than would otherwise be expected. Moreover, it is computationally inefficient since it still covers part of the search space which belongs to continuous speech but not isolated speech.
In addition to employing the silence phone as a separate word in the lexicon, conventional modeling of the silence phone has also led to problems and errors in prior speech recognition systems. It is widely believed that silence is independent of context. Thus, silence has been modeled in conventional speech recognition systems regardless of context. In other words, the silence phone has been modeled the same, regardless of the words or subword units that precede or follow it. This not only decreases the accuracy of the speech recognition system, but also renders it less efficient than it could be with modeling in accordance with the present invention.
A speech recognition system recognizes speech based on an input data stream indicative of the speech. Possible words represented by the input data stream are provided as a prefix tree including a plurality of phoneme branches connected at nodes. The plurality of phoneme branches are bracketed by at least one input silence branch corresponding to a silence phone on an input side of the prefix tree and at least one output silence branch corresponding to a silence phone on an output side of the prefix tree.
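One possible way to picture such a silence bracketed structure is sketched below using nested dictionaries. This is an illustrative assumption and not the structure of any particular embodiment. The label "sil" for the silence phone, the "#word" key used to store a word identity, the function names, and the two-word lexicon are all hypothetical. The sketch uses a single input silence branch and one output silence branch per word.

```python
from typing import Dict, Iterator, List, Tuple

SIL = "sil"        # hypothetical label for the silence phone
WORD = "#word"     # hypothetical key holding the word identity at the end of a path


def build_silence_bracketed_tree(lexicon: Dict[str, List[str]]) -> dict:
    """Build a prefix tree whose phoneme branches lie between silence branches."""
    root: dict = {}
    entry = root.setdefault(SIL, {})            # input-side silence branch
    for word, phonemes in lexicon.items():
        node = entry
        for ph in phonemes:                     # shared prefixes reuse existing branches
            node = node.setdefault(ph, {})
        node.setdefault(SIL, {})[WORD] = word   # output-side silence branch
    return root


def paths(tree: dict, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[Tuple[str, ...], str]]:
    """Yield (branch label sequence, word) for every complete path in the tree."""
    for label, child in tree.items():
        if label == WORD:
            yield prefix, child
        else:
            yield from paths(child, prefix + (label,))


# Hypothetical lexicon with illustrative phoneme labels.
lexicon = {"cat": ["k", "ae", "t"], "cab": ["k", "ae", "b"]}
tree = build_silence_bracketed_tree(lexicon)
all_paths = dict(paths(tree))
```

In this sketch every path from the root to a word begins and ends with a silence branch, so silence need not be handled as a separate word with its own word transitions.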
In one preferred embodiment, a plurality of silence branches are provided in the prefix tree. The plurality of silence branches represent context dependent silence phones.
In another preferred embodiment of the present invention, the speech recognition system includes both a continuous speech recognition system lexicon, and an isolated speech recognition system lexicon. The system switches between using the CSR lexicon and the ISR lexicon based upon a type of speech then being employed by the user of the system.