The present invention relates to computer speech recognition. More particularly, the present invention relates to a method of recognizing both continuous and isolated speech.
The most successful current speech recognition systems employ probabilistic models known as hidden Markov models (HMMs). A hidden Markov model includes a plurality of states, wherein a transition probability is defined for each transition from each state to every other state, including transitions to the same state. An observation is probabilistically associated with each unique state. The transition probabilities between states (the probabilities that the model will move from one state to the next) are not all the same. Therefore, a search technique, such as a Viterbi algorithm, is employed in order to determine a most likely state sequence for which the overall probability is maximum, given the transition probabilities between states and observation probabilities.
In current speech recognition systems, speech has been viewed as being generated by a hidden Markov process. Consequently, HMMs have been employed to model observed sequences of speech spectra, where specific spectra are probabilistically associated with a state in an HMM. In other words, for a given observed sequence of speech spectra, there is a most likely sequence of states in a corresponding HMM.
This corresponding HMM is thus associated with the observed sequence. This technique can be extended, such that if each distinct sequence of states in the HMM is associated with a sub-word unit, such as a phoneme, then a most likely sequence of sub-word units can be found. Moreover, using models of how sub-word units are combined to form words, then using language models of how words are combined to form sentences, complete speech recognition can be achieved.
When actually processing an acoustic signal, the signal is typically sampled in sequential time intervals called frames. The frames typically include a plurality of samples and may overlap or be contiguous. Each frame is associated with a unique portion of the speech signal. The portion of the speech signal represented by each frame is analyzed to provide a corresponding acoustic vector. During speech recognition, a search of speech unit models is performed to determine the state sequence most likely to be associated with the sequence of acoustic vectors.
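The framing step described above can be sketched as follows. This is a minimal illustration only: the function name `split_into_frames` and the parameters `frame_len` and `hop` are assumptions introduced here, not terms from the text, and the computation of an acoustic vector from each frame is not shown.

```python
def split_into_frames(samples, frame_len, hop):
    """Split a sampled signal into frames of frame_len samples each.

    hop is the distance between the starts of successive frames
    (a hypothetical parameter name). hop == frame_len yields
    contiguous frames; hop < frame_len yields overlapping frames.
    Trailing samples that do not fill a whole frame are dropped.
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


# Toy "signal" of ten samples, for illustration only.
signal = list(range(10))
contiguous = split_into_frames(signal, frame_len=4, hop=4)
overlapping = split_into_frames(signal, frame_len=4, hop=2)
```

With these toy values, the contiguous framing produces two frames and the overlapping framing produces four, each covering a distinct portion of the signal.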
In order to find the most likely sequence of states corresponding to a sequence of acoustic vectors, the Viterbi algorithm may be employed. The Viterbi algorithm performs a computation which starts at the first frame and proceeds one frame at a time, in a time-synchronous manner. A probability score is computed for each state in the state sequences (i.e., in the HMMs) being considered. Therefore, a cumulative probability score is successively computed for each of the possible state sequences as the Viterbi algorithm analyzes the acoustic signal frame by frame. By the end of an utterance, the state sequence (or HMM or series of HMMs) having the highest probability score computed by the Viterbi algorithm provides the most likely state sequence for the entire utterance. The most likely state sequence is then converted into a corresponding spoken subword unit, word, or word sequence.
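The time-synchronous computation described above can be illustrated with a small sketch. It assumes a discrete-observation HMM with two hypothetical states, "A" and "B", and made-up probabilities. A practical recognizer would score acoustic vectors rather than symbols, so this is an illustration of the general technique and not any particular system's implementation. Scores are kept as log-probabilities so that products become sums.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (most likely state sequence, its log-probability)."""
    # scores[s]: best cumulative log-probability of any path ending in s
    scores = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []  # one dict per frame after the first
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this frame.
            prev = max(states, key=lambda p: scores[p] + log_trans[p][s])
            new_scores[s] = scores[prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[last]


def to_log(table):
    """Convert a (possibly nested) dict of probabilities to log-probabilities."""
    return {k: to_log(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}


# Hypothetical toy model; all numbers are invented for illustration.
states = ["A", "B"]
log_start = to_log({"A": 0.6, "B": 0.4})
log_trans = to_log({"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}})
log_emit = to_log({"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}})

path, log_prob = viterbi(["x", "x", "y", "y"], states,
                         log_start, log_trans, log_emit)
```

For this toy model the most likely state sequence is A, A, B, B. The work per frame is proportional to the number of transitions, which is the property that makes the computation linear in the length of the utterance.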
The Viterbi algorithm reduces an exponential computation to one that is proportional to the number of states and transitions in the model and the length of the utterance. However, for a large vocabulary, the number of states and transitions becomes large, and the computation required to update the probability score at each state in each frame for all possible state sequences takes many times longer than the duration of one frame, which is typically approximately 10 milliseconds.
Thus, a technique called pruning, or beam searching, has been developed to greatly reduce the computation needed to determine the most likely state sequence. This type of technique eliminates the need to compute the probability score for state sequences that are very unlikely. This is typically accomplished by comparing, at each frame, the probability score for each remaining state sequence (or potential sequence) under consideration with the highest score associated with that frame. If the probability score of a state for a particular potential sequence is sufficiently low (when compared to the maximum computed probability score for the other potential sequences at that point in time), the pruning algorithm assumes that such a low-scoring state sequence is unlikely to be part of the completed, most likely state sequence. The comparison is typically accomplished using a minimum threshold value. Potential state sequences having a score that falls below the minimum threshold value are removed from the searching process. The threshold value can be set at any desired level, based primarily on the desired memory and computational savings and on the acceptable increase in error rate caused by those savings. The retained state sequences will be referred to as the active hypotheses.
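The per-frame threshold comparison can be sketched as below. The function name `prune`, the `beam` parameter, and the hypothesis labels and scores are all assumptions made for illustration. Scores are log-probabilities, so the minimum threshold is taken to be the best score in the frame minus the beam width.

```python
def prune(frame_scores, beam):
    """Keep only hypotheses whose score is within `beam` of the frame's best.

    frame_scores maps a hypothesis label to its cumulative log-probability
    at the current frame. The returned dict holds the active hypotheses.
    """
    best = max(frame_scores.values())
    threshold = best - beam  # the minimum threshold for this frame
    return {h: s for h, s in frame_scores.items() if s >= threshold}


# Invented scores for four competing hypotheses at one frame.
frame_scores = {"h1": -10.0, "h2": -12.5, "h3": -30.0, "h4": -18.0}

wide = prune(frame_scores, beam=10.0)   # threshold is -20.0
narrow = prune(frame_scores, beam=5.0)  # threshold is -15.0
```

A narrower beam retains fewer active hypotheses, saving memory and computation at the cost of a higher chance of discarding the sequence that would eventually have scored best.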
Another conventional technique for further reducing the magnitude of computation required for speech recognition includes the use of a prefix tree. A prefix tree represents the lexicon of the speech recognition system as a tree structure wherein all of the words likely to be encountered by the system are represented in the tree structure.
In such a prefix tree, each subword unit (such as a phoneme) is typically represented by a branch which is associated with a particular acoustic model (such as an HMM). The phoneme branches are connected, at nodes, to subsequent phoneme branches. All words in the lexicon which share the same first phoneme share the same first branch. All words which have the same first and second phonemes share the same first and second branches. By contrast, words which have a common first phoneme, but which have different second phonemes, share the same first branch in the prefix tree but have second branches which diverge at the first node in the prefix tree, and so on. The tree structure continues in such a fashion such that all words likely to be encountered by the system are represented by the end nodes of the tree (i.e., the leaves on the tree).
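A minimal sketch of such a prefix tree follows. The four-word lexicon and its phoneme spellings are hypothetical, and the acoustic model that would be attached to each branch is omitted. The names `Node`, `build_prefix_tree`, and `count_branches` are likewise introduced here for illustration.

```python
class Node:
    def __init__(self):
        self.children = {}  # phoneme label -> child Node (one branch each)
        self.words = []     # words whose pronunciation ends at this node


def build_prefix_tree(lexicon):
    """Build a prefix tree from a dict mapping word -> list of phonemes."""
    root = Node()
    for word, phonemes in lexicon.items():
        node = root
        for ph in phonemes:
            # Reuse the existing branch if another word already created it.
            node = node.children.setdefault(ph, Node())
        node.words.append(word)
    return root


def count_branches(node):
    """Count all phoneme branches in the subtree below node."""
    return sum(1 + count_branches(child) for child in node.children.values())


# Hypothetical pronunciations, for illustration only.
lexicon = {
    "cat": ["k", "ae", "t"],
    "cab": ["k", "ae", "b"],
    "cot": ["k", "aa", "t"],
    "dog": ["d", "ao", "g"],
}
root = build_prefix_tree(lexicon)
initial_branches = len(root.children)
total_branches = count_branches(root)
```

Here the four words share only two initial branches ("k" and "d"), and the tree holds nine branches in total rather than the twelve that separate per-word models would require. In this toy lexicon every word ends at a leaf; a lexicon in which one word is a prefix of another would also place words at interior nodes.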
It is apparent that, by employing a prefix tree structure, the number of initial branches will be far fewer than the typical number of words in the lexicon or vocabulary of the system. In fact, the number of initial branches cannot exceed the total number of phonemes (approximately 40-50), regardless of the size of the vocabulary or lexicon being searched. If allophonic variations are used, however, the initial number of branches could be large, depending on the allophones used.
Speech recognition systems employing the above techniques can typically be classified in two types. The first type is a continuous speech recognition (CSR) system which is capable of recognizing fluent speech. The CSR system is trained (i.e., develops acoustic models) based on continuous speech data in which one or more readers read training data into the system in a continuous or fluent fashion. The acoustic models developed during training are used to recognize speech.
The second type of system is an isolated speech recognition (ISR) system which is typically employed to recognize only isolated speech (or discrete speech). The ISR system is trained (i.e., develops acoustic models) based on discrete or isolated speech data in which one or more readers are asked to read training data into the system in a discrete or isolated fashion with pauses between words. An ISR system is also typically more accurate and efficient than continuous speech recognition systems because word boundaries are more definite and the search space is consequently more tightly constrained. Also, isolated speech recognition systems have been thought of as a special case of continuous speech recognition, because continuous speech recognition systems generally can accept isolated speech as well. Continuous speech recognition systems simply do not perform as well when attempting to recognize isolated speech.
It has been observed that users of CSR systems typically tend to speak fluently until the system begins to make errors, or until the users are pondering the composition of the document. At that point, the user may slow down, often to the point of pausing between words. In both cases, the user believes that, by speaking more slowly and distinctly, with pauses between words, the user will assist the recognition system, when in fact the user is actually stressing the system beyond its capabilities.
It is not adequate, however, simply to attempt to recognize continuous speech with an isolated speech recognition system. ISR systems typically perform much worse than CSR systems when attempting to recognize continuous speech. This is because there is no cross-word coarticulation in the ISR training data.