The rapid and accurate recognition of human speech by a computer system has been a long-sought goal of developers of computer systems. The benefits that would result from such a computer speech recognition (CSR) system are substantial. For example, rather than typing a document into a computer system, a person could simply speak the words of the document, and the CSR system would recognize the words and store the letters of each word as if the words had been typed. Since people generally can speak faster than they can type, efficiency would be improved. Also, people would no longer need to learn how to type. Computers could also be used in many applications where their use is currently impracticable because a person's hands are occupied with tasks other than typing.
Typical CSR systems recognize words by comparing a spoken utterance to a model of each word in a vocabulary. The word whose model best matches the utterance is recognized as the spoken word. A CSR system may model each word as a sequence of phonemes that compose the word. To recognize an utterance, the CSR system identifies the word sequence whose phonemes best match the utterance. These phonemes may, however, not exactly correspond to the phonemes that compose a word. Thus, CSR systems typically use a probability analysis to determine which word most closely corresponds to the identified phonemes.
When recognizing an utterance, a CSR system converts the analog signal representing the utterance to a more usable form for further processing. The CSR system first converts the analog signal into a digital form. The CSR system then applies a signal processing technique, such as fast Fourier transforms (FFT), linear predictive coding (LPC), or filter banks, to the digital form to extract an appropriate parametric representation of the utterance. A commonly used representation is a "feature vector" with FFT or LPC coefficients that represent the frequency and/or energy bands of the utterance at various intervals (referred to as "frames"). The intervals can be short or long based on the computational capacity of the computer system and the desired accuracy of the recognition process. Typical intervals may be in the range of 10 milliseconds. That is, the CSR system would generate a feature vector for every 10 milliseconds of the utterance. Each frame is typically 25 ms long, so a 25 ms frame is generated every 10 ms and successive frames overlap.
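The framing described above can be sketched as follows. This is a minimal illustration, not a description of any particular CSR system: the function name `frame_signal`, the 16 kHz sample rate, and the choice to drop a trailing partial frame are all assumptions made for the example. Only the 25 ms frame length and the 10 ms interval come from the text.

```python
def frame_signal(samples, sample_rate, frame_ms=25, step_ms=10):
    """Split digitized samples into overlapping frames.

    frame_ms and step_ms follow the typical values given in the text;
    a trailing partial frame is dropped (an assumption of this sketch).
    """
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    step = sample_rate * step_ms // 1000        # samples between frame starts
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += step
    return frames


# Hypothetical input: one second of audio at an assumed 16 kHz sample rate.
# The sample values are just their own indices so the overlap is easy to see.
SAMPLE_RATE = 16000
samples = list(range(SAMPLE_RATE))
frames = frame_signal(samples, SAMPLE_RATE)

# Each frame would then be passed to FFT, LPC, or filter-bank analysis
# to produce one feature vector per frame (not shown here).
```

With these assumed values each frame holds 400 samples and a new frame starts every 160 samples, so consecutive frames share 240 samples (15 ms).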
To facilitate the processing of the feature vectors, each feature vector is quantized into one of a limited number (e.g., 256) of "quantization vectors." That is, the CSR system defines a number of quantization vectors that are selected to represent typical or average ranges of feature vectors. The CSR system then compares each feature vector to each of the quantization vectors and selects the quantization vector that most closely resembles the feature vector to represent the feature vector. Each quantization vector is uniquely identified by a number (e.g., between 1 and 256), which is referred to as a "codeword." When a feature vector is represented as a quantization vector, there is a loss of information because many different feature vectors map to the same quantization vector. To ensure that this information loss will not seriously impact recognition, CSR systems may define thousands or millions of quantization vectors. The amount of storage needed to store the definition of such a large number of quantization vectors can be considerable. Thus, to reduce the amount of storage needed, CSR systems segment feature vectors and quantize each segment into one of a small number (e.g., 256) of quantization vectors. Thus, each feature vector is represented by a quantization vector (identified by a codeword) for each segment. For simplicity of explanation, a CSR system that does not segment a feature vector and thus has only one codeword per feature vector (or frame) is described.
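The quantization step can be illustrated with a short sketch. The text says only that the quantization vector that "most closely resembles" the feature vector is selected; the use of squared Euclidean distance here is an assumption, as are the function name `quantize` and the tiny four-entry, two-dimensional codebook.

```python
def quantize(feature, codebook):
    """Return the 1-based codeword of the nearest quantization vector.

    Closeness is measured by squared Euclidean distance, which is an
    assumption of this sketch rather than something the text specifies.
    """
    best_codeword, best_dist = None, float("inf")
    for codeword, qvec in enumerate(codebook, start=1):
        dist = sum((f - q) ** 2 for f, q in zip(feature, qvec))
        if dist < best_dist:
            best_codeword, best_dist = codeword, dist
    return best_codeword


# Hypothetical codebook with 4 quantization vectors (real systems use e.g. 256).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Two different feature vectors map to the same codeword (2),
# which is the information loss described in the text.
cw_a = quantize((0.9, 0.2), codebook)
cw_b = quantize((0.8, 0.1), codebook)
```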
As discussed above, a spoken utterance often does not exactly correspond to a model of a word. The difficulty in finding an exact correspondence is due to the great variation in speech that is not completely and accurately captured by the word models. These variations result from, for example, the accent of the speaker, the speed and pitch at which a person speaks, the current health (e.g., with a cold) of the speaker, the age and sex of the speaker, etc. CSR systems that use probabilistic techniques have been more successful in accurately recognizing speech than techniques that seek an exact correspondence.
One such probabilistic technique that is commonly used for speech recognition is hidden Markov modeling. A CSR system may use a hidden Markov model ("HMM") for each word in the vocabulary. The HMM for a word includes probabilistic information from which can be derived the probability that any sequence of codewords corresponds to that word. Thus, to recognize an utterance, a CSR system converts the utterance to a sequence of codewords and then uses the HMM for each word to determine the probability that the word corresponds to the utterance. The CSR system recognizes the utterance as the word with the highest probability.
An HMM is represented by a state diagram. State diagrams are traditionally used to determine a state that a system will be in after receiving a sequence of inputs. A state diagram comprises states and transitions between source and destination states. Each transition has associated with it an input which indicates that when the system receives that input and it is in the source state, the system will transition to the destination state. Such a state diagram could, for example, be used by a system that recognizes each sequence of codewords that compose the words in a vocabulary. As the system processes each codeword, the system determines the next state based on the current state and the codeword being processed. In this example, the state diagram would have a certain final state that corresponds to each word. However, if multiple pronunciations of a word are represented, then each word may have multiple final states. If after processing the codewords, the system is in a final state that corresponds to a word, then that sequence of codewords would be recognized as the word of the final state.
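A non-probabilistic state diagram of the kind just described can be sketched as a lookup table keyed by (current state, codeword). The function names, the integer state numbering, and the codeword sequences assigned to the words "yes" and "no" are all hypothetical and chosen only for illustration.

```python
def build_state_diagram(pronunciations):
    """Build a deterministic state diagram from (word, codeword sequence) pairs.

    Returns a transition table mapping (state, codeword) -> next state and a
    table mapping each final state to its word. State 0 is the start state.
    """
    transitions = {}
    final_states = {}
    next_id = 1
    for word, sequence in pronunciations:
        state = 0
        for codeword in sequence:
            key = (state, codeword)
            if key not in transitions:
                transitions[key] = next_id
                next_id += 1
            state = transitions[key]
        final_states[state] = word
    return transitions, final_states


def recognize(codewords, transitions, final_states):
    """Follow the transitions; return the word of the final state, or None."""
    state = 0
    for codeword in codewords:
        state = transitions.get((state, codeword))
        if state is None:
            return None  # no transition for this input from this state
    return final_states.get(state)


# Hypothetical vocabulary; "yes" has two pronunciations, so two final states.
pronunciations = [("yes", [7, 5, 2]), ("yes", [7, 5, 3]), ("no", [4, 1])]
transitions, final_states = build_state_diagram(pronunciations)
word = recognize([7, 5, 3], transitions, final_states)  # "yes"
```

A diagram like this accepts only exact codeword sequences, which is the limitation that motivates attaching probabilities to the transitions.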
An HMM, however, has a probability associated with each transition from one state to another for each codeword. For example, if an HMM is in state 2, then the probability may be 0.1 that a certain codeword would cause a transition from the current state to a next state, and the probability may be 0.2 that the same codeword would cause a transition from the current state to a different next state. Similarly, the probability may be 0.01 that a different codeword would cause a transition from the current state to a next state. Since an HMM has probabilities associated with its state diagram, the determination of the final state for a given sequence of codewords can only be expressed in terms of probabilities. Thus, to determine the probability of each possible final state for a sequence of codewords, each possible sequence of states for the state diagram of the HMM needs to be identified and the associated probabilities need to be calculated. Each such sequence of states is referred to as a state path.
To simplify recognition, rather than use an HMM with a large state diagram representing the probabilities for each possible sequence of codewords for each possible word, typical CSR systems represent each possible phonetic unit with an HMM and represent each word as a sequence of the phonetic units. Traditionally, the phonetic unit has been a phoneme. However, other phonetic units, such as senones, have been used. (See Hwang et al., "Predicting Unseen Triphones with Senones," Proc. ICASSP '93, 1993, Vol. II, pp. 311-314.) With an HMM for each phonetic unit, a CSR system evaluates the probability that a sequence of phonemes represents a certain word by concatenating the HMMs for the phonemes that compose the word and evaluating the resulting HMM.
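One way to picture the concatenation of phoneme HMMs is as a relabeling of states, with the exit of each phoneme wired to the first state of the next. The sketch below handles transition probabilities only; the `"exit"` label, the `(position, phoneme, state)` naming of word-level states, the phoneme names, and all probability values are assumptions made for illustration.

```python
def concatenate(phoneme_hmms):
    """Chain phoneme HMMs into one word-level transition table.

    phoneme_hmms is a list of (name, trans) pairs, where trans maps each
    state to {next_state: probability}; the key "exit" marks leaving the
    phoneme. Word-level states are named (position, phoneme name, state).
    Each phoneme HMM is assumed to start in state 1.
    """
    word_trans = {}
    for pos, (name, trans) in enumerate(phoneme_hmms):
        for state, arcs in trans.items():
            word_arcs = {}
            for dst, prob in arcs.items():
                if dst != "exit":
                    word_arcs[(pos, name, dst)] = prob
                elif pos + 1 < len(phoneme_hmms):
                    # Leaving this phoneme enters state 1 of the next one.
                    next_name = phoneme_hmms[pos + 1][0]
                    word_arcs[(pos + 1, next_name, 1)] = prob
                else:
                    word_arcs["exit"] = prob  # end of the word
            word_trans[(pos, name, state)] = word_arcs
    return word_trans


# Hypothetical three-state, left-to-right topology shared by both phonemes.
PHONE_TRANS = {
    1: {1: 0.5, 2: 0.5},
    2: {2: 0.5, 3: 0.5},
    3: {3: 0.5, "exit": 0.5},
}
word_hmm = concatenate([("ae", PHONE_TRANS), ("t", PHONE_TRANS)])
```

In a fuller version the output probability tables would be carried over under the same word-level state names, so the combined model can be evaluated like any single HMM.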
Each HMM contains for each state the probability that each codeword will result in a transition to each other state. The probabilities associated with each state transition are represented by codeword-dependent output probabilities for that state and codeword-independent transition probabilities for the state. The codeword-dependent output probability for a state reflects the likelihood that the phoneme will contain that codeword as the next codeword after a sequence of codewords results in the HMM being in that state. The codeword-independent transition probabilities for a state indicate the probability that the HMM will transition from that state to each next state. Thus, the probability that the HMM will transition from a current state to a next state when a codeword is input is the product of the transition probability from the current state to the next state and the output probability for the received codeword.
FIG. 1 illustrates a sample HMM for a phoneme. The HMM contains three states and two transitions out of each state. Generally, CSR systems use the same state diagram to represent each phoneme, but with phoneme-dependent output and transition probabilities. According to this HMM, a transition can only occur to the same state or to the next state, which models the left-to-right nature of speech. Each state has an associated output probability table and a transition probability table that contain the output and transition probabilities. As shown in FIG. 1, the output probability for codeword 5 is 0.1 when the HMM is in state 2, and the transition probability to state 3 is 0.8 when the HMM is in state 2. Thus, the probability that the HMM will transition to state 3 from state 2 when codeword 5 is received is 0.08 (i.e., 0.1 × 0.8).
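The arithmetic in this example can be written out as a small sketch. Only the two values quoted in the text (output probability 0.1 for codeword 5 in state 2, and transition probability 0.8 from state 2 to state 3) are used; the table layout and the function name `step_probability` are assumptions.

```python
# Only the two entries quoted in the text are filled in; the remaining
# entries of the FIG. 1 tables are not reproduced here.
output_prob = {2: {5: 0.1}}      # output_prob[state][codeword]
transition_prob = {2: {3: 0.8}}  # transition_prob[state][next_state]


def step_probability(src, dst, codeword):
    """Probability of moving from src to dst when codeword is received."""
    return transition_prob[src][dst] * output_prob[src][codeword]


p = step_probability(2, 3, 5)
print(round(p, 4))  # prints 0.08
```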
To determine the probability that a sequence of codewords represents a phoneme, the CSR system may generate a probability lattice. The probability lattice for the HMM of a phoneme represents a calculation of the probabilities for each possible state path for the sequence of codewords. The probability lattice contains a node for each possible state that the HMM can be in for each codeword in the sequence. Each node contains the accumulated probability that the codewords processed so far will result in the HMM being in the state associated with that node. The sum of the probabilities in the nodes for a particular codeword indicates the likelihood that the codewords processed so far represent a prefix portion of the phoneme.
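A lattice of accumulated probabilities can be sketched as below. The HMM tables are hypothetical and are not the values of FIG. 1; the codeword alphabet is reduced to two codewords, the exit transition of state 3 is omitted, and the HMM is assumed to start in state 1 before any codeword is processed. Each node sums the contributions of all of its source states.

```python
# Hypothetical three-state left-to-right HMM (NOT the FIG. 1 values).
OUT = {1: {1: 0.8, 2: 0.2}, 2: {1: 0.2, 2: 0.8}, 3: {1: 0.5, 2: 0.5}}
TRANS = {1: {1: 0.5, 2: 0.5}, 2: {2: 0.5, 3: 0.5}, 3: {3: 0.5}}  # exit omitted


def forward_lattice(codewords, out_prob, trans_prob, start_state=1):
    """Return one {state: accumulated probability} column per codeword."""
    column = {start_state: 1.0}
    lattice = []
    for codeword in codewords:
        nxt = {}
        for src, prob in column.items():
            emit = out_prob[src].get(codeword, 0.0)
            for dst, t in trans_prob[src].items():
                # Sum over every source state that can reach dst.
                nxt[dst] = nxt.get(dst, 0.0) + prob * emit * t
        lattice.append(nxt)
        column = nxt
    return lattice


lattice = forward_lattice([1, 2, 2], OUT, TRANS)
prefix_likelihood = sum(lattice[-1].values())  # sum over the last column
```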
FIG. 2 is a diagram illustrating a probability lattice. The probability lattice represents a calculation of the probabilities for each possible state of the HMM shown in FIG. 1 when the codeword sequence "7, 5, 2, 1, 2" is processed. The horizontal axis corresponds to the codewords and the vertical axis corresponds to the states of the HMM. Each node of the lattice contains the maximum, taken over its source states, of the source state's probability times the output and transition probabilities, rather than the sum of those products. For example, node 201 contains a probability of 8.6E-6, which is the maximum of 3.6E-4 × 0.01 × 0.9 and 1.4E-3 × 0.03 × 0.2. There are many different state paths (i.e., sequences of states) that lead to any node. For example, node 201 may be reached by state paths "1, 2, 3, 3," "1, 2, 2, 3," and "1, 1, 2, 3." Each state path has a probability that the HMM will follow that state path when processing the codeword sequence. The probability in each node is the maximum of the probabilities of each state path that leads to the node. These maximum probabilities are used for Viterbi alignment as discussed below.
FIG. 3 illustrates a probability lattice for a word. The vertical axis corresponds to the concatenation of the states of the HMM for the phonemes that compose the word. Node 301 represents a final state for the word and contains the maximum probability of all the state paths that lead to that node. The emboldened lines of FIG. 3 represent the state path with the highest probability that ends at node 301. In certain applications (e.g., training a CSR system), it is helpful to identify the state path that has the highest probability of leading to a particular node. One well-known algorithm for identifying such a state path is the Viterbi algorithm. After the Viterbi algorithm has determined the highest probability state path to the final state, it is possible to backtrace from the final node in the lattice and determine the previous node on the highest probability state path all the way back to the starting state. For example, the state path that has the highest probability of ending at node 203 of FIG. 2 is "1, 2, 2, 2, 2, 3." When the probability lattice represents the phonemes that compose a word, then each state can be identified by the phoneme and the state within the phoneme.
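The maximum-probability lattice and the backtrace can be sketched together. This is a minimal illustration of the Viterbi algorithm under stated assumptions: the HMM tables are hypothetical (not the values of FIG. 1 or FIG. 2), only two codewords exist, the exit transition of state 3 is omitted, and the HMM starts in state 1 before the first codeword.

```python
# Hypothetical three-state left-to-right HMM (NOT the FIG. 1 values).
OUT = {1: {1: 0.8, 2: 0.2}, 2: {1: 0.2, 2: 0.8}, 3: {1: 0.5, 2: 0.5}}
TRANS = {1: {1: 0.5, 2: 0.5}, 2: {2: 0.5, 3: 0.5}, 3: {3: 0.5}}  # exit omitted


def viterbi(codewords, out_prob, trans_prob, start_state=1):
    """Return the last lattice column and the backpointers of every column."""
    column = {start_state: 1.0}
    backpointers = []
    for codeword in codewords:
        nxt, back = {}, {}
        for src, prob in column.items():
            emit = out_prob[src].get(codeword, 0.0)
            for dst, t in trans_prob[src].items():
                candidate = prob * emit * t
                # Keep only the best source state for each node.
                if candidate > nxt.get(dst, -1.0):
                    nxt[dst] = candidate
                    back[dst] = src
        backpointers.append(back)
        column = nxt
    return column, backpointers


def backtrace(final_state, backpointers):
    """Walk the backpointers from the final node back to the start state."""
    path = [final_state]
    for back in reversed(backpointers):
        path.append(back[path[-1]])
    path.reverse()
    return path


final_column, backpointers = viterbi([1, 2, 2], OUT, TRANS)
best_path = backtrace(3, backpointers)  # [1, 2, 2, 3]
```

As in the text, the path lists the starting state followed by one state per codeword, so three codewords yield a four-state path.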
The accuracy of a CSR system depends, in part, on the accuracy of the output and transition probabilities of the HMM for each phoneme. Typical CSR systems "train" the CSR system so that the output and transition probabilities accurately reflect speech of the average speaker. During training, the CSR system gathers codeword sequences from various speakers for a large variety of words. The words are selected so that each phoneme is spoken a large number of times. From these codeword sequences, the CSR system calculates output and transition probabilities for each HMM. Various iterative approaches for calculating these probabilities are well-known and described in Huang et al., "Hidden Markov Models for Speech Recognition," Edinburgh University Press, 1990.
A problem with such training techniques, however, is that such average HMMs may not accurately model the speech of people whose speech pattern differs from the average. In general, every person will have certain speech patterns that differ from the average. Consequently, CSR systems allow a speaker to train the HMMs to adapt to the speaker's speech patterns. In such training, CSR systems refine the HMM parameters, such as the output and transition probabilities and the quantization vectors represented by the codewords, by using training utterances spoken by the actual user of the system. The adapted parameters are derived by using both the user-supplied data and the information and parameters generated from the large amount of speaker-independent data. Thus, the probabilities reflect speaker-dependent characteristics. One such training technique is described in Huang and Lee, "On Speaker-Independent, Speaker-Dependent, and Speaker-Adaptive Speech Recognition," Proc. ICASSP '91, 1991, pp. 877-880.
A CSR system is typically trained by presenting a large variety of pre-selected words to a speaker. These words are selected to ensure that a representative sample of speech corresponding to each phoneme can be collected. With this representative sample, the CSR system can ensure that any HMM that does not accurately reflect the speaker's pronunciation of that phoneme can be adequately adapted. When additional training is performed, for example, because the speaker is not satisfied with the accuracy of the recognition, the CSR system presents additional pre-selected words to the speaker.
Although the use of pre-selected words can provide adequate training, the speakers may become frustrated with having to speak a large number of words. Indeed, since the words are pre-selected to include each phoneme, the speaker is effectively asked to speak words whose phonemes are modeled with an acceptable accuracy. It would, therefore, be useful to have a training system that could dynamically select words for training that will tend to optimize the accuracy of the training and reduce the number of words that a speaker is requested to speak.