The present invention relates to phonetic classification and speech recognition. In particular, the present invention relates to models used to perform automatic phonetic classification and speech recognition.
In phonetic classification and speech recognition, Hidden Markov Models (HMMs) have been used extensively to model the acoustics of speech. HMMs are generative models that use the concept of a hidden state sequence to model the non-stationarity of the generation of observations from a label. At each frame of an input signal, the HMM determines the probability of generating that frame from each possible hidden state. This probability is determined by applying observed values derived from the frame of speech to a set of probability distributions associated with the state. In addition, the HMM determines a probability of transitioning from a previous state to each of the states in the model. Using the combined transition probability and observation probability, the HMM selects the state that is most likely to have generated the frame.
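The frame-by-frame selection described above can be illustrated with a minimal Viterbi decoding sketch. It is only an illustration under stated assumptions: the function name `viterbi`, its argument layout, and the use of precomputed log probabilities are hypothetical choices made here, not details taken from any particular recognizer.

```python
def viterbi(log_init, log_trans, log_obs):
    """Return the most likely hidden state sequence for a run of frames.

    All arguments are assumed (hypothetical) precomputed log probabilities:
      log_init[s]     : log probability of starting in state s
      log_trans[p][s] : log probability of moving from state p to state s
      log_obs[t][s]   : log probability that state s generated frame t
    """
    n_states = len(log_init)
    # Best score of any path ending in each state at the first frame.
    score = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(log_obs)):
        new_score, ptr = [], []
        for s in range(n_states):
            # Pick the previous state with the best score plus transition.
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            # Combine transition and observation log probabilities.
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_obs[t][s])
            ptr.append(best_prev)
        score = new_score
        backptr.append(ptr)
    # Trace back from the best-scoring final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path
```

Note that every state is scored against the same per-frame observation vector; `log_obs[t]` has one entry per state, all derived from the same observed values.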
One limitation of Hidden Markov Models is that the probability of each state is determined using the same collection of observed values. This limitation is undesirable because different observed values are more important for certain speech sounds than for others. For example, when distinguishing vowel sounds from each other, the values of the formants are important. However, when distinguishing between fricatives, it is informative to know whether the speech is voiced or unvoiced. Thus, it would be desirable to be able to use different observed values for states associated with different speech sounds. However, HMM systems do not allow this.
In addition, HMMs do not allow a change in the length of between-frame dependencies for the observed values. Thus, at each frame, the observed values provide a fixed amount of information about previous frames. To help distinguish between speech sounds, it would be desirable to allow frame dependencies of different lengths for states associated with different speech sounds.
In the field of sequence labeling, conditional random field (CRF) models have been used to avoid some of the limitations of Hidden Markov Models. In particular, CRF models allow observations taken across an entire utterance to be used at each frame when determining the probability for a label in the frame. In addition, different labels may be associated with different observed values, thereby allowing a better selection of observed values for each label.
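As a rough illustration of these two properties, the sketch below scores label sequences with a small linear-chain CRF. Everything in it is an assumption made for the example: the labels `"s"` and `"aa"`, the feature names, the weights, the `voiced` and `f1` observation fields, and the brute-force normalization, which is only practical for very short sequences.

```python
import itertools
import math

# Hypothetical feature functions f(prev_label, label, obs, t).
FEATURES = {
    # Uses voicing only, and only for the fricative label "s".
    "s_unvoiced": lambda prev, cur, obs, t:
        1.0 if cur == "s" and not obs[t]["voiced"] else 0.0,
    # Uses the first formant only for the vowel label "aa", compared
    # against the mean over the entire utterance.
    "aa_high_f1": lambda prev, cur, obs, t:
        1.0 if cur == "aa"
        and obs[t]["f1"] > sum(o["f1"] for o in obs) / len(obs) else 0.0,
    # Transition feature: fires when the label repeats.
    "stay": lambda prev, cur, obs, t: 1.0 if prev == cur else 0.0,
}
# Hypothetical weights; in practice these would be learned in training.
WEIGHTS = {"s_unvoiced": 2.0, "aa_high_f1": 2.0, "stay": 0.5}


def sequence_score(labels, obs, weights=WEIGHTS, features=FEATURES):
    """Sum of weighted feature values over all frames of a label sequence."""
    total = 0.0
    for t in range(len(labels)):
        prev = labels[t - 1] if t > 0 else None
        for name, f in features.items():
            total += weights[name] * f(prev, labels[t], obs, t)
    return total


def crf_probability(labels, obs, label_set):
    """p(labels | obs), normalized by brute force over all label sequences."""
    z = sum(math.exp(sequence_score(cand, obs))
            for cand in itertools.product(label_set, repeat=len(obs)))
    return math.exp(sequence_score(tuple(labels), obs)) / z
```

Because each feature receives the full observation sequence `obs` together with the frame index `t`, a feature is free to consult any frame, and each label can be tied to its own subset of observed values.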
One problem with CRF models is that they have required the labels to be known at the time of training. Consequently, CRF models cannot model hidden states, since the hidden states are unknown at training. For this reason, CRF models have not been used in speech recognition and phonetic classification.
Recently, an extension to conditional random field models has been suggested that incorporates hidden states. However, it has not been suggested or shown that this extension of the conditional random field models can be used in speech recognition or phonetic classification. In particular, the hidden states shown in the extension do not correspond to hidden states traditionally used in speech recognition, which are associated with particular phonetic units.