In phonetic classification and speech recognition, Hidden Markov Models (HMMs) have been used extensively to model the acoustics of speech. HMMs are generative models that use the concept of a hidden state sequence to model the non-stationarity of the generation of observations from a label. At each frame of an input signal, the HMM determines the probability of generating that frame from each possible hidden state. This probability is determined by applying a feature vector derived from the frame of speech to a set of probability distributions associated with the state. In addition, the HMM determines a probability of transitioning from a previous state to each of the states in the Hidden Markov Model. Using the combined transition probability and observation probability, the Hidden Markov Model selects the state that is most likely to have generated the frame.
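The per-frame combination of transition and observation probabilities described above can be sketched as a small Viterbi decoder. This is a minimal illustration under stated assumptions, not the method of any particular recognizer: each frame is reduced to a single scalar feature, each state's observation model is a one-dimensional Gaussian, and the names `gaussian_log_pdf` and `viterbi` as well as all numeric values are hypothetical.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log density of a 1-D Gaussian; stands in for a state's observation model."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(frames, log_init, log_trans, means, variances):
    """Return the most likely hidden-state sequence for a list of scalar features."""
    n_states = len(log_init)
    # Score of the best path ending in each state at the first frame.
    scores = [log_init[s] + gaussian_log_pdf(frames[0], means[s], variances[s])
              for s in range(n_states)]
    backpointers = []
    for x in frames[1:]:
        new_scores, pointers = [], []
        for s in range(n_states):
            # Best previous state: path score plus transition log-probability.
            prev = max(range(n_states), key=lambda p: scores[p] + log_trans[p][s])
            # Observation log-probability of this frame under state s.
            obs = gaussian_log_pdf(x, means[s], variances[s])
            new_scores.append(scores[prev] + log_trans[prev][s] + obs)
            pointers.append(prev)
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Toy two-state example; every number here is made up for illustration.
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
path = viterbi([0.1, -0.2, 0.0, 4.9, 5.2], log_init, log_trans,
               means=[0.0, 5.0], variances=[1.0, 1.0])
print(path)  # [0, 0, 0, 1, 1]
```

Note that every state is scored against the same feature value `x`, which is the property discussed in the next paragraph.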
One limitation of Hidden Markov Models is that the probabilities of each state are determined using the same feature vectors, and thus the same collection of features is used for every state. This limitation is undesirable because different features are more important for certain speech sounds than for others. For example, when distinguishing vowel sounds from each other, the values of the formants are important. When distinguishing between fricatives, on the other hand, information as to whether the speech is voiced or unvoiced is more informative. HMM systems nevertheless do not allow different features to be used for states associated with different speech sounds.
In addition, HMMs do not allow the length of between-frame dependencies to vary across features. Thus, at each frame, the features provide a fixed amount of information about previous frames. Although dependencies of different lengths would help distinguish between speech sounds, current systems do not allow different frame-dependency lengths for states associated with different speech sounds. Also, HMMs do not allow any flexibility in the amount of acoustic data that is summarized in each frame. Typically, frames are generated by analyzing 25 millisecond segments of the acoustic waveform. This is a compromise between the long time scales required for frequency analysis of voiced sounds such as vowels and the short time scales required for reliably detecting short sounds such as plosives.
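The fixed-length framing described above can be sketched as follows. The 25 millisecond frame length comes from the text; the 10 millisecond hop between frame starts, the 16 kHz sample rate, and the function name `frame_signal` are assumptions chosen only for illustration.

```python
def frame_signal(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a waveform into fixed-length, possibly overlapping frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))  # hop is an assumed value
    frames = []
    start = 0
    # Keep only frames that fit entirely inside the signal.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames

samples = [0.0] * 16000  # one second of silence at an assumed 16 kHz rate
frames = frame_signal(samples, 16000)
print(len(frames), len(frames[0]))  # 98 400
```

Every frame covers the same 400 samples regardless of which speech sound it contains, which is the inflexibility the paragraph describes.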
In the field of sequence labeling, conditional random field (CRF) models have been used that avoid some of the limitations of Hidden Markov Models. In particular, conditional random field models allow observations taken across an entire utterance to be used at each frame when determining the probability for a label in the frame. In addition, different labels may be associated with different features, thereby allowing a better selection of features for each label.
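The idea of label-specific features that can look anywhere in the utterance can be illustrated with a simplified per-frame log-linear scorer. This sketch deliberately omits the transition terms and sequence-level normalization of an actual conditional random field, and the labels, feature functions, and weights are all hypothetical.

```python
import math

def local_average(obs, t, radius):
    """Average of the observations within `radius` frames of frame t."""
    window = obs[max(0, t - radius): t + radius + 1]
    return sum(window) / len(window)

# Hypothetical label-specific features: each label inspects the utterance differently.
FEATURES = {
    "vowel": lambda obs, t: local_average(obs, t, radius=3),     # long context
    "plosive": lambda obs, t: abs(obs[t] - obs[max(0, t - 1)]),  # abrupt change
}
WEIGHTS = {"vowel": 1.0, "plosive": 2.0}  # made-up weights

def label_posteriors(obs, t):
    """Normalize exponentiated per-label scores at frame t (no transition terms)."""
    scores = {lab: WEIGHTS[lab] * f(obs, t) for lab, f in FEATURES.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {lab: math.exp(s) / z for lab, s in scores.items()}

obs = [0.1, 0.1, 0.1, 2.0, 0.2, 0.1]  # toy scalar observations, one per frame
post = label_posteriors(obs, 3)
print(max(post, key=post.get))  # plosive
```

Each feature function receives the full observation list `obs`, so a label is free to use a context window of whatever length suits it.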
One problem with CRF models is that they have required that the states be known at the time of training. Consequently, CRF models cannot model hidden states, since hidden states are unknown at training. For this reason, CRF models have not been used in speech recognition and phonetic classification.
Recently, an extension to conditional random field models has been suggested that incorporates hidden states. However, it has not been suggested or shown that this extension of the conditional random field models can be used in speech recognition or phonetic classification. In particular, the hidden states shown in the extension do not correspond to hidden states traditionally used in speech recognition, which are associated with particular phonetic units.
In addition, training CRF-type models presents some difficulties. Some techniques for training such models include the expectation maximization (EM) algorithm used with an iterative scaling approach such as Generalized Iterative Scaling (GIS), or with a batch-level gradient-based approach such as the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) update approach. These training methods are batch methods that process all of the training data once in each iteration of training before updating the model parameters. Many iterations are usually required to reach a desired level of performance. Thus, training can be slow and cumbersome.
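The batch behavior described above, one parameter update per full pass over the training data, can be sketched with plain batch gradient ascent on a tiny binary log-linear model. Plain gradient ascent is substituted here for GIS and L-BFGS purely to keep the example short; the function name, step size, iteration count, and data are all assumptions.

```python
import math

def batch_gradient_ascent(data, n_features, n_iters=500, step=0.5):
    """Fit binary log-linear weights with one update per full pass over the data."""
    weights = [0.0] * n_features
    for _ in range(n_iters):
        grad = [0.0] * n_features
        # Accumulate the log-likelihood gradient over every training example first.
        for features, label in data:
            z = sum(w * f for w, f in zip(weights, features))
            p = 1.0 / (1.0 + math.exp(-z))
            for i, f in enumerate(features):
                grad[i] += (label - p) * f
        # Only then update the parameters, once per iteration.
        weights = [w + step * g / len(data) for w, g in zip(weights, grad)]
    return weights

# Toy data: (bias term, scalar feature) with a binary label; values are made up.
data = [([1.0, 0.0], 0), ([1.0, 1.0], 0), ([1.0, 3.0], 1), ([1.0, 4.0], 1)]
weights = batch_gradient_ascent(data, n_features=2)
```

Even on four examples the loop needs many passes before the weights separate the two classes, and each pass touches every example, which is why such training scales poorly to large speech corpora.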
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.