The present invention relates to pattern recognition. In particular, the present invention relates to speech recognition.
A pattern recognition system, such as a speech recognition system, takes an input signal and attempts to decode the signal to find a pattern represented by the signal. For example, in a speech recognition system, a speech signal (often referred to as a test signal) is received by the recognition system and is decoded to identify a string of words represented by the speech signal.
Many speech recognition systems utilize Hidden Markov Models (HMMs) in which phonetic units, which are also referred to as acoustic units or speech units, are represented by a single tier of connected states. Using a training signal, probability distributions for occupying the states and for transitioning between states are determined for each of the phonetic units. To decode a speech signal, the signal is divided into frames and each frame is transformed into a feature vector. The feature vectors are then compared to the distributions for the states to identify a most likely sequence of HMM states that can be represented by the frames. The phonetic unit that corresponds to that sequence is then selected.
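By way of illustration only, the decoding step described above can be sketched with the Viterbi algorithm, which is one common way to find a most likely state sequence. The sketch assumes scalar features and a single Gaussian output distribution per state; the function names, the two-state left-to-right model, and all numeric values are hypothetical and are not taken from any particular recognition system.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a 1-D Gaussian; stands in for a state's output distribution."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(features, log_init, log_trans, means, variances):
    """Return the most likely state sequence for a list of scalar features."""
    n_states = len(log_init)
    # score[s]: best log probability of any state path ending in s at the current frame
    score = [log_init[s] + log_gaussian(features[0], means[s], variances[s])
             for s in range(n_states)]
    backpointers = []
    for x in features[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor of state s under the transition distribution
            best_prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_gaussian(x, means[s], variances[s]))
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical two-state, left-to-right model (illustrative values only)
NEG_INF = -math.inf
log_init = [0.0, NEG_INF]                      # always start in state 0
log_trans = [[math.log(0.5), math.log(0.5)],   # state 0: stay or advance
             [NEG_INF, 0.0]]                   # state 1: stay
means, variances = [0.0, 5.0], [1.0, 1.0]

path = viterbi([0.1, -0.2, 4.8, 5.1], log_init, log_trans, means, variances)
print(path)  # [0, 0, 1, 1]
```

In this toy case the first two frames lie near the mean of state 0 and the last two near the mean of state 1, so the decoded path switches states between the second and third frames.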
Although HMM-based recognition systems perform well in many relatively simple speech recognition tasks, they do not model some important dynamic aspects of speech directly (and are known to perform poorly for difficult tasks such as conversational speech). As a result, they are not able to accommodate dynamic articulation differences between the speech signals used for training and the speech signal being decoded.
Alternatives to HMM systems have been proposed. In particular, it has been proposed that the statistically defined trajectory or behavior of a production-related parameter of the speech signal should be modeled directly. Since the production-related values cannot be measured directly, these models are known as Hidden Dynamic Models (HDMs). Hidden Dynamic Models are one example of a class of models known as switching state space models, which provide two types of hidden states, one discrete and one continuous. The two types of hidden states form two first order Markov chains, where the continuous chain is conditioned on the discrete one.
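By way of illustration only, the two coupled hidden chains described above can be sketched as a small sampler. The sketch assumes a linear, target-directed update for the continuous state with Gaussian noise; that particular update rule, the function name, and every parameter (`trans`, `targets`, `rates`, `noise_std`, `x0`) are assumptions made for the example and are not a statement of how any specific Hidden Dynamic Model is defined.

```python
import random

def sample_switching_model(n_frames, trans, targets, rates,
                           noise_std=0.1, x0=0.0, seed=0):
    """Sample a discrete chain and a continuous chain conditioned on it."""
    rng = random.Random(seed)
    s, x = 0, x0
    states, hidden = [], []
    for _ in range(n_frames):
        # Discrete chain: the next state depends only on the current discrete state
        s = rng.choices(range(len(trans)), weights=trans[s])[0]
        # Continuous chain: depends on the previous continuous value and on the
        # current discrete state (assumed target-directed linear dynamics)
        x = rates[s] * x + (1.0 - rates[s]) * targets[s] + rng.gauss(0.0, noise_std)
        states.append(s)
        hidden.append(x)
    return states, hidden

# Hypothetical two-state example (illustrative values only)
states, hidden = sample_switching_model(
    n_frames=20,
    trans=[[0.9, 0.1], [0.1, 0.9]],
    targets=[2.0, -1.0],
    rates=[0.5, 0.5],
)
```

Each discrete state pulls the continuous value toward its own target, so a change in the discrete chain shows up as a change in the direction of the continuous trajectory rather than as an abrupt jump.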
One problem with switching state space models is that it is difficult to train them because common training algorithms, such as the Expectation-Maximization algorithm, become intractable for switching state space models. In particular, the computation required by such algorithms increases exponentially with each additional frame of the speech signal.
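By way of illustration only, the exponential growth noted above can be made concrete by counting discrete state sequences. The sketch assumes that an exact computation must consider every distinct sequence of discrete states separately, so that a model with S discrete states and T frames has S to the power T sequences; the function names are hypothetical.

```python
from itertools import product

def count_discrete_paths(n_states, n_frames):
    """Number of distinct discrete state sequences over n_frames frames."""
    return n_states ** n_frames

def enumerate_discrete_paths(n_states, n_frames):
    """Explicitly list every sequence; only feasible for very small inputs."""
    return list(product(range(n_states), repeat=n_frames))

# Each additional frame multiplies the number of sequences by n_states
for n_frames in (1, 2, 4, 8):
    print(n_frames, count_discrete_paths(3, n_frames))
```

With only three discrete states, a one-second utterance at an assumed 100 frames per second would already yield 3 to the power 100 sequences, which is far beyond what can be enumerated.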
Thus, a training system is needed that allows the parameters of a switching state space dynamic model to be trained efficiently.