Speaker-dependent speech recognition systems use a feature extraction algorithm to perform signal processing on each frame of the input speech and to output a feature vector representing that frame. This processing takes place once per frame. The frame duration is generally between 10 and 30 ms, and will be exemplified herein as 20 ms. A large number of different features are known for use in voice recognition systems.
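By way of illustration only, the framing step described above can be sketched as follows. The 8 kHz sample rate, the non-overlapping frames, and the single log-energy feature are assumptions chosen for brevity and are not taken from the description; the function names are hypothetical, and a practical system would typically use a richer feature vector.

```python
import math

def frame_signal(samples, sample_rate_hz=8000, frame_ms=20):
    """Split samples into consecutive, non-overlapping frames of frame_ms each.

    Assumption: trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]

def log_energy(frame, floor=1e-10):
    """Log of the mean squared amplitude; floor avoids log(0) on silence."""
    energy = sum(x * x for x in frame) / len(frame)
    return math.log(max(energy, floor))

def extract_features(samples, sample_rate_hz=8000, frame_ms=20):
    """Return one (here one-dimensional) feature vector per frame."""
    frames = frame_signal(samples, sample_rate_hz, frame_ms)
    return [[log_energy(f)] for f in frames]
```

With these assumed defaults, each 20 ms frame holds 160 samples, so one second of speech yields 50 feature vectors.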
Generally speaking, a training algorithm uses the features extracted from the sampled speech of one or more utterances of a word or phrase to generate parameters for a model of that word or phrase. This model is then stored in a model storage memory. These models are later used during speech recognition. The recognition system compares the features of an unknown utterance with stored model parameters to determine the best match. The best matching model is then output from the recognition system as the result.
It is known to use a Hidden Markov Model (HMM) based recognition system for this process. HMM recognition systems allocate frames of the utterance to states of the HMM. The frame-to-state allocation that produces the largest probability, or score, is selected as the best match.
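The frame-to-state allocation can be illustrated with a minimal Viterbi sketch. It assumes a left-to-right model in which each state may only loop on itself or advance to the next state, a path that starts in the first state and ends in the last, and scores held as log probabilities. The names `emit`, `log_self`, and `log_next` are hypothetical; the description does not specify a particular topology or scoring form.

```python
NEG_INF = float("-inf")

def viterbi_align(emit, log_self, log_next):
    """Return (best log score, state index per frame).

    emit[t][s]  : log-likelihood of frame t under state s
    log_self[s] : log probability of staying in state s
    log_next[s] : log probability of moving from state s to state s + 1
    """
    n_frames, n_states = len(emit), len(emit[0])
    score = [[NEG_INF] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = emit[0][0]  # the path must start in the first state
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s] + log_self[s]
            move = score[t - 1][s - 1] + log_next[s - 1] if s > 0 else NEG_INF
            if move > stay:
                score[t][s], back[t][s] = move + emit[t][s], s - 1
            else:
                score[t][s], back[t][s] = stay + emit[t][s], s
    # Trace back from the last state at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][-1], path
```

In a recognizer built along these lines, the same routine would be run against each stored model, and the model yielding the largest score would be reported as the match.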
One problem with HMMs is that they assume an exponentially decaying distribution (in discrete time, a geometric distribution) for the duration of a state. This is fundamental to the Markov process assumption, which assumes that the state transitions for frame F_t are dependent only on the state of the system at frame F_{t-1}. This model does not fit speech especially well. For this reason, some modern recognition systems break the Markov assumption and assign state transition penalties which are related to the duration of a state.
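The implied duration distribution can be written out explicitly. The symbols a_{ii} (self-loop probability of state i) and d_i (number of consecutive frames spent in state i) are notation introduced here for illustration and do not appear in the description. Remaining in state i for exactly d frames requires d - 1 self loops followed by one exit:

```latex
P(d_i = d) = a_{ii}^{\,d-1}\,(1 - a_{ii}), \qquad d = 1, 2, 3, \dots
\qquad\text{with}\qquad
\mathbb{E}[d_i] = \frac{1}{1 - a_{ii}}
```

Because this probability decreases monotonically in d, the most probable duration under the Markov assumption is always a single frame, whatever the value of a_{ii}. Speech sounds, by contrast, tend to have a typical duration longer than one frame, which is one way to see why the fit is poor.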
In particular, it is known to simply bound the state duration to a minimum and a maximum that are estimated during the training process. Thus a hard limit is set on the state duration: a minimum number of frames must be allocated to a state before transitions out of the state are allowed, and once a maximum state dwell time is reached, additional self loops are not allowed. Using state duration information in the determination of transition probabilities breaks the Markov process assumption, but typically yields better recognition results.
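A minimal sketch of such hard duration bounds follows, assuming a left-to-right model and, to isolate the effect of the bounds, transition scores of zero in the log domain. The search tracks how many consecutive frames have been spent in each state. The names `bounded_viterbi`, `min_dur`, and `max_dur` are hypothetical, and minimum durations are assumed to be at least one frame.

```python
NEG_INF = float("-inf")

def bounded_viterbi(emit, min_dur, max_dur):
    """Best log score when state s must last min_dur[s]..max_dur[s] frames.

    emit[t][s] is the log-likelihood of frame t under state s.
    Returns -inf when no allocation satisfies the bounds.
    """
    n_frames, n_states = len(emit), len(emit[0])
    # best[s][d]: best score of a path that has spent d consecutive frames
    # in state s so far (index 0 is unused).
    best = [[NEG_INF] * (max_dur[s] + 1) for s in range(n_states)]
    best[0][1] = emit[0][0]
    for t in range(1, n_frames):
        new = [[NEG_INF] * (max_dur[s] + 1) for s in range(n_states)]
        for s in range(n_states):
            # Self loop: allowed only while the dwell time stays within max_dur.
            for d in range(2, max_dur[s] + 1):
                new[s][d] = best[s][d - 1] + emit[t][s]
            # Entry from the previous state: allowed only once that state
            # has been occupied for at least its minimum duration.
            if s > 0:
                entry = max(best[s - 1][min_dur[s - 1]:], default=NEG_INF)
                new[s][1] = entry + emit[t][s]
        best = new
    # The final state must also have met its minimum duration.
    return max(best[-1][min_dur[-1]:], default=NEG_INF)
```

Widening the bounds recovers the unconstrained alignment, while tight bounds can rule out the acoustically best allocation altogether, which is the sense in which the limit is hard.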
More complex systems having large amounts of training data can accurately model state transition probabilities as a function of the state duration. However, for applications in which as few as two utterances are used to train an HMM, it is difficult to estimate accurate probability distributions for the state transition penalties because of the small amount of training data. Accordingly, the penalties may produce erroneous results.
Consequently, there is a need for an improved system of using state duration information to generate transition penalties in a system having minimal training information.