Speaker-dependent speech recognition systems use a feature extraction algorithm to perform signal processing on each frame of the input speech and to output a feature vector representing that frame. This processing takes place at the frame rate. The frame period is generally between 10 and 30 ms, and will be exemplified herein as 20 ms in duration. A large number of different features are known for use in voice recognition systems.
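By way of illustration only, the frame-by-frame processing described above may be sketched as follows. The 8000 Hz sample rate, the non-overlapping frames, and the two features chosen (log energy and zero-crossing count) are assumptions made for this example; they are not features prescribed by any particular recognition system.

```python
import math

def frame_features(samples, sample_rate=8000, frame_ms=20):
    """Split samples into non-overlapping frames and return one
    feature vector (log energy, zero-crossing count) per frame.

    sample_rate and the two features are illustrative assumptions.
    """
    frame_len = sample_rate * frame_ms // 1000  # 160 samples for 20 ms at 8 kHz
    features = []
    # Step through the signal one whole frame at a time; a trailing
    # partial frame is discarded.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        log_energy = math.log(energy + 1e-10)  # small offset avoids log(0)
        zero_crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        features.append((log_energy, zero_crossings))
    return features
```

With these assumed values, one second of input speech yields 50 feature vectors, one for each 20 ms frame.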
Generally speaking, a training algorithm uses the features extracted from the sampled speech of one or more utterances of a word or phrase to generate parameters for a model of that word or phrase. This model is then stored in a model storage memory. These models are later used during speech recognition. The recognition system compares the features of an unknown utterance with stored model parameters to determine the best match. The best matching model is then output from the recognition system as the result.
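The training, storage, and comparison steps described above may be sketched, in greatly simplified form, as follows. The model here is merely an averaged feature vector and the comparison is a squared Euclidean distance; both are stand-ins assumed for illustration and are far simpler than the models used in practice. The names train_model, recognize, and model_store are hypothetical.

```python
def mean_vector(frames):
    """Average the feature vectors of one utterance, dimension by dimension."""
    dims = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dims)]

def train_model(utterances):
    """Generate model parameters from one or more utterances of a word.

    Illustrative assumption: the "model" is just the mean of the
    per-utterance mean feature vectors.
    """
    means = [mean_vector(frames) for frames in utterances]
    dims = len(means[0])
    return [sum(m[d] for m in means) / len(means) for d in range(dims)]

def recognize(frames, model_store):
    """Return the name of the stored model closest to the unknown utterance."""
    unknown = mean_vector(frames)

    def distance(name):
        return sum((a - b) ** 2 for a, b in zip(unknown, model_store[name]))

    return min(model_store, key=distance)
```

Note that this sketch always returns one of the stored models, even when the unknown utterance matches none of them well.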
It is known to use a Hidden Markov Model (HMM) based recognition system for this process. HMM recognition systems allocate frames of the utterance to states of the HMM. The frame-to-state allocation that produces the largest probability, or score, is selected as the best match.
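The frame-to-state allocation may be illustrated by the following minimal sketch of a Viterbi search over a left-to-right HMM. It assumes that per-frame log-likelihoods for each state are already available, that each state may only repeat or advance to the next state, and that a single pair of transition log-probabilities is shared by all states. These simplifications, and the name viterbi_align, are assumptions of the example rather than features of any particular system.

```python
import math

NEG_INF = float("-inf")

def viterbi_align(frame_scores, log_stay, log_next):
    """Find the highest-scoring allocation of frames to HMM states.

    frame_scores[t][s] is the log-likelihood of frame t in state s.
    The path must start in state 0 and end in the last state, so the
    utterance needs at least as many frames as the model has states.
    Returns (best_score, allocation), where allocation[t] is the state
    assigned to frame t.
    """
    n_frames = len(frame_scores)
    n_states = len(frame_scores[0])
    score = [[NEG_INF] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = frame_scores[0][0]
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s] + log_stay
            move = score[t - 1][s - 1] + log_next if s > 0 else NEG_INF
            if move > stay:
                best, back[t][s] = move, s - 1
            else:
                best, back[t][s] = stay, s
            score[t][s] = best + frame_scores[t][s]
    # Trace back from the final state to recover the allocation.
    state = n_states - 1
    allocation = [state]
    for t in range(n_frames - 1, 0, -1):
        state = back[t][state]
        allocation.append(state)
    allocation.reverse()
    return score[-1][-1], allocation
```

In a recognizer of this kind, such a search would be run once per stored model, and the model yielding the largest score would be selected as the best match.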
Many voice recognition systems do not distinguish between valid and invalid utterances. Rather, these systems choose whichever of the stored models is the closest match. Some systems use an out-of-vocabulary rejection algorithm, which seeks to detect and reject invalid utterances. This is a difficult problem in small-vocabulary, speaker-dependent speech recognition systems due to the dynamic size and unknown composition of the vocabulary. These algorithms also degrade under noisy conditions, such that the number of false rejections increases.
In practice, out-of-vocabulary rejection algorithms must balance correct rejections of invalid utterances against false rejections of valid utterances. The false rejection rate can play a critical role in customer satisfaction, as frequent false rejections, like incorrect matches, will cause frustration. Thus, out-of-vocabulary rejection is a balance between rejecting invalid utterances and meeting user expectations that valid utterances will be recognized.
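The balance described above may be made concrete with a short sketch that computes both rates for a given rejection threshold. The sketch assumes a recognizer that produces a match score for each utterance and rejects any utterance whose score falls below the threshold; the function name and the score convention are hypothetical.

```python
def rejection_rates(valid_scores, invalid_scores, threshold):
    """Return (false_rejection_rate, correct_rejection_rate).

    An utterance is rejected when its score is below the threshold
    (an assumed convention for this example).
    valid_scores: scores of utterances that are in the vocabulary.
    invalid_scores: scores of utterances that are not in the vocabulary.
    """
    false_rejections = sum(1 for s in valid_scores if s < threshold)
    correct_rejections = sum(1 for s in invalid_scores if s < threshold)
    return (
        false_rejections / len(valid_scores),
        correct_rejections / len(invalid_scores),
    )
```

Raising the threshold in such a scheme rejects more invalid utterances but also rejects more valid ones, which is the trade-off noted above.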
Accordingly, it is known to calculate a rejection threshold based upon the noise level. For example, it is known to measure the noise level prior to the detection of the first speech frame. A threshold is calculated from the measurement. An input is rejected if the difference between the word reference pattern and the input speech pattern is greater than the rejection threshold. Such a system is thus dependent upon an arbitrary noise input level. Such a measurement cannot be relied upon to produce a meaningful rejection decision.
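The known noise-based approach may be sketched as follows. The linear mapping from noise level to threshold, and the constants base and slope, are hypothetical values assumed solely for illustration; the known systems referred to above do not necessarily use this mapping. Samples are assumed to be normalized to the range -1 to 1.

```python
import math

def noise_level_db(pre_speech_frames):
    """Estimate the noise level, in dB, from frames received before
    the first speech frame is detected."""
    energies = [sum(x * x for x in f) / len(f) for f in pre_speech_frames]
    mean_energy = sum(energies) / len(energies)
    return 10.0 * math.log10(mean_energy + 1e-12)  # offset avoids log10(0)

def rejection_threshold(noise_db, base=10.0, slope=0.1):
    """Map the noise level to a threshold (hypothetical linear mapping:
    louder noise gives a larger, more tolerant threshold)."""
    return base + slope * noise_db

def is_rejected(distance, pre_speech_frames):
    """Reject when the distance between the word reference pattern and
    the input speech pattern exceeds the noise-derived threshold."""
    return distance > rejection_threshold(noise_level_db(pre_speech_frames))
```

Under these assumptions, an input with the same pattern distance may be rejected after a quiet pre-speech interval and accepted after a noisy one, which illustrates the dependence of the decision on the noise level that happens to precede the utterance.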
Accordingly, there is a need for an improved method of providing a basis for rejecting utterances in a voice recognition system.