1. Field of Invention
This invention is directed to automated speech recognition systems. In particular, this invention is directed to a method and an apparatus for classifying and verifying recognition hypotheses of speech input to the automated speech recognition system. More particularly, this invention is directed to a system which uses multiple confidence measures in an integrated classification and verification subsystem.
2. Description of Related Art
Flexible and robust automated speech recognition systems have long been sought. As shown in FIG. 1, the current paradigm for automated speech recognition systems is to convert spoken words into spectral coefficients and then input those spectral coefficients into a speech recognition subsystem that generates recognition hypotheses. The recognition hypotheses are generated based on some arbitrarily selected confidence measure (CM) so that the speech recognition subsystem outputs, as the recognized unit of speech, the recognition hypothesis which most closely matches the criteria of the confidence measure. The recognized unit of speech can be a phoneme, a word, a string, or the like. The recognition hypothesis output by the speech recognition subsystem is input to a verification subsystem, which attempts to verify that the recognition hypothesis output by the speech recognition subsystem for the current set of spectral coefficients is correct.
In particular, hidden Markov models (HMMs) have been used to implement the speech recognition and verification subsystems. HMMs have allowed speech recognition systems to accommodate spontaneous speech input. Although this capability facilitates a friendlier user interface, it also poses a number of problems, including out-of-vocabulary words, false starts, disfluency, and acoustical mismatch. Thus, automated speech recognition systems must be able to detect and recognize "keywords", i.e., the words of the vocabulary of the automated speech system, while rejecting "non-keywords." In general, automated speech recognition systems have limited vocabularies, such as digits and/or user-added names in an automated voice dialing system.
Automated speech recognition (ASR) systems that are able to spot keywords allow users the flexibility to speak naturally without needing to follow a rigid speaking format. Utterance verification (UV) technology is desirable in such automated speech recognition systems. As described in B.-H. Juang, et al., "Minimum Classification Error Rate Methods for Speech Recognition," IEEE Transactions on Speech and Audio Processing, Vol. 5, No. 3, May 1997, pages 257-265 (Juang) (herein incorporated by reference in its entirety) and M.G. Rahim, et al., "Discriminative Utterance Verification for Connected Digits Recognition," IEEE Transactions on Speech and Audio Processing, Vol. 5, No. 3, May 1997, pages 266-277 (Rahim 1) (herein incorporated by reference in its entirety), significant progress has been made in utterance verification (UV) for unconstrained speech using HMMs. Utterance verification (UV) systems introduce a filler (or garbage) model for enhancing keyword detection and absorbing out-of-vocabulary speech. Filler HMMs also allow the false alarm rate (i.e., the false positive or erroneously verified keyword rate) to be reduced through keyword verification following detection and segmentation of speech into keyword hypotheses by the speech recognition subsystem.
As described in Juang, HMM-based speech recognition can be efficiently implemented using a minimum classification error (MCE) training method that minimizes either the empirical error rate or the expected error rate, given an arbitrary choice of the distribution (discriminant) function, rather than the traditional maximum likelihood (ML) method that is based on the distribution estimation formulation. One problem when using HMMs is the estimation problem. Given an observation sequence (or a set of sequences) X, the estimation problem involves finding the "right" model parameter values that specify a source model most likely to produce the given sequence of observations.
The MCE approach to solving the estimation problem involves finding a set of parameters Λ that minimize a predetermined loss measure, such as the expected loss or the empirical loss. Various minimization algorithms, such as the generalized probabilistic descent (GPD) algorithm, can be used to minimize the expected loss. In the GPD-based minimization algorithm, the expected loss is minimized according to an iterative procedure. However, the underlying probability distributions involved in minimizing the expected loss are often unknown. Moreover, MCE is designed only to minimize the recognition error, and is not generally concerned with utterance verification.
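The iterative character of GPD-based MCE training can be illustrated with a minimal sketch. The example below is not the HMM training procedure of Juang; it assumes a toy two-class problem with one-dimensional observations, a hypothetical discriminant function (negative squared distance to a class mean), and a sigmoid loss applied to a misclassification measure. All function names, the step size, and the slope `gamma` are illustrative assumptions.

```python
import math

def discriminant(x, mu):
    # Assumed toy discriminant: larger when x is closer to the class mean mu.
    return -(x - mu) ** 2

def mce_loss(x, label, mus, gamma=1.0):
    # Two-class misclassification measure: positive when the competing class
    # scores higher than the correct class.
    other = 1 - label
    d = -discriminant(x, mus[label]) + discriminant(x, mus[other])
    # Sigmoid loss: a smooth, differentiable stand-in for the 0/1 error count.
    return 1.0 / (1.0 + math.exp(-gamma * d))

def gpd_step(x, label, mus, step=0.05, gamma=1.0):
    # One descent update of the parameters on a single training sample.
    other = 1 - label
    loss = mce_loss(x, label, mus, gamma)
    dl_dd = gamma * loss * (1.0 - loss)          # derivative of the sigmoid
    grad_label = dl_dd * (-2.0 * (x - mus[label]))
    grad_other = dl_dd * (2.0 * (x - mus[other]))
    updated = list(mus)
    updated[label] -= step * grad_label
    updated[other] -= step * grad_other
    return updated

def train(data, mus, epochs=50, step=0.05):
    # Iterate over the training samples, updating the parameters each time.
    for _ in range(epochs):
        for x, label in data:
            mus = gpd_step(x, label, mus, step)
    return mus

def empirical_loss(data, mus):
    # Average smoothed error over the training set.
    return sum(mce_loss(x, y, mus) for x, y in data) / len(data)

data = [(-1.0, 0), (-0.8, 0), (1.0, 1), (1.2, 1)]
initial = [0.2, -0.2]        # deliberately poor starting parameters
trained = train(data, initial)
```

Because the loss is computed from the training samples themselves, the sketch minimizes an empirical loss and needs no knowledge of the underlying probability distributions.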
In the MCE training method, an utterance observation X is assumed to be one of M classes. For recognition of continuous speech or for speech recognition using subword model units, X is usually a concatenated string of observations belonging to different classes. For example, a sentence is a sequence of words, each of which is to be modeled by a distribution. In this situation, one possible training criterion is to minimize the string error rate of the string models constructed from concatenating a set of word or substring models. An MCE-trained HMM generates a word sequence label W for an observation sequence X that minimizes the classification error rate.
Once the speech recognition system has nominally recognized the observation sequence and generated a word sequence for the observation sequence, utterance verification attempts to reject or accept part or all of a nominally recognized utterance based on a computed confidence score. Utterance verification also attempts to reject erroneous but valid keyword strings (i.e., "putative errors"). Utterance verification is particularly useful in situations where utterances are spoken without valid keywords, or when significant confusion exists among keywords, thus resulting in a high substitution error probability.
To deal with these types of problems, automated speech recognition systems must be able both to correctly recognize keywords embedded in extraneous speech and to reject utterances that do not contain valid keywords or keyword hypotheses that have low confidence scores. Rahim 1 describes an HMM-based verification subsystem that computes a confidence measure that determines whether or not to reject recognized strings. The verification method and apparatus of Rahim 1 test a "null" hypothesis that a given keyword or set of keywords exists within a segment of speech and is correctly recognized against alternative hypotheses that the given keyword or set of keywords does not exist or is incorrectly classified within that speech segment. In Rahim 1, the MCE training method is used to train the HMM-based verification subsystem.
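The test of a null hypothesis against an alternative hypothesis can be sketched as a log-likelihood ratio compared with a threshold. The example below is only an illustration: it assumes one-dimensional Gaussian models as stand-ins for the verification HMMs, and the function names, model parameters, and threshold value are hypothetical.

```python
import math

def gaussian_log_likelihood(frames, mean, var):
    # Log-likelihood of independent frames under a 1-D Gaussian (HMM stand-in).
    return sum(
        -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)
        for x in frames
    )

def log_likelihood_ratio(frames, null_model, alt_model):
    # Duration-normalized log-likelihood ratio of the null hypothesis
    # (keyword present and correctly recognized) to the alternative.
    ll_null = gaussian_log_likelihood(frames, *null_model)
    ll_alt = gaussian_log_likelihood(frames, *alt_model)
    return (ll_null - ll_alt) / len(frames)

def verify(frames, null_model, alt_model, threshold=0.0):
    # Accept the recognition hypothesis when the ratio reaches the threshold.
    return log_likelihood_ratio(frames, null_model, alt_model) >= threshold

null_model = (0.0, 1.0)   # (mean, variance) assumed for the keyword model
alt_model = (3.0, 1.0)    # (mean, variance) assumed for the alternative model
accepted = verify([0.1, -0.2, 0.3], null_model, alt_model)
rejected = verify([2.9, 3.1, 3.0], null_model, alt_model)
```

Raising the threshold makes the test stricter, so that fewer hypotheses are accepted.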
In the HMM-based verification subsystem described in R. A. Sukkar, et al., "Utterance Verification of Keyword Strings Using Word-Based Minimum Verification Error (WB-MVE) Training", Proceedings of the International Conference on Acoustics, Speech and Signal Processing, IEEE, Piscataway, N.J. (1996) (Sukkar) (herein incorporated by reference in its entirety) and M. G. Rahim, et al., "String-Based Minimum Verification Error (SB-MVE) Training for Speech Recognition", Computer Speech and Language (1997) 11, pages 147-160, Academic Press, Ltd. (Rahim 2) (herein incorporated by reference in its entirety), the HMMs are trained using a minimum verification error (MVE) training method rather than the minimum classification error (MCE) training method.
Although MCE training reduces the recognition error rate as well as the verification error rate, the objective function used in recognition training is not consistent with that for utterance verification training. In contrast to MCE, in minimum verification error (MVE) training, a misverification measure is used to minimize the expected verification error rates, thus reducing the combined false alarm rate and the false rejection (i.e., false negative) rate. In particular, MVE is used to adapt the parameters of the verification HMMs.
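The two error rates that MVE training seeks to reduce can be made concrete with a short sketch. The example assumes that each hypothesis already carries a confidence score and a label indicating whether the recognition was in fact correct; the function name, the sample scores, and the threshold are illustrative assumptions rather than values from the cited references.

```python
def verification_error_rates(scored, threshold):
    # scored: list of (confidence_score, is_correct) pairs, one per hypothesis.
    # False alarm: an incorrect hypothesis that is nevertheless accepted.
    false_alarms = sum(1 for s, ok in scored if not ok and s >= threshold)
    # False rejection: a correct hypothesis that is rejected.
    false_rejects = sum(1 for s, ok in scored if ok and s < threshold)
    n_incorrect = sum(1 for _, ok in scored if not ok)
    n_correct = len(scored) - n_incorrect
    far = false_alarms / n_incorrect if n_incorrect else 0.0
    frr = false_rejects / n_correct if n_correct else 0.0
    return far, frr

scored = [(0.9, True), (0.4, True), (0.7, False), (0.2, False)]
far, frr = verification_error_rates(scored, threshold=0.5)
```

Moving the threshold trades one error type against the other, which is why the combined rate, rather than either rate alone, is the quantity of interest.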
FIG. 2 shows, in greater detail, a basic architecture of the two-stage system shown in FIG. 1. In the first stage, recognition is performed via Viterbi beam search using a set of recognition HMMs 126. These recognition HMMs 126 are trained by adjusting the parameters Λ of the recognition HMMs 126 using maximum likelihood estimation followed by string-based minimum classification error (MCE) training. During recognition, each utterance is segmented into keyword hypotheses and is then passed to the verification subsystem 130.
In the second stage, each keyword hypothesis is verified using a set of verification HMMs 134. These verification HMMs 134 are initially trained using maximum likelihood estimation followed by string-based minimum verification error (MVE) training. During verification, a hypothesis is tested over the entire utterance, resulting in a confidence score. The utterance is rejected if the confidence score is below a predetermined operating test threshold. The verification HMMs 134 include keyword models, which model correctly recognized keywords, anti-keyword models, which correspond to incorrect recognition of one keyword as another keyword, and a filler model, which corresponds to out-of-vocabulary words.
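One way the three kinds of verification models might contribute to an utterance-level decision is sketched below. The sketch assumes that per-segment log-likelihoods under the keyword, anti-keyword, and filler models have already been computed; averaging the two alternative likelihoods and averaging segment scores over the utterance are illustrative assumptions, not the specific combination used by the verification subsystem 130.

```python
import math

def segment_score(ll_keyword, ll_anti, ll_filler):
    # Log of the average of the anti-keyword and filler likelihoods, computed
    # with the larger term factored out for numerical stability.
    m = max(ll_anti, ll_filler)
    ll_alt = m + math.log(
        (math.exp(ll_anti - m) + math.exp(ll_filler - m)) / 2.0
    )
    # Positive when the keyword model explains the segment better.
    return ll_keyword - ll_alt

def utterance_confidence(segments):
    # segments: list of (ll_keyword, ll_anti, ll_filler), one per hypothesis.
    scores = [segment_score(*seg) for seg in segments]
    return sum(scores) / len(scores)

def accept_utterance(segments, threshold):
    # The utterance is rejected when its confidence falls below the threshold.
    return utterance_confidence(segments) >= threshold

good = [(-10.0, -14.0, -15.0), (-8.0, -13.0, -12.0)]
bad = [(-15.0, -10.0, -11.0)]
good_accepted = accept_utterance(good, threshold=0.0)
bad_accepted = accept_utterance(bad, threshold=0.0)
```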
However, while confidence measures, such as MCE and MVE, have reduced the error rates of the recognition and verification subsystems, respectively, each of these confidence measures is implemented independently of the others. Thus, there is no consistent way to combine the various confidence measures, whether in training the HMMs or in testing the input speech. Even when the various confidence measures have been combined, they have only been combined on an ad hoc basis for the particular use to which the automated speech recognition system was to be put, and then only for testing the input speech, not for training the HMMs.
Moreover, even with discriminative (i.e., MCE and MVE) training techniques, the likelihood ratio test used for UV cannot be made optimal. That is, due to assumptions made regarding the hypotheses' probability density functions and the inability to estimate exactly, for each hypothesis, the parameters of its probability density function, the likelihood ratio test used with discriminative (i.e., MCE and MVE) training techniques is not guaranteed to be the most powerful test for UV.