Telecommunications service providers and other organizations which provide telephone-based services to remote customers or users have historically relied on human operators or agents to act as an interface between the customer or user and whatever instrumentality is used by the organization to actually provide the service. For example, telephone service providers have long provided enhanced telephone services of various sorts to customers using human operators. The operator receives from the customer a request for a service (e.g., credit card billing for a telephone call) and operates a suitable interface to the telephone network to cause the requested service to be provided. In some cases, the operator may directly deliver the requested service (e.g., by announcing to the customer a requested directory listing or the like). Banks, airlines, government agencies, and other organizations provide services to customers and users in a similar manner.
It is expensive to deliver services using human operators or agents. Many service transactions do not require complex interaction between the customer and the operator or agent. Accordingly, service providers have developed automated systems for providing many of the services previously executed through human operators or agents, thereby reducing costs and reserving human operators for transactions requiring human assistance such as those involving complex customer interaction. Many automated service systems require the customer to interact by pressing keys on the telephone, which is inconvenient for many customers.
Accordingly, service providers and others have sought automated speech recognition (ASR) systems capable of receiving interaction from customers or users via the spoken voice for use in providing telephone-based services to callers. In order for ASR systems to be broadly applicable, they must be "speaker-independent"--i.e., capable of accurately recognizing speech from a large plurality of callers without being exposed in advance to the speech patterns of each such caller. Many such systems have been developed. One approach to the construction of such a system employs two main components: a recognition component which, given a sample of speech, emits as a hypothesis the most likely corresponding translation from the recognition component's predefined vocabulary of speech units; and a verification component, which determines whether the speech sample actually contains speech corresponding to the recognition component's hypothesis. The utterance verification component is used to reliably identify and reject out-of-vocabulary speech and extraneous sounds.
Several technologies have been developed to implement the recognition component in ASR systems, and several technologies, some similar to and others different from those used in recognition, have been used to implement the utterance verification component. The particular recognition technology employed in an ASR system does not necessarily dictate the technology used for utterance verification. It is generally not apparent, a priori, whether a selected recognition technology may be advantageously used with a particular utterance verification technology, or how two candidate technologies may be usefully married to produce a working ASR system. Acceptable results have been obtained in ASR systems having recognition components which use acoustic speech models employing Hidden Markov Models (HMMs) as described in L. R. Rabiner and B. H. Juang, "An Introduction to Hidden Markov Models," IEEE ASSP Magazine, January 1986, pp. 4-16.
Various recognition and utterance verification components have employed models based on relatively large speech units, such as words or phrases. In a given ASR system, the utterance verification component typically employs speech units equivalent in size to those employed by the recognition component because the units output from the recognition component are supplied to the utterance verification component. U.S. Pat. No. 5,717,826, and R. A. Sukkar, A. R. Setlur, M. G. Rahim, and C. H. Lee, "Utterance Verification of Keyword Strings Using Word-Based Minimum Verification Error (WB-MVE) training," Proc. ICASSP '96, Vol. I, pp. 518-521, May 1996, disclose ASR systems providing utterance verification for keyword strings using word-based minimum verification error training.
Systems which employ large speech units generally require that the recognition component and the utterance verification component be trained for each speech unit in their vocabularies. The need for training for each speech unit has several disadvantages. In order for the ASR system to be speaker independent, speech samples for each large unit (e.g., whole words and/or phrases) must be obtained from a plurality of speakers. Obtaining such data, and performing the training initially, is resource intensive. Moreover, if a speech unit must later be added to the vocabulary, additional samples for that unit must be obtained from a plurality of speakers.
It is believed that most human languages employ a limited number of basic speech sounds which are concatenated to form words, and that speech in most such languages may be suitably represented by a set of basic speech sounds associated with that language. The basic speech sound units are often referred to as phonemes or "subwords." In order to avoid the disadvantages of ASR systems based on large speech units, systems based on subwords have been developed. In subword-based systems, the results from the recognition component may be available as a string of recognized subwords, and a concatenated group of recognized subwords between two periods of silence may represent a word, phrase, or sentence. One of the main features of subword-based speech recognition is that, if the acoustic subword models are trained in a task-independent fashion, then the ASR system can reliably be applied to many different tasks without the need for retraining. If the ASR system is to be used to recognize speech in a language for which it was not originally trained, it may be necessary to update the language model, but because the number of unique subwords is limited, the amount of training data required is substantially reduced.
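The grouping of recognized subwords between periods of silence can be illustrated with a minimal sketch. The silence label "sil", the function name, and the sample subword labels are hypothetical and are chosen only for illustration; an actual recognizer may mark silence differently.

```python
from typing import List

SILENCE = "sil"  # hypothetical label for a recognized silence segment


def group_subwords(recognized: List[str], silence: str = SILENCE) -> List[List[str]]:
    """Split a string of recognized subwords into groups bounded by silence."""
    groups: List[List[str]] = []
    current: List[str] = []
    for unit in recognized:
        if unit == silence:
            # A silence closes the group collected so far, if any.
            if current:
                groups.append(current)
                current = []
        else:
            current.append(unit)
    # Keep a trailing group that is not followed by silence.
    if current:
        groups.append(current)
    return groups


# Illustrative recognizer output: two subword groups separated by silence.
example = ["sil", "k", "ao", "l", "sil", "hh", "ow", "m", "sil"]
groups = group_subwords(example)
```

Each resulting group is a candidate word, phrase, or sentence that may then be passed on for utterance verification.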
It is generally not apparent, a priori, whether a selected recognition or utterance verification technology which works well for a given speech unit size may be advantageously applied to speech units of a different size. Moreover, the best ways of performing utterance verification on individual subwords, of applying the results therefrom in a meaningful way to words, phrases, or sentences formed by concatenating recognized subwords, and of training subword-based utterance verification models, are still being explored.
Certain methods for task-independent utterance verification have been proposed. For example, in H. Bourlard, B. D'hoore, and J.-M. Boite, "Optimizing Recognition and Rejection Performance in Wordspotting Systems," Proc. ICASSP '94, pp. 373-376, Vol. 1, April 1994, and in R. C. Rose and E. Lleida, "Speech Recognition Using Automatically Derived Acoustic Baseforms," Proc. ICASSP '97, pp. 1271-1274, April 1997, an "on-line garbage" likelihood is computed and a likelihood ratio is then formed between the "on-line garbage" likelihood and the likelihood of the recognized word, phrase, or sentence. In R. A. Sukkar, C. H. Lee, and B. H. Juang, "A Vocabulary Independent Discriminatively Trained Method for Rejection of Non-Keywords in Subword-Based Speech Recognition," Proc. Eurospeech '95, pp. 1629-1632, September 1995, a linear discriminator is defined and trained to construct a subword-level verification score that is incorporated into a string (sentence) level verification score.
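The general idea of a linear discriminator that yields a subword-level score, which is then incorporated into a string-level score, can be sketched as follows. This is only an illustration under stated assumptions: the feature vectors, the weights, and the use of a simple average to combine subword scores are hypothetical and are not taken from the cited paper.

```python
from typing import Sequence


def subword_score(features: Sequence[float], weights: Sequence[float],
                  bias: float = 0.0) -> float:
    """Linear discriminator: weighted sum of verification features plus a bias."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + bias


def string_score(subword_scores: Sequence[float]) -> float:
    """Combine subword-level scores into one string-level score.

    Averaging is an assumption made for this sketch only.
    """
    if not subword_scores:
        raise ValueError("at least one subword score is required")
    return sum(subword_scores) / len(subword_scores)


def accept(score: float, threshold: float = 0.0) -> bool:
    """Accept the recognized string when its score reaches the threshold."""
    return score >= threshold
```

In such a scheme the weights would be obtained by discriminative training, and the threshold would be chosen to trade off false rejections against false acceptances.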
U.S. Pat. No. 5,675,706 also discloses an ASR system employing subword-based recognition and verification. The recognition stage employs subword-based HMM acoustic models. The utterance verification stage has a verifier which employs both a linear discriminator analyzer and HMM "anti-subword" models to reject out-of-vocabulary sounds. Although the linear discriminator analyzer is discriminatively trained, the anti-subword HMM models are not discriminatively trained. Rather, each subword is assigned to one of a few subword classes, and an anti-subword model for each class is trained using all speech segments corresponding to sounds that are not modeled by any of the subwords in that subword class.
Another method that has been used is based on forming a likelihood ratio test between the likelihood of a free subword decoder and the likelihood of the recognized sentence. (See the above-cited papers by Sukkar, Lee, and Juang; and Rose and Lleida.)
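A likelihood ratio test of this general kind can be sketched in the log domain as follows. The function names, the normalization by the number of frames, and the direction of the ratio (recognized sentence over free decoder) are assumptions made for illustration and are not drawn from the cited papers.

```python
def log_likelihood_ratio(loglik_recognized: float, loglik_free: float,
                         num_frames: int) -> float:
    """Per-frame log likelihood ratio of the recognized sentence to the free decoder.

    Both arguments are total log likelihoods over the same utterance.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return (loglik_recognized - loglik_free) / num_frames


def verify(loglik_recognized: float, loglik_free: float,
           num_frames: int, threshold: float) -> bool:
    """Accept the recognized sentence when the ratio reaches the threshold."""
    return log_likelihood_ratio(loglik_recognized, loglik_free, num_frames) >= threshold


# Illustrative values only: a 100-frame utterance.
ratio = log_likelihood_ratio(-1200.0, -1150.0, 100)
```

Because a free subword decoder is less constrained than the recognition grammar, its likelihood would typically be at least as high as that of the recognized sentence, so a usable threshold under these assumptions would generally be negative.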
U.S. Pat. No. 5,675,706 also discloses a verifier which uses a likelihood ratio test. A string-level verification stage determines for each string output by the recognition stage a first string-level verification score derived from the verification scores of each of the individual subwords of which the string is composed, and a second anti-verification (or rejection) score derived from the anti-verification score of each subword. The subword verification and anti-verification scores are obtained using a linear discriminator. These scores are not likelihoods and thus cannot be directly used in a likelihood ratio test. In order to obtain likelihoods, the probability density functions associated with these scores are modeled by Normal distributions. The mean and variance of the verification function for each subword are estimated from the training set of sample sentences.
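The step of converting discriminator scores into likelihoods by way of Normal distributions can be sketched as follows. This is one plausible reading offered for illustration only: the function names, the use of a separate Normal model per score, and the summation of per-subword log densities into a string-level log likelihood ratio are assumptions, not details confirmed by the cited patent.

```python
import math
from statistics import fmean, pvariance
from typing import Sequence, Tuple

Params = Tuple[float, float]  # (mean, variance) of a Normal model


def fit_normal(scores: Sequence[float]) -> Params:
    """Estimate the mean and variance of a score from training samples."""
    mean = fmean(scores)
    var = pvariance(scores, mu=mean)
    if var <= 0.0:
        raise ValueError("variance must be positive to define a density")
    return mean, var


def normal_logpdf(x: float, mean: float, var: float) -> float:
    """Log density of a Normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def string_llr(ver_scores: Sequence[float], ver_params: Sequence[Params],
               anti_scores: Sequence[float], anti_params: Sequence[Params]) -> float:
    """String-level log likelihood ratio (assumed form).

    Sums per-subword log densities of the verification scores and subtracts
    the corresponding sum for the anti-verification scores.
    """
    ver = sum(normal_logpdf(s, m, v) for s, (m, v) in zip(ver_scores, ver_params))
    anti = sum(normal_logpdf(s, m, v) for s, (m, v) in zip(anti_scores, anti_params))
    return ver - anti
```

Under these assumptions, `fit_normal` would be run once per subword on training-set scores, and `string_llr` would be compared against a threshold to accept or reject each recognized string.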
Although the aforementioned ASR systems may be sufficiently reliable for some applications, they are not sufficiently reliable for all applications, and even in applications for which existing systems are deemed adequate, improved recognition and verification accuracy is always desirable.