The use of automatic speech recognition (ASR) technology is now commonplace in everyday life. One application of such technology is in Interactive Voice Response (IVR) systems. IVR systems are commonly used to automate certain tasks that otherwise would be performed by a human being. More specifically, an IVR system creates a spoken dialog between a human speaker and a computer system so that the computer system can perform a task on behalf of the speaker, sparing the speaker or another human being from having to perform the task. This operation generally involves the IVR system acquiring specific information from the speaker. IVR systems may be used to perform very simple tasks, such as allowing a consumer to select from several menu options over the telephone. Alternatively, IVR systems can be used to perform more sophisticated functions, such as allowing a consumer to perform banking or investment transactions over the telephone or to book flight reservations.
Current IVR systems typically are implemented by programming standard computer hardware with special-purpose software. In a basic IVR system, the software includes a speech recognition engine and a speech-enabled application (e.g., a telephone banking application) that is designed to use recognized speech output by the speech recognition engine. The hardware may include one or more conventional computer systems, such as server-class computers, personal computers (PCs), workstations, or other similar hardware. These computer systems may be configured by the software to operate in a client or server mode and may be connected to each other directly or through a network, such as a local area network (LAN) or the Internet. The IVR system also includes appropriate hardware and software for allowing audio data to be communicated to and from the speaker through an audio interface, such as a standard telephone connection.
The speech recognition engine (or “recognizer”) recognizes speech from the speaker by comparing the speaker's utterances to one or more language models stored in a database. Two common types of language models used for this purpose are grammars and statistical language models (SLMs). At least for purposes of this document, the terms “grammar” and “SLM” have mutually exclusive meanings.
In this context, a “grammar” is a set of one or more words and/or phrases (“expressions”), i.e., sentence fragments, that a speaker is expected or required to utter in response to a corresponding prompt, together with the logical relationships between those expressions. The logical relationships include the expected or required order of the expressions, and whether particular expressions are mandatory, optional, alternatives, etc. A recognizer may use various different grammars, according to the type of information required by the speech-enabled application. A grammar usually associates expressions with “tags” that represent meaningful pieces of information in the context of the speech application. A grammar is typically expressed in some form of grammar specification language, such as the Nuance Grammar Specification Language (GSL) or the grammar syntax specified by the Speech Recognition Grammar Specification Version 1.0, World Wide Web Consortium (W3C), Mar. 16, 2004.
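As an illustration only, the notion of a grammar described above (ordered expressions, optional and alternative slots, and tags) can be sketched in Python. The sketch does not use GSL or the W3C grammar syntax; the structure `GRAMMAR`, the functions `expand` and `parse`, and the banking phrases are hypothetical names chosen for this example.

```python
from itertools import product

# Hypothetical toy grammar: an ordered list of slots.
# Each slot is (alternatives, optional); each alternative maps a spoken
# expression to a tag (None means the expression carries no tag).
GRAMMAR = [
    ({"i want to": None, "please": None}, True),           # optional preamble
    ({"check": "CHECK", "transfer": "TRANSFER"}, False),   # mandatory action
    ({"my balance": "BALANCE", "funds": "FUNDS"}, False),  # mandatory object
]

def expand(grammar):
    """Enumerate every utterance the grammar accepts, mapped to its tags."""
    slot_choices = []
    for alternatives, optional in grammar:
        choices = list(alternatives.items())
        if optional:
            choices.append(("", None))  # an optional slot may be skipped
        slot_choices.append(choices)
    accepted = {}
    for combo in product(*slot_choices):
        words = " ".join(expr for expr, _ in combo if expr)
        accepted[words] = [tag for _, tag in combo if tag is not None]
    return accepted

def parse(utterance, grammar=GRAMMAR):
    """Return the tag list for an in-grammar utterance, or None otherwise."""
    return expand(grammar).get(utterance.lower().strip())

print(parse("check my balance"))        # ['CHECK', 'BALANCE']
print(parse("please transfer funds"))   # ['TRANSFER', 'FUNDS']
print(parse("balance my check"))        # None (wrong order, out of grammar)
```

Enumerating every accepted utterance is practical only for a toy grammar of this size; it is used here solely to make the ordering, optionality, and tagging relationships explicit.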
An SLM, on the other hand, is a model which assigns probabilities to words or sequences of words, i.e., probabilities that the words or word sequences will occur in a given speech context. An SLM is normally generated by applying a set of training data, or “corpus”, to an SLM training algorithm, called an “SLM trainer”. Examples of such algorithms are well-known in the art. The corpus can be a set of sample words and/or phrases (“transcriptions”) that a speaker can say (or has said) in a given context or application. In that case, the SLM is a “word-based SLM”. In general, the larger the corpus is, the better the quality of the resulting SLM will be.
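By way of a minimal sketch, a word-based SLM can be illustrated in Python as a bigram model estimated by maximum likelihood from a small corpus of transcriptions. This is one simple, well-known estimation technique and is not asserted to be the SLM trainer of any particular product; the function names, the sentence-boundary tokens `<s>` and `</s>`, and the sample corpus are assumptions made for this example.

```python
from collections import Counter

def train_bigram_slm(corpus):
    """Maximum-likelihood bigram model: P(w | prev) = count(prev, w) / count(prev)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for transcription in corpus:
        tokens = ["<s>"] + transcription.lower().split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return {bg: n / context_counts[bg[0]] for bg, n in bigram_counts.items()}

def sequence_probability(model, utterance):
    """Probability of a whole utterance as a product of bigram probabilities."""
    tokens = ["<s>"] + utterance.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= model.get((prev, word), 0.0)  # unseen bigrams get zero (no smoothing)
    return prob

# Hypothetical corpus of transcriptions for a banking application.
CORPUS = [
    "check my balance",
    "check my account",
    "transfer funds",
    "check my balance please",
]

MODEL = train_bigram_slm(CORPUS)
print(sequence_probability(MODEL, "check my balance"))     # approximately 0.25
print(sequence_probability(MODEL, "transfer my balance"))  # 0.0, bigram never observed
```

The zero probability for the second utterance shows why corpus size matters: without smoothing, any word sequence absent from the training data cannot be scored at all.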
A “class-based SLM” is an SLM in which one or more of the word sequences have been replaced by a rule, which is called a “class”. This approach is useful, for example, when the amount of training data for the SLM is limited. A class-based SLM allows certain words or sequences in the SLM to be grouped together and generalized, such as by using the general term (class) “City” to replace specific city names. Each class in a class-based SLM is defined by a separate grammar or SLM.
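The effect of a class can be sketched in Python as follows. In this illustration, city names in the training transcriptions are replaced by a class token before a bigram model is estimated, and the class itself is defined by a small table of member probabilities. The table `CITY_CLASS`, its probabilities, the token `CITY`, and all function names are hypothetical, and maximum-likelihood bigram estimation is used only because it is simple.

```python
import re
from collections import Counter

# Hypothetical "City" class: members and their within-class probabilities.
CITY_CLASS = {"boston": 0.5, "denver": 0.25, "new york": 0.25}

def tag_classes(text, members, label="CITY"):
    """Replace each class member with the class label; also return the members found."""
    text = text.lower()
    found = []
    for member in sorted(members, key=len, reverse=True):  # longest members first
        pattern = r"\b" + re.escape(member) + r"\b"
        found.extend(re.findall(pattern, text))
        text = re.sub(pattern, label, text)
    return text, found

def train_bigrams(sentences):
    """Maximum-likelihood bigram probabilities over class-tagged sentences."""
    context, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context[prev] += 1
            bigrams[(prev, word)] += 1
    return {bg: n / context[bg[0]] for bg, n in bigrams.items()}

def probability(model, utterance, members=CITY_CLASS):
    """Bigram probability of the tagged utterance times P(member | class)."""
    tagged, found = tag_classes(utterance, members)
    tokens = ["<s>"] + tagged.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= model.get((prev, word), 0.0)
    for member in found:
        prob *= members[member]
    return prob

# "denver" never occurs in the training corpus.
CORPUS = ["fly to boston", "fly to new york", "travel to boston"]
MODEL = train_bigrams(tag_classes(s, CITY_CLASS)[0] for s in CORPUS)

print(probability(MODEL, "fly to denver"))  # nonzero, although "denver" was unseen
print(probability(MODEL, "fly to paris"))   # 0.0, "paris" is not in the class
```

Because all three training sentences contribute to the statistics of the single token `CITY`, the model generalizes to a class member that never appeared in the corpus, which is the benefit described above for limited training data.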
A central problem in the deployment of a successful commercial speech application is the upfront cost of developing the initial application. High accuracy is required “out-of-the-box” in order to satisfy the end customer, but achieving it often requires pilot phases and speech scientists to optimize the accuracy of each component of the application. Consequently, it is very desirable to have reusable components that exhibit cross-application robustness, yet this is difficult to achieve.
To that end, ASR vendors commonly package generic grammars for collection of certain canonical pieces of information, such as date, time, digit strings, dollar amounts, etc. ASR vendors also supply acoustic models that are task-independent. SLMs, however, tend to be specific to the domain of the target application (task), so cross-application robustness is much harder to achieve for SLMs.
The portability of language models across applications has been extensively studied in the context of conversational speech recognition. There have been two main approaches to attempting to provide cross-domain (cross-application) robustness of language models: 1) training the language model with a large amount of data from different sources, with the hope that the resulting language model will be a good representation of the general language; or 2) interpolating or adapting a general language model with domain-specific language models, so as to improve the language model using limited in-domain resources.
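The second approach can be illustrated, under simplifying assumptions, by linear interpolation of two unigram models in Python. Linear interpolation is one common way of combining language models and is shown only as an example of the general idea; the function `interpolate`, the weight of 0.7, and both probability tables are hypothetical.

```python
def interpolate(p_general, p_domain, weight):
    """Linear interpolation: P(w) = weight * P_domain(w) + (1 - weight) * P_general(w)."""
    vocabulary = set(p_general) | set(p_domain)
    return {
        word: weight * p_domain.get(word, 0.0)
        + (1.0 - weight) * p_general.get(word, 0.0)
        for word in vocabulary
    }

# Hypothetical unigram models; each table sums to 1.
P_GENERAL = {"the": 0.5, "balance": 0.1, "weather": 0.4}  # trained on broad data
P_DOMAIN = {"balance": 0.6, "transfer": 0.4}              # trained on limited in-domain data

P_MIXED = interpolate(P_GENERAL, P_DOMAIN, weight=0.7)
for word in sorted(P_MIXED):
    print(word, round(P_MIXED[word], 2))
```

In this sketch the in-domain word “transfer”, which is absent from the general model, receives a nonzero probability, while general words such as “the” remain covered. In practice the interpolation weight would typically be tuned on held-out in-domain data.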
Class-based SLMs have been studied in the context of directed dialog commercial speech applications. However, all known studies relied on data from the target application to train the class-based SLMs and did not investigate the generic nature of class-based SLMs. One problem with relying on target-application data is that it is sometimes difficult and labor-intensive to acquire a training corpus of sufficient size and quality for a particular target application.