1. Field of the Invention
Embodiments described herein are directed to a spoken dialog system using a best-fit language model and best-fit grammar. Specifically, the dialog system selects both a best-fit language model among a general-task language model and numerous dialog-state dependent language models and a best-fit grammar among a general-purpose grammar and numerous dialog-state dependent sub-grammars.
2. Related Art
A large vocabulary continuous speech recognizer (“LVCSR”) is a key component of a spoken dialog system. An LVCSR's performance directly affects dialog system performance. Almost all LVCSR systems use language models (“LM”) to improve recognition accuracy. LMs for continuous speech recognition are usually built from a large set of training sentences in a specific domain. Current LVCSRs in spoken dialog systems typically use a single LM, which covers a general task. The most commonly used LM is the statistical LM (“SLM”), i.e., the n-gram model.
The n-gram represents basic probabilities of occurrences of n-word sequences. N-grams are task dependent and can be instrumental in improving recognition accuracy. The standard n-gram LM captures the structure of a spoken language by assigning probabilities to words conditioned on n−1 preceding words. The value of n is usually kept low (two or three), since the number of parameters increases exponentially with n, and the training data is sparse in the early phases of system development. Thus, standard n-gram LMs do not model longer distance correlations. They also do not take advantage of linguistic knowledge or structure other than that covered within n-word sequences.
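As an illustration of the n-gram description above, the following is a minimal sketch of a bigram (n = 2) model estimated by relative frequency, together with a count of how many parameters a full n-gram table would need. The function names `train_bigram` and `parameter_count`, the sentence boundary tokens, and the toy corpus are assumptions made for this example; a practical recognizer would also apply smoothing, which is omitted here.

```python
from collections import Counter


def train_bigram(sentences):
    """Estimate maximum-likelihood bigram probabilities P(word | previous word)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # Hypothetical boundary markers so the first and last words have context.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return {
        pair: count / context_counts[pair[0]]
        for pair, count in bigram_counts.items()
    }


def parameter_count(vocab_size, n):
    """Number of entries in a full n-gram table: one per n-word sequence."""
    return vocab_size ** n


# Toy training sentences, invented for illustration only.
corpus = ["show me flights", "show me fares", "book a flight"]
model = train_bigram(corpus)

print(model[("show", "me")])      # 1.0
print(model[("me", "flights")])   # 0.5
print(parameter_count(10_000, 2)) # 100000000
print(parameter_count(10_000, 3)) # 1000000000000
```

With a 10,000-word vocabulary, moving from n = 2 to n = 3 multiplies the table size by 10,000, which is why n is usually kept at two or three when training data is sparse.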
The single-LM method does not employ dialog-state dependent LMs to improve recognition accuracy. Language modeling for the speech recognizer in a dialog system may, however, take one of several forms. Human input can be constrained through a directed dialog, allowing a decoder to use a state-specific LM to improve recognition accuracy. In this approach, dialog states are used to partition the whole set of utterances into subsets, and a standard n-gram LM is then trained from each partitioned set. Recent research articles have reported the use of dialog-state dependent LMs for dialog systems. Yet there is no way to prevent these dialog-state dependent LMs from over-specializing, such that user utterances that do not fall within the current state are penalized.
As such, a new method that uses likelihood scores from the LVCSR to select the best-fit LM among a general-task LM and the dialog-state dependent LMs is needed. This new method takes advantage of dialog-state dependent LMs to improve recognition accuracy and, at the same time, avoids the over-specialization of LMs by using the general-task LM.
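The selection step described above might be sketched as follows. This is an illustration only, under the assumption that the recognizer has already decoded the same utterance once per candidate LM and reported one log-likelihood score for each; the function name `select_best_fit_lm`, the LM labels, and the score values are hypothetical and are not taken from any particular recognizer.

```python
from typing import Dict, Tuple


def select_best_fit_lm(decode_scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the name and score of the LM with the highest log-likelihood."""
    if not decode_scores:
        raise ValueError("need at least one candidate LM score")
    best_name = max(decode_scores, key=decode_scores.get)
    return best_name, decode_scores[best_name]


# Hypothetical log-likelihood scores for one utterance, one per candidate LM.
# Here the user said something outside the current dialog state, so the
# general-task LM fits better than the state-dependent LMs.
scores = {
    "general": -42.7,
    "state:ask_date": -55.1,
    "state:ask_city": -61.3,
}

best_name, best_score = select_best_fit_lm(scores)
print(best_name)  # general
```

Keeping the general-task LM among the candidates is what guards against over-specialization: an utterance that does not match the current state is not forced through a state-dependent LM that scores it poorly.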
Most spoken dialog systems further use a grammar to improve their performance. The grammar can be used by either the speech recognition engine or the parser (the language understanding module). The grammar specifies the phrase and sentence patterns that are allowed to pass the recognition engine or the parser. While a grammar that consists of a limited number of patterns can be written relatively easily, such a grammar only allows a user to speak in a limited number of ways. Such a deficiency may result in a disappointing user experience. Writing a grammar that includes all possible spoken patterns by all possible users, however, is nearly impossible. In addition, a complex grammar is more likely to generate ambiguities, i.e., one user input may match multiple grammar patterns, each of which may yield a different interpretation of the user input.
Current spoken dialog systems typically use a single grammar. Several problems are related to such use. First, it is difficult for a single grammar to cover all utterance patterns, even for moderate tasks. The complexity makes grammar writing time-consuming and tedious. Moreover, if the grammar does not cover a sufficient number of utterance patterns, the probability that a user's utterance is rejected increases, even when the query or response is correct. Second, the likelihood that the same user utterance will be matched with multiple patterns in a grammar increases as the grammar's complexity increases, causing ambiguities. Third, every grammar is task dependent and thus not portable across different tasks. That is, a new grammar must be written for each new task.
A best-fit grammar strategy to improve the performance and user experience of dialog systems and make grammar writing less complex is thus needed to solve the above-described problems. Allowing a dialog system to choose the best-fit grammar from a general-purpose grammar and dialog-state dependent sub-grammars will prove beneficial.