Voice dialog systems are widely used in applications such as in-vehicle communication and travel reservation and inquiry systems. A basic goal of a voice dialog system is to understand the user's intentions from speech and to execute commands based on those intentions.
Voice dialog systems can be rule-based or statistical. A rule-based dialog system operates on a single hypothesis regarding the current state of the dialog, which represents what the system has so far determined about the user's intentions from the user's speech. The system contains pre-defined system actions for each dialog state, such as prompting the user for more information or carrying out the user's request. Speech recognition and natural language understanding are used to determine the transition from one dialog state to another, following the action. This transition is governed by deterministic rules, which allow transitions only to a pre-defined set of states.
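As a minimal sketch of the rule-based approach, the deterministic transitions can be pictured as a lookup table keyed by the current dialog state and the recognized intent. The state names, intent labels, and actions here are hypothetical and chosen only for illustration; they are not taken from any particular system.

```python
# Hypothetical dialog states mapped to their pre-defined system actions.
ACTIONS = {
    "start": "prompt: What would you like to do?",
    "await_destination": "prompt: Where would you like to go?",
    "done": "execute: book trip",
}

# Deterministic rules: (current state, recognized intent) -> next state.
# Only transitions listed here are allowed.
TRANSITIONS = {
    ("start", "book_trip"): "await_destination",
    ("await_destination", "give_destination"): "done",
}


def step(state, intent):
    """Return the next state, or stay in place if no rule matches."""
    return TRANSITIONS.get((state, intent), state)


def run(intents, state="start"):
    """Feed a sequence of recognized intents through the rules and
    return the list of states visited, including the initial state."""
    trace = [state]
    for intent in intents:
        state = step(state, intent)
        trace.append(state)
    return trace
```

Because the system tracks a single state at a time, an intent with no matching rule (or a misrecognized one) leaves no alternative hypothesis to fall back on; in this sketch the dialog simply stays where it is.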
In a statistical dialog model, the system uses a probabilistic model to represent its knowledge of the user's possible intentions. The system thus considers multiple hypotheses of the user's intention and the corresponding results of voice command recognition. The optimum system response is determined based on the probabilities of the hypotheses, and the recognition result can be subject to a confirmation process so that the intention of a command can be better determined or confirmed.
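One simple way to picture the statistical approach is a probability distribution over intent hypotheses that is updated with each recognition result, with a confirmation step when no hypothesis is sufficiently probable. The sketch below assumes a plain Bayesian update and a fixed confidence threshold; the hypothesis names, the likelihood values, and the threshold of 0.8 are illustrative assumptions, and practical systems typically use richer models.

```python
def update_belief(prior, likelihood):
    """Bayesian update: multiply each hypothesis's prior probability by the
    likelihood of the recognition result under it, then renormalize."""
    unnormalized = {h: prior[h] * likelihood.get(h, 0.0) for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        # The observation rules out every hypothesis; keep the prior.
        return dict(prior)
    return {h: p / total for h, p in unnormalized.items()}


def choose_response(belief, threshold=0.8):
    """Execute the most probable hypothesis if it is confident enough,
    otherwise ask the user to confirm it."""
    best = max(belief, key=belief.get)
    if belief[best] >= threshold:
        return ("execute", best)
    return ("confirm", best)


# Two hypothetical intent hypotheses, initially equally likely.
belief = {"lower_tv_volume": 0.5, "lower_radio_volume": 0.5}

# A weakly informative recognition result: the system asks for confirmation.
belief = update_belief(belief, {"lower_tv_volume": 0.6, "lower_radio_volume": 0.4})
first_response = choose_response(belief)

# A strongly informative result (e.g. after confirmation): the system executes.
belief = update_belief(belief, {"lower_tv_volume": 0.9, "lower_radio_volume": 0.1})
second_response = choose_response(belief)
```

Keeping all hypotheses alive, rather than committing to one state, is what lets the confirmation step shift probability toward the intended command instead of restarting the dialog.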
Dialog speech often exhibits ambiguities. In addition, the semantic meaning or intent of the speech often cannot be inferred even when the literal meaning is understood. In language-based systems, such ambiguities can degrade system performance. For instance, the intent of the sentence “lower the volume” in a home entertainment environment can be ambiguous even though the literal meaning of the spoken words is well understood. In this example, the ambiguity arises because several appliances in the same household have a controllable volume, but the spoken sentence does not explicitly indicate which appliance's volume is to be lowered.
In a voice-based application, different semantic meanings of a dialog can be mapped to different actions. Misunderstanding the semantic meaning or the intent of a command often leads to errors that decrease system performance and cause user frustration and dissatisfaction.
Solutions to this problem include increasing the number of states in the statistical model or providing a more sophisticated error-handling model. See, e.g., U.S. 2012/0173244. However, increasing the number of states can negatively affect the performance of a voice dialog system that uses a statistical dialog model.