Dialog systems allow a user to interact with a system to perform tasks such as retrieving information, conducting transactions, planning, and other problem-solving tasks. A dialog system can use several input modalities for interaction with a user. Examples of input modalities include keyboards, touch screens, touch pads, microphones, gaze, and video cameras. Employing multiple modalities enhances user-system interactions in dialog systems. Dialog systems that use multiple modalities for user-system interaction are known as multimodal dialog systems. The user interacts with a multimodal dialog system through a dialog-based user interface. A set of interactions between the user and the multimodal dialog system is known as a dialog. Each interaction is referred to as a user turn.
A multimodal dialog system can be a verbal dialog system that accepts verbal inputs. A verbal dialog system includes an automatic speech recognition (ASR) modality, a handwriting modality, and any other modality that interprets user inputs into text inputs. Correct recognition of verbal inputs to a verbal dialog system is important for reducing errors in interpreting the verbal inputs and in the subsequent actions taken. The verbal input is recognized by assigning confidence scores to words in the verbal input. These word confidence scores are further used to generate a confidence score for the verbal input as a whole. The verbal input is accepted or rejected based on this overall confidence score.
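The flow from word confidence scores to an accept-or-reject decision can be sketched as follows. This is a minimal illustration, not the method of any particular system: the geometric mean used to combine word scores, the threshold of 0.5, and the function names `utterance_confidence` and `accept` are all assumptions introduced here.

```python
import math
from typing import List


def utterance_confidence(word_scores: List[float]) -> float:
    """Combine per-word confidence scores into one score for the whole input.

    Assumption: the geometric mean is used as the combining rule. The source
    text does not say how word scores are combined.
    """
    if not word_scores:
        return 0.0
    # Clamp each score away from zero so that log() is always defined.
    log_sum = sum(math.log(max(score, 1e-9)) for score in word_scores)
    return math.exp(log_sum / len(word_scores))


def accept(word_scores: List[float], threshold: float = 0.5) -> bool:
    """Accept the verbal input if its overall confidence reaches the threshold.

    Assumption: the threshold value 0.5 is purely illustrative.
    """
    return utterance_confidence(word_scores) >= threshold


# Hypothetical recognizer output: (word, confidence) pairs for one user turn.
hypothesis = [("book", 0.92), ("a", 0.85), ("flight", 0.40)]
scores = [confidence for _, confidence in hypothesis]

print(f"utterance confidence: {utterance_confidence(scores):.3f}")
print("accepted" if accept(scores) else "rejected")
```

A geometric mean penalizes a single very low word score more strongly than an arithmetic mean would, which is one reason it might be chosen; other combining rules would fit the same description.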
A known method of confidence scoring for use in speech understanding systems generates confidence scores at the phonetic, word, and utterance levels. The method generates utterance-level confidence scores based on word confidence scores. Another known method describes a robust semantic confidence score generator that calculates confidence scores at the concept level. The method applies various confirmation strategies directly at the concept level. Further, the method generates an N-best list for training a classifier.
Yet another known method integrates multiple knowledge sources for utterance-level confidence annotation in the Carnegie Mellon University (CMU) communicator spoken dialogue system. The method feeds features from speech recognition, parsing, and dialog management to a learner such as an artificial neural network (ANN). The CMU communicator is a telephone-based spoken dialog system that operates in the air-travel planning domain and provides the framework and the target platform for the development of the utterance-level confidence annotator.
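As a rough illustration of feeding features from several knowledge sources to a learner, the sketch below merges features into one vector and scores it with a single logistic unit. Everything specific here is an assumption: the feature names, the weights, and the bias are hypothetical and untrained, and the single logistic unit merely stands in for a trained ANN. It does not reproduce the features or the model of the CMU system.

```python
import math
from typing import Dict, List

# Hypothetical feature names, one per knowledge source (not from the source text).
FEATURE_ORDER = ["asr_acoustic_score", "parse_coverage", "dialog_state_match"]


def build_feature_vector(
    asr: Dict[str, float],
    parser: Dict[str, float],
    dialog: Dict[str, float],
) -> List[float]:
    """Merge features from the three sources into one fixed-order vector."""
    merged = {**asr, **parser, **dialog}
    return [merged[name] for name in FEATURE_ORDER]


def logistic_unit(features: List[float], weights: List[float], bias: float) -> float:
    """Single logistic unit: a stand-in for a trained neural network."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative, untrained parameters. A real learner would fit these to
# labeled utterances.
WEIGHTS = [2.0, 1.5, 1.0]
BIAS = -2.0

features = build_feature_vector(
    asr={"asr_acoustic_score": 0.8},
    parser={"parse_coverage": 0.9},
    dialog={"dialog_state_match": 1.0},
)
confidence = logistic_unit(features, WEIGHTS, BIAS)
print(f"utterance-level confidence: {confidence:.3f}")
```

Keeping the feature order fixed in one place makes it harder for the vector layout at scoring time to drift from the layout used when the parameters were fitted.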
However, each of the above methods has one or more of the following disadvantages. The generation of word confidence scores does not involve the use of semantics. If the utterance-level confidence score is low, the utterance is rejected. However, if the utterance-level confidence score is not low, a word graph with the rejected words is parsed to generate a parse score. Further, there is a lack of the training data required for training the classifier.