1. Field of the Invention
The present invention relates to spoken dialog systems and more specifically to exploiting information in an utterance for dialog act tagging.
2. Introduction
Speech acts or dialog acts, as understood by a person of ordinary skill in the art of spoken dialog systems, are characterizations of actions performed by a speaker during the course of a conversation or a dialog. This characterization provides a representation of conversational function and is especially useful in systems that require an automatic interpretation of dialog acts to facilitate a meaningful response or reaction. With the growing demand for integrated approaches to speech recognition, understanding, translation and synthesis, dialog act modeling has come to provide an important link in facilitating human-computer interactions.
Automatic interpretation of dialog acts has been addressed through two main approaches: first, the AI-style plan-inferential interpretation of dialog acts, which is designed through plan-inference heuristics; and second, the cue-based interpretation, which uses knowledge sources such as lexical, syntactic, prosodic and discourse-structure cues. Even though the plan-inference method can theoretically account for all variations in discourse, it is time-consuming in terms of manual design and computational overhead. In contrast, data-driven cue-based approaches are computationally friendly and offer a reasonably robust framework to model and detect dialog acts automatically.
Automatic data-driven dialog act tagging is typically statistical in nature and uses various machine learning algorithms known to skilled artisans. For example, machine learning algorithms useful for automatic data-driven dialog act tagging can include n-gram models, hidden Markov models, maximum entropy models, neural networks, etc. Typically, these statistical models use either a flat chunk-and-label paradigm or a hierarchical grammar-based framework to model the dependencies and relations among dialog turns. These statistical models can also exploit multiple knowledge sources in the form of lexical (word identity, keywords), syntactic (parts-of-speech, syntactic structure), prosodic (pitch contour, pitch accents, boundary tones) or discourse structure (dialog history) cues as features in the identification of dialog acts. In particular, prosody, the study of rhythm, intonation, and related attributes in speech, has been a very useful feature in automatic data-driven dialog act tagging. Prosody is domain-independent and can help to describe changes in the syllable length, loudness, pitch, and formant structure of speech sounds, as well as the tone, intonation, rhythm, and lexical stress of speech sounds. Prosody can also help to describe changes in the speech articulators, for example, the velocity and range of motion in articulators like the jaw and tongue, along with quantities like the air pressure in the trachea and the tensions in the laryngeal muscles. Prosody has received a fair amount of attention in cue-based dialog act tagging. Prosodic features such as parameterizations of the pitch contour, duration of segments, and energy, as well as categorical representations of pitch accents and boundary tones, have been successfully used to improve dialog act tagging.
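By way of illustration only, the lexical cue-based tagging described above can be sketched with one of the simplest members of the n-gram family: a unigram model with add-one smoothing that assigns each utterance the dialog act tag with the highest posterior score. The function names `train` and `tag`, the tag labels, and the toy utterances in `TOY_EXAMPLES` are hypothetical and are not drawn from any corpus or system discussed herein.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: (word sequence, dialog act tag). Not from any real corpus.
TOY_EXAMPLES = [
    ("what time is it".split(), "question"),
    ("is it open".split(), "question"),
    ("it is open now".split(), "statement"),
    ("the store is open".split(), "statement"),
    ("okay thanks".split(), "acknowledgement"),
]

def train(examples):
    """Collect per-tag utterance counts, per-tag word counts, and the vocabulary."""
    tag_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, tag_label in examples:
        tag_counts[tag_label] += 1
        for w in words:
            word_counts[tag_label][w] += 1
            vocab.add(w)
    return tag_counts, word_counts, vocab

def tag(words, model):
    """Return the tag maximizing log P(tag) + sum of log P(word | tag)."""
    tag_counts, word_counts, vocab = model
    total = sum(tag_counts.values())
    best, best_score = None, -math.inf
    for t in sorted(tag_counts):  # sorted for a deterministic tie-break
        # Add-one smoothing: every vocabulary word gets one pseudo-count.
        denom = sum(word_counts[t].values()) + len(vocab)
        score = math.log(tag_counts[t] / total)
        for w in words:
            score += math.log((word_counts[t][w] + 1) / denom)
        if score > best_score:
            best, best_score = t, score
    return best

model = train(TOY_EXAMPLES)
print(tag("what time".split(), model))
print(tag("okay".split(), model))
```

A practical tagger would typically add higher-order n-grams, syntactic and prosodic features, and dialog history, and would be trained on an annotated corpus rather than a handful of invented utterances.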
Prosodic features have been used in dialog act tagging in three major ways: (i) raw/normalized pitch contour, duration and energy, or transformations thereof, (ii) discrete categorical representations of prosody through pitch accents and boundary tones, and (iii) parametric representation of pitch contour.
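The three styles of prosodic representation listed above can be illustrated with a minimal sketch operating on a pitch (F0) contour given as a list of per-frame values in Hz. The function names, the choice of z-score normalization, the straight-line parameterization, and the rise/fall rule are all assumptions made for illustration; the systems referred to herein may use different normalizations, richer contour parameterizations, and annotated pitch accent and boundary tone inventories.

```python
from statistics import mean, pstdev

def z_normalize(f0):
    """(i) Normalized contour: zero mean and unit variance over the given frames."""
    mu, sigma = mean(f0), pstdev(f0)
    if sigma == 0:
        return [0.0 for _ in f0]
    return [(x - mu) / sigma for x in f0]

def linear_fit(f0):
    """(iii) Parametric contour: least-squares line, returned as (slope, intercept).

    Time is measured in frame indices 0, 1, ..., len(f0) - 1.
    """
    n = len(f0)
    t_mean = (n - 1) / 2
    y_mean = mean(f0)
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(f0))
    slope = sxy / sxx if sxx else 0.0
    return slope, y_mean - slope * t_mean

def boundary_label(f0, tail=3):
    """(ii) Crude categorical label from the slope over the final `tail` frames."""
    slope, _ = linear_fit(f0[-tail:])
    return "rise" if slope > 0 else "fall"

contour = [100.0, 110.0, 120.0, 130.0, 140.0]  # hypothetical rising contour in Hz
print(z_normalize(contour))
print(linear_fit(contour))
print(boundary_label(contour))
```

Each function yields values that could be appended to an utterance's feature vector alongside lexical and syntactic cues.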
Prosodic decision trees have been used to model the raw/normalized prosodic features. In this context, duration, pauses, pitch and speaking rate features have been used as a prosodic feature vector. Such prosodic decision trees have resulted in dialog act detection accuracies of 38.9% on the Switchboard-DAMSL dataset, which has been extensively used for dialog act tagging. Of course, a dialog act detection accuracy of 38.9% is only marginally better than chance (35%). Using the original word transcripts in an n-gram modeling framework with ‘offline’ optimal decoding has resulted in dialog act detection accuracies of 72%. In other cases, symbolic representations of prosodic events have been employed as additional features in dialog act tagging within speech-to-speech translation systems.
Parametric representations of the pitch contour have also been employed in dialog act classification. On a subset of the Maptask corpus (DCIEM Maptask corpus), which has been used extensively for dialog act tagging, accuracies of 69% have been achieved using the parametric representation of intonation. Prosodic features have been shown to improve dialog act tagging accuracy marginally for automatically recognized transcripts, as prosodic features offer more discrimination compared to possibly incorrect lexical information from the automatic speech recognizer (ASR). In sum, the incorporation of prosodic features in dialog act tagging has not resulted in significant improvements over dialog act tagging based only on lexical and syntactic features.
Dialog act tagging has been successfully integrated in speech recognition, speech understanding, text-to-speech synthesis and speech translation systems. Several corpora with domain-specific annotation schemes have been created to facilitate automatic learning of dialog acts. A significant problem persists, however, because these corpora must be hand-labeled for each utterance with a domain-specific dialog act tag set.
Accordingly, what is needed in the art is an improved method for utilizing information in audio to improve dialog act tagging.