1. Field of the Invention
The invention relates to the domain of speech-based human-machine interaction. More precisely, it relates to improving a spoken dialog system by incorporating the prosodic information contained in the speech signal.
2. Description of Related Art
A spoken dialog system enables communication between a human and a machine by means of speech. Its main components commonly comprise at least one of a speech recognizer, a text-to-speech (TTS) system, a response generator, a dialog manager, a knowledge base and a natural language understanding module.
Human speech consists not only of the words spoken but also of how they are spoken. This is reflected by prosody, i.e. the rhythm, speed, stress, structure and/or intonation of speech, each of which can be used alone or in combination as a prosodic cue. Other features of an utterance can also serve as prosodic cues.
Such prosodic cues play a very important role in human-human communication: e.g., they structure utterances into phrases (elements of a clause), emphasize novel information in an utterance, and differentiate between questions and statements.
Different approaches to extracting prosodic cues have been proposed. Nevertheless, prosodic information is rarely used in spoken dialog systems.
In spoken language analysis, an “utterance” typically denotes the smallest unit of speech. It is generally, but not always, bounded by silence.
In many situations, current speech interfaces do not perform as desired or expected by a user. They frequently misunderstand words, especially when background noise is present or when the speaking style differs from the one they expect. In particular, they are insensitive to the prosody of speech, i.e. its rhythm, speed, stress, structure and/or intonation.
Speech interfaces are already an important component of current cars, but also of other areas such as, e.g., the control of mobile telecommunication devices, and their importance will increase further in the future.
However, current speech interfaces are perceived as unintuitive and error-prone. One reason for this is that such systems do not analyze speech as humans do; in particular, they are “blind” to prosodic cues. Integrating prosodic cues endows the systems with capabilities that allow them to better understand the user's goals; in particular, it renders the systems more intuitive and hence more robust.
Stressing the most relevant words is a very natural way of talking. When a system takes this fact into account, the human-machine dialog becomes much more natural and hence easier for the human.
As detailed below, this can be especially beneficial in situations where clarifications from the human are necessary. When clarifications occur or are necessary, current systems are usually unaware of which part of the utterance was misunderstood. When subsequently interpreting the human's correction, they hence decode the utterance in the same way as they decoded the original one. Humans, however, tend to emphasize the misunderstood term in the correction. By endowing the system with the capability to extract this emphasis, i.e. the prominence, the system obtains additional information and the dialog between human and machine improves.
There is a multitude of approaches to extracting prosodic cues in the context of speech processing (cf. documents [1-3]). When used to improve automatic speech understanding, this information was extracted from speech corpora and used to improve an automatic transcription of these corpora (cf. documents [4-9]). A very recent system uses prosody to improve recognition scores in broadcast news recognition by scoring words differently depending on their pitch accent [10]. The prosodic cues of the analyzed broadcast news are in this case determined solely on the basis of the fundamental frequency.
One prominent case of prosody use is the Verbmobil project (1993-2000) [11]. The target of the project was to allow people speaking different languages to communicate verbally with each other with the help of a computer. To this end, an utterance in the source language was recognized, the recognized utterance was translated into the target language, and the translation was then re-synthesized and output.
Prosodic cues were used to disambiguate different sentence meanings based on word prominence information and to guide the parsing of the sentence by using information on the prosodic phrasing. The cues deployed were based on fundamental frequency, intensity, and duration.
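Two of the acoustic cues named above, intensity and fundamental frequency, can be estimated directly from the waveform. Purely as an illustration, and not as part of any of the cited systems, the following self-contained sketch computes frame-wise RMS energy as an intensity cue and estimates the fundamental frequency by autocorrelation peak picking; all function names are illustrative, and a synthetic 200 Hz tone stands in for voiced speech.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    # Split the signal into overlapping analysis frames.
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def intensity(frames):
    # Root-mean-square energy per frame, a simple intensity cue.
    return np.sqrt(np.mean(frames ** 2, axis=1))

def f0_autocorr(frame, sr, fmin=75.0, fmax=400.0):
    # Estimate the fundamental frequency of one frame:
    # pick the autocorrelation peak within the plausible pitch-lag range.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

sr = 16000
t = np.arange(sr) / sr                    # one second of audio
x = np.sin(2 * np.pi * 200.0 * t)         # synthetic 200 Hz "voiced" signal

frames = frame_signal(x, frame_len=640, hop=320)  # 40 ms frames, 20 ms hop
rms = intensity(frames)
f0 = np.array([f0_autocorr(f, sr) for f in frames])
print(round(float(np.median(f0)), 1))     # → 200.0
```

Duration cues, by contrast, require a segmentation of the signal into phones or syllables and are therefore typically derived from the recognizer's alignment rather than from the raw waveform.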
Other studies showed that the visual channel also conveys prosodic information, especially through the mouth region, the eyebrows and head movements (cf. documents [12-16]). Only few studies exist in which visual prosodic information is automatically extracted, using markers placed on a speaker's face (cf. documents [17-18]).
Document U.S. Pat. No. 7,996,214 describes a system and method for exploiting prosodic features for dialog act tagging. The dialog act (question, hesitation, etc.) is determined based on prosodic features.
Document US 2006/0122834 A1 describes an emotion detection device and a method for use in distributed systems. The emotional state of the user is inferred.
Document U.S. Pat. No. 7,778,819 describes a method and apparatus for predicting word prominence in speech synthesis. Prominence is estimated from text and used in speech synthesis.