In dialog where one individual is giving instructions to a second individual, the second individual will often repeat part or all of a previous instruction, sometimes merely to acknowledge correct receipt of the information but also sometimes to act as an abbreviated query which, in natural dialog, would result in repetition or clarification of the previous instruction. In English, this acknowledgment/query distinction is made with reference to intonation cues, typically a rising or falling pitch contour.
In an automated dialog system, even given accurate speech recognition, confusion would result if such a user query is ignored, or if a user's acknowledging statement is misunderstood and causes the system to needlessly repeat a previous instruction. One way of tackling this problem is to constrain the user's responses, and this is the approach taken in most current systems. However, if one wishes to move to natural open dialog, then besides trying to determine what is said, one should also pay some attention to how it is said.
To illustrate, some possible interactions for a service which provides road navigation directions to users over a cellular phone are shown below. Assume the user has accessed the system and given the details of present location and destination. The system will then proceed to give directions. Ideally, during this process, the system should be able to deal with user queries. For example, in response to the instruction "Turn right at main street", the system may have to distinguish between the responses:
(a) Do I turn right at main street? or
(b) so RIGHT? at main, or
(c) right at MAIN? or
(d) Okay, right at main.
In the first example, correct word recognition would result in the user's response being treated as a query. However, in the other examples, a correct response requires dealing with the intonational cues present in the speech signal. In fact, to correctly respond to (b) and (c), a system should not only decide that there is a query but also determine which item is being queried, i.e., is it the direction or the location that is being queried?
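The distinction drawn in examples (b) through (d) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not a method described in this document: it assumes that a pitch tracker has already produced a per-word sequence of pitch values in Hz, it treats a rising least-squares slope as the sign of a query, and the names `contour_slope`, `classify_response`, and `rise_threshold` (along with the threshold value) are hypothetical.

```python
from statistics import mean


def contour_slope(f0_hz):
    """Least-squares slope of a pitch contour, in Hz per frame."""
    n = len(f0_hz)
    if n < 2:
        return 0.0  # too few frames to estimate a trend
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(f0_hz)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, f0_hz))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def classify_response(words, rise_threshold=1.0):
    """Label a response as a query or an acknowledgment.

    words: list of (word, pitch contour) pairs; the contours are assumed
    to come from an external pitch tracker.
    rise_threshold: hypothetical cutoff (Hz per frame) above which a
    word's contour counts as rising.
    Returns ("query", word) for the word with the steepest rise, or
    ("acknowledgment", None) if no word rises above the threshold.
    """
    best_word, best_slope = None, rise_threshold
    for word, f0 in words:
        slope = contour_slope(f0)
        if slope > best_slope:
            best_word, best_slope = word, slope
    if best_word is None:
        return ("acknowledgment", None)
    return ("query", best_word)


# Made-up contours for responses (c) "right at MAIN?" and (d) "Okay, right at main."
response_c = [("right", [120, 118, 117, 115]),
              ("at", [115, 114, 113]),
              ("main", [110, 120, 135, 150])]
response_d = [("okay", [130, 125, 120]),
              ("right", [120, 118, 116]),
              ("at", [115, 114, 113]),
              ("main", [112, 108, 104, 100])]

print(classify_response(response_c))
print(classify_response(response_d))
```

With these invented contours, only "main" in response (c) rises, so it is reported as the queried item, while response (d) falls throughout and is treated as an acknowledgment. A practical system would also need speaker normalization and handling of unvoiced frames, which this sketch leaves out.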
The following papers generally describe methods and apparatus for classifying speech using intonational features: Wightman, C. W. and Ostendorf, M. "Automatic Recognition of Intonational Features" and Schmandt, C. "Understanding Speech Without Recognizing Words."