1. Technical Field
The present disclosure relates to speech recognition and more specifically to localized error detection (LED) and targeted clarification in a spoken language interface system.
2. Introduction
Most natural language dialog systems, upon encountering an error or misinterpretation, employ generic clarification strategies that ask a speaker to repeat or rephrase an entire utterance. Human speakers, on the other hand, employ diverse clarification strategies in human-human dialog. Further, human speakers of different languages or cultures often use different types of clarification strategies. Clarification questions can be categorized into generic and targeted clarification questions. Consider the following exchange:
Speaker A: When did the problems with [power] start?
Speaker B: The problem with what?
Speaker A: Power.
Speaker B asks a targeted question that repeats the part of the utterance recognized correctly as context for the portion believed to have been misrecognized or simply unheard. Reprise questions are a type of targeted clarification question that echoes the interlocutor's utterance, as in Speaker B's query above. In human-human dialogs, reprise questions are much more common than non-reprise questions.
Generic questions are simply requests for a repetition or rephrasing of a previous utterance, such as “What did you say?” or “Please repeat.” Crucially, such questions do not include contextual information from the previous utterance. Targeted questions, on the other hand, explicitly distinguish the portion of the utterance that the system believes has been recognized from the portion it believes requires clarification. Besides requesting information, a clarification question also helps ground communication between two speakers by providing feedback that indicates which parts of an utterance have been understood. In the above example, Speaker B has failed to hear the word “power” and so constructs a clarification question that uses a portion of the correctly understood utterance to query the portion that was not understood. Speaker B's targeted clarification question signals the location of the recognition error to Speaker A. The targeted clarification question achieves grounding by indicating that the hearer understands the speaker's request for information about ‘the problem’ but has missed the problem description. In this case, Speaker A is then able to respond with a minimal answer to the question, filling in only the missing information. Current spoken dialog systems do not handle this type of error recovery in a manner comparable to that of human speakers.
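The distinction between generic and targeted clarification questions can be illustrated with a minimal sketch. It is illustrative only and is not the method of the present disclosure. It assumes that a recognizer supplies a confidence score for each word, and the function name `clarification_question`, the 0.5 threshold, and the three-word context window are hypothetical choices made for this example.

```python
from typing import List, Optional


def clarification_question(
    words: List[str],
    confidences: List[float],
    threshold: float = 0.5,   # hypothetical cutoff for a "misrecognized" word
    context_size: int = 3,    # hypothetical number of context words to echo
) -> Optional[str]:
    """Return a reprise-style targeted question, a generic question,
    or None when every word is confidently recognized."""
    if len(words) != len(confidences):
        raise ValueError("words and confidences must have the same length")

    # Indices of words the (assumed) recognizer is unsure about.
    low = [i for i, c in enumerate(confidences) if c < threshold]
    if not low:
        return None  # nothing to clarify

    start, end = low[0], low[-1]
    is_contiguous = low == list(range(start, end + 1))

    # Without a single localized error span and some preceding context,
    # fall back to a generic request for repetition.
    if not is_contiguous or start == 0:
        return "Please repeat."

    # Echo the correctly recognized words just before the error span.
    context = words[max(0, start - context_size):start]
    return " ".join(context).capitalize() + " what?"


words = ["when", "did", "the", "problems", "with", "power", "start"]
scores = [0.9, 0.9, 0.9, 0.9, 0.9, 0.2, 0.9]
print(clarification_question(words, scores))  # The problems with what?
```

In this sketch the targeted question is produced only when the low-confidence words form one contiguous span preceded by recognized context; otherwise the generic request is used, mirroring the behavior attributed above to most current dialog systems.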