1. Technical Field
The present disclosure relates to incremental speech recognition and more specifically to returning intermediate results from speech recognition while preparing a speech response.
2. Introduction
Spoken dialog systems often operate in a turn-based configuration, in which the user and the system take turns communicating with one another. For example, the system starts playing a prompt and stops when the prompt finishes or when user speech is detected. The system then waits for a reply or, if the user is already speaking, waits until the user stops speaking. If the user spoke, the system analyzes the speech and prepares a response. If no user speech was detected, the system can repeat the prompt, respond with a new prompt, or continue waiting.
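The turn-taking cycle described above can be sketched as a small frame-by-frame state machine. This is an illustrative sketch only, not the system of this disclosure. It assumes a hypothetical voice activity detector that reports, for each time frame, only whether user speech is present. The names `State` and `run_turn` and the parameters `prompt_len`, `end_silence`, and `max_wait` are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    PROMPTING = auto()   # system is playing a prompt
    LISTENING = auto()   # prompt ended or was interrupted; waiting on the user
    RESPONDING = auto()  # user stopped speaking; system prepares a response
    TIMEOUT = auto()     # no user speech; re-prompt, new prompt, or keep waiting


def run_turn(vad, prompt_len, end_silence=3, max_wait=10):
    """Simulate one system/user exchange in a turn-based dialog.

    vad: one boolean per frame, True when a (hypothetical) voice activity
        detector reports user speech in that frame.
    prompt_len: number of frames the prompt takes to play.
    end_silence: consecutive silent frames that end the user's turn.
    max_wait: silent frames to wait for a reply before timing out.
    Returns (state, log); state is PROMPTING or LISTENING if the frames
    run out before the turn resolves.
    """
    log = []
    state = State.PROMPTING
    heard = False  # has the user spoken since listening began?
    silence = 0    # consecutive silent frames while listening
    for t, speech in enumerate(vad):
        if state is State.PROMPTING:
            if speech:
                # Any detected speech stops the prompt, whatever was said.
                log.append((t, "barge-in"))
                state = State.LISTENING
            elif t >= prompt_len - 1:
                log.append((t, "prompt done"))
                state = State.LISTENING
                continue  # start listening on the next frame
            else:
                continue  # prompt still playing
        # LISTENING: wait for a reply, or for the user to stop speaking.
        if speech:
            heard, silence = True, 0
        else:
            silence += 1
        if heard and silence >= end_silence:
            log.append((t, "user done"))
            return State.RESPONDING, log
        if not heard and silence >= max_wait:
            log.append((t, "no speech"))
            return State.TIMEOUT, log
    return state, log


if __name__ == "__main__":
    # One speech frame (e.g. a cough) during a 10-frame prompt.
    print(run_turn([False, False, True, False, False, False], prompt_len=10))
```

Because the assumed detector reports only the presence of speech, the single speech frame at t=2 in the usage example stops the prompt just as a genuine reply would. The decision never considers what was said.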
A “barge-in” occurs when the user begins speaking before a prompt has finished. Barge-ins are problematic for many spoken dialog systems because the system starts and stops automatically upon detecting user speech, without regard for what the user has said. This results in false barge-ins, where the system stops a prompt that it should have continued, and barge-in stutter, where the system and the user start talking at the same time and then both stop.
One proposed solution to barge-in and related problems is incremental speech processing. With incremental speech processing, the spoken dialog system runs speech recognition continuously while the user is speaking and makes turn-taking decisions from the resulting sequence of partial speech recognition results. One problem with this approach is that partial speech recognition results are inherently unstable: as more speech arrives, the recognizer constantly revises its hypotheses, so the partial results expand, shift, and are often spurious. Waiting for more speech before producing partial results can, in some instances, add stability. Unfortunately, waiting also introduces latency, which compromises the effectiveness of incremental processing.
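The stability/latency trade-off described above can be illustrated with a simple hold-based filter. The filter releases a partial result only after that result has gone unrevised for a fixed number of recognizer updates. This is a hypothetical technique chosen for illustration and is not presented as the method of this disclosure. The function `stable_partials`, its `hold` parameter, and the sample hypotheses in `PARTIALS` are all made up.

```python
def stable_partials(partials, hold):
    """Emit a partial result only once it has stayed unchanged for `hold`
    further updates; hold=0 emits every new hypothesis immediately.

    partials: the recognizer's partial hypothesis after each update.
    Returns a list of (update_index, text) emissions.
    """
    emitted = []
    last_emitted = None
    candidate = None
    age = 0  # updates for which `candidate` has gone unrevised
    for i, hyp in enumerate(partials):
        if hyp == candidate:
            age += 1
        else:
            candidate, age = hyp, 0  # revision: restart the stability clock
        if age >= hold and candidate != last_emitted:
            emitted.append((i, candidate))
            last_emitted = candidate
    return emitted


# Made-up partial results that expand, shift, and include a spurious word.
PARTIALS = ["eye", "I want", "I went to", "I want to", "I want to",
            "I want to go", "I want to go", "I want to go"]

if __name__ == "__main__":
    for hold in (0, 1, 2):
        print(hold, stable_partials(PARTIALS, hold))
```

With `hold=0`, all five distinct hypotheses are emitted, including the spurious "eye" and the shifted "I went to", and the first emission comes at update 0. With `hold=2`, only "I want to go" is emitted, but not until update 7. The added stability is paid for in latency.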