1. Technical Field
The present disclosure relates to speech dialog systems and more specifically to using a multi-agent architecture to determine what utterance should be output by the speech dialog system and when/if the utterance should be output.
2. Introduction
Spoken dialogue systems are quickly becoming a component of everyday life, and turn-taking, the transition of speaking roles in a dialogue, is a critical aspect of such systems. Research in human-human turn-taking has shown that turn-release in human-human dialogue can be predicted and that both prosodic and contextual features are important for predicting turn-transitions. Consistent with this, much work has emphasized “smooth” human-machine turn-taking, where the system should not plan to speak until the user has finished speaking. This work has focused on predicting the user's turn-release from either contextual or prosodic features. Other work has focused on systems that can explicitly overlap system speech with user speech to either preempt the full user utterance or produce a backchannel. These approaches leverage incremental speech recognition and some aspects of reinforcement learning. However, with some exceptions, the majority of machine turn-taking approaches have focused on modeling the surface features of human turn-taking and initiative, such as speaking in turns and how to react to interruptions, rather than the relative importance of the utterances being received and delivered.