1. Field of the Invention
The present invention is related to the field of Automatic Speech Recognition and Text To Speech systems, and more specifically to such systems, software and methods for improving “Kill on Barge-in” response time.
2. Description of the Related Art
Automatic Speech Recognition and Text To Speech (ASRTTS) systems perform a dual function. They receive voice input from a user, recognize it, and convert it into text. In addition, they “read out” text, by converting it to voice for the benefit of the user.
Sometimes a user will start speaking while text is being read out, which is called a “Barge-In” event. The ASRTTS system must recognize the event and stop reading out text, to better allow the user to continue speaking.
Stopping in this instance is called “kill on barge in”. Its response time is the duration from the moment the barge-in event is detected until the user no longer hears a voice.
The kill on barge in response time should be as small as possible. In practice it is not, due to various factors.
The response time is worse in ASRTTS systems where various components are distributed, and connected to each other via a network. One such example is described below.
Referring to FIG. 1, a distributed ASRTTS system is described. A Voice Interface Device VID includes a codec CC, and a jitter buffer JB. It may also include a speaker SR for playout, and a microphone MP for receiving voice. The barge-in comes from microphone MP.
The ASRTTS system of FIG. 1 also includes a Voice Browser VB. An ASR/TTS application APPN1 may reside in Voice Browser VB.
The ASRTTS system of FIG. 1 also includes one or more of a Text To Speech (TTS) Media Server TTSMSP, a TTS Engine ETTSP, an Automatic Speech Recognition (ASR) Media Server ASRMS, and an ASR Engine EASR.
The ASRTTS system of FIG. 1 is distributed in that at least two of its components are separated by a network NT. In the embodiment of FIG. 1, at least the TTS Engine ETTSP is separated from Voice Interface Device VID by network NT. Network NT may be any network, such as the internet. Network NT is preferably configured under a Voice over Internet Protocol (VoIP). A connection SA1 is established between TTS Engine ETTSP and Voice Interface Device VID through network NT. Connection SA1 is also known as the media path. For text to speech, audio packets are sent from TTS Engine ETTSP to Voice Interface Device VID via connection SA1.
The problem with the distributed ASRTTS system can now be seen more clearly. FIG. 1 represents the instant at which, due to the detection of the barge-in event, TTS Engine ETTSP has stopped transmitting audio packets.
Even at that instant, however, it is already too late for some packets. Some audio packets APB are already stored in jitter buffer JB, and will be played out, thus prolonging the effective response time. And some straggler audio packets APS are already in network NT, within connection SA1. When they reach jitter buffer JB, they will be played out, thus further prolonging the effective response time.
Packets APB and APS are thus latent with respect to the operation of stopping the generation of the audio packets. This latency is inherent in the use of network NT for a distributed ASRTTS system. It is desired to improve the effective response time.
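The effect of the latent packets on the effective response time can be illustrated with a minimal sketch. It is only a model under stated assumptions, not part of the system of FIG. 1: the function name residual_playout_ms, the 20 ms of audio per packet, and the arrival times are all hypothetical, and the jitter buffer is assumed to play packets back to back in arrival order. Time zero is the instant at which the TTS engine stops transmitting.

```python
# Illustrative model only: names and numbers are assumptions, not from FIG. 1.
PACKET_MS = 20  # assumed amount of audio carried by one packet


def residual_playout_ms(buffered_packets, straggler_arrivals_ms,
                        packet_ms=PACKET_MS):
    """Return the time (ms after transmission stops) at which the user
    last hears audio.

    buffered_packets: count of packets (APB) already in the jitter buffer.
    straggler_arrivals_ms: arrival times of packets (APS) still in the
        network when transmission stopped.
    """
    # Packets already buffered play out back to back starting at time 0.
    playout_end = buffered_packets * packet_ms
    # Each straggler starts playing once it has arrived and the previous
    # packet has finished, whichever is later.
    for arrival in sorted(straggler_arrivals_ms):
        start = max(playout_end, arrival)
        playout_end = start + packet_ms
    return playout_end


if __name__ == "__main__":
    # Three buffered packets plus three stragglers in flight.
    print(residual_playout_ms(3, [10, 30, 90]))  # prints 120
```

In this model, even though generation of audio packets stops at time zero, the user continues to hear audio for 120 ms, because six packets were already past the point at which they could be recalled.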