1. Field of the Invention
This invention relates generally to multi-state barge-in models and, more particularly, to a method and a system for discriminative training of multi-state barge-in models for speech processing.
2. Introduction
Speech processing technologies have since their inception been involved, in some way or another, with the problem of detecting speech, whatever the acoustic environment. The problem of accurately distinguishing speech from the background is still an active area of research.
In practice there are three different applications involving speech detection. They differ in their intent and in the mechanisms used to achieve their targets. The first application, commonly referred to as Voice Activity Detection (VAD), determines whether speech is present. A VAD application tries to detect every non-speech segment, for example a short pause, within a continuous utterance. Another application, most commonly encountered in automatic speech recognition (ASR), is endpointing. Endpointing is important for detecting the beginning and the end of an utterance; the ASR system is relied on to determine internally whether there are any utterance-internal pauses.
Barge-in, the third application, is a unique speech detection problem that occurs only in dialog-based applications. Barge-in happens when a user of an automated dialog system attempts to input speech during the playback or synthesis of a prompt generated by the dialog system. In this situation, two things are expected to occur virtually instantaneously. First, the prompt is immediately terminated, both to indicate to the user that the system is listening and to allow uninterrupted recognition of the user's utterance. At the same time, the ASR engine starts processing the accumulated speech, beginning a short amount of time prior to the detected barge-in. In the case of barge-in, the system faces only a relatively small subset of the problems faced by VAD systems. However, errors can have a significant impact on the perceived usability of the system and might cause it to be abandoned. A false barge-in, which happens when the system incorrectly believes that the user has input speech, will terminate the prompt. This termination leaves the user without proper guidance for providing the appropriate input to the system, and it can have a long-term effect, diverting the dialog away from the intended operation for many turns. Conversely, if in trying to minimize false alarms the system becomes less sensitive to speech input and fails to detect a barge-in, the user may find it uncomfortable to speak while the prompt is still active. The user's discomfort makes the delivery of the speech input unnatural, which degrades ASR performance. In addition, speaking over an active prompt often leads to unwanted echo and consequently poor recognition performance. This assumes that the ASR system is left active all the time and is not initiated by the barge-in detection; if it were initiated by the barge-in detection, the speech would be lost to the system.
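The two simultaneous actions described above (terminating the prompt and handing the ASR engine audio that begins shortly before the detected barge-in) can be illustrated with a small pre-roll buffer. The following is only a minimal sketch under stated assumptions, not the method of the invention: the class name BargeInBuffer, the parameter preroll_frames, and the per-frame is_speech flag supplied by some external detector are all hypothetical.

```python
from collections import deque


class BargeInBuffer:
    """Hypothetical sketch: keep a short pre-roll of audio frames so the
    ASR engine can start slightly before the detected barge-in."""

    def __init__(self, preroll_frames=30):
        # Ring buffer holding only the most recent frames (assumed size).
        self.preroll = deque(maxlen=preroll_frames)
        self.prompt_playing = True   # the system prompt is being played
        self.barged_in = False       # no barge-in detected yet
        self.asr_frames = []         # frames handed to the ASR engine

    def push(self, frame, is_speech):
        """Feed one audio frame plus an externally supplied speech decision."""
        if self.barged_in:
            # After barge-in, every frame goes straight to the ASR engine.
            self.asr_frames.append(frame)
            return
        self.preroll.append(frame)
        if is_speech and self.prompt_playing:
            # Barge-in: stop the prompt and flush the pre-roll to the ASR
            # engine so the start of the utterance is not lost.
            self.prompt_playing = False
            self.barged_in = True
            self.asr_frames.extend(self.preroll)
            self.preroll.clear()
```

In this sketch, a larger preroll_frames value protects more of the utterance onset at the cost of passing more pre-speech audio, possibly including prompt echo, to the recognizer.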
The ideal barge-in response requires minimum latency, responding to the speech input as quickly as possible, while also requiring a high level of accuracy in detecting speech. These two criteria conflict and are often traded off against each other.
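The trade-off between latency and accuracy can be made concrete with a deliberately simple decision rule, assumed here purely for illustration: declare barge-in only after a given number of consecutive frames have been flagged as speech. The function name detect_barge_in, the parameter min_run, and the example flag sequence are hypothetical and are not taken from the invention.

```python
def detect_barge_in(speech_flags, min_run):
    """Return the index of the frame at which barge-in is declared,
    or None if it never is.

    Illustrative rule (an assumption): barge-in fires once min_run
    consecutive frames have been flagged as speech.
    """
    run = 0
    for i, flag in enumerate(speech_flags):
        run = run + 1 if flag else 0   # length of the current speech run
        if run >= min_run:
            return i
    return None


# Frame 1 is a spurious one-frame blip; real speech starts at frame 4.
flags = [0, 1, 0, 0, 1, 1, 1, 1]

fast = detect_barge_in(flags, min_run=1)   # fires on the blip (false barge-in)
safe = detect_barge_in(flags, min_run=3)   # ignores the blip, fires 2 frames late
```

With min_run=1 the rule responds immediately but is triggered by the spurious frame, whereas min_run=3 rejects the spurious frame but declares barge-in two frames after the true speech onset.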
The overall dialog system scenario implies, to a large extent, that barge-in performance is tightly coupled with the ASR system. In essence, flawless barge-in performance that negatively impacts ASR performance is detrimental to overall system performance, and vice versa. In many ways the best barge-in system is the ASR system itself, with the serious drawback that its latency is too long. Accordingly, what is needed in the art is a way to match barge-in performance to ASR performance, minimizing such possible differences by using ASR technology to provide the barge-in processing.