Spoken language understanding systems have been deployed in numerous speech dialog applications that require some form of interaction between humans and machines. The interaction is usually controlled by the machine, which follows a pre-scripted dialog to ask questions of the users, attempts to identify the intended meaning from their answers (expressed in natural language), and takes actions in response to the extracted meanings. For example, FIG. 1 shows the basic functional arrangement of one specific form of a generic dialog system as described more fully in U.S. Pat. No. 7,424,428, incorporated herein by reference.
One important task in constructing effective speech dialog systems is referred to as “endpointing,” which is also known as Voice Activity Detection (VAD). Ideally, the VAD should be insensitive to background noises of various kinds, including background speech, but sensitive to speech directed at the dialog system, especially in order to allow barge-in, i.e., detecting the user's speech while a system prompt is playing. Various techniques exist based on energy and frequency, but these are still not optimal.
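By way of illustration only, an energy-based technique of the kind mentioned above might be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular prior system: the function names `frame_energies` and `energy_vad`, the 160-sample frame length, the -30 dB threshold, and the 8 kHz sampling rate used in the example are all hypothetical choices made for the illustration, and samples are assumed to be normalized to the range -1.0 to 1.0.

```python
import math


def frame_energies(samples, frame_len):
    """Return the mean squared amplitude of each complete frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def energy_vad(samples, frame_len=160, threshold_db=-30.0):
    """Mark a frame as active (True) when its level exceeds threshold_db.

    Levels are in dB relative to a full-scale amplitude of 1.0.
    The default frame length and threshold are illustrative assumptions.
    """
    decisions = []
    for energy in frame_energies(samples, frame_len):
        level_db = 10.0 * math.log10(energy) if energy > 0 else float("-inf")
        decisions.append(level_db > threshold_db)
    return decisions


# Example: two frames of silence followed by two frames of a 200 Hz tone
# sampled at an assumed rate of 8 kHz.
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(320)]
flags = energy_vad(silence + tone)  # [False, False, True, True]
```

Because the decision in this sketch depends only on signal level, any sufficiently loud sound, including background speech, would be marked as active, which is consistent with the observation that such techniques are not optimal.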