Intelligent automated assistants (or virtual assistants) provide a beneficial interface between human users and electronic devices. Such assistants allow users to interact with devices or systems using natural language in spoken and/or text forms. For example, a user can access the services of an electronic device by providing a spoken user request to a virtual assistant associated with the electronic device. The virtual assistant can interpret the user's intent from the spoken user request and operationalize the user's intent into tasks. The tasks can then be performed by executing one or more services of the electronic device, and a relevant output can be returned to the user in natural language form.
Often, a spoken user request is commingled with various background noises. The background noises can include, for example, spurious conversation, music, mechanical noise, and environmental noise. To interpret the spoken user request, the virtual assistant can determine the beginning and ending of user speech within the received audio input. Detecting the beginning and ending of user speech is referred to as start-pointing and end-pointing, respectively. Start-pointing and end-pointing can be used to identify the portion of audio input that contains the spoken user request. End-pointing can also be used to determine when to stop receiving audio input. In order for a virtual assistant to interpret and process audio input quickly and accurately, robust start-pointing and end-pointing are desired.
Conventional end-pointing algorithms rely on energy features, such as short-time energy and zero-crossing rate, to distinguish user speech from background noise in an audio input. However, start-pointing and end-pointing can be significantly compromised when user speech overlaps with spurious background conversation. Spurious background conversation can also be referred to as babble noise. Babble noise can share the same frequency spectrum as user speech and thus can create co-channel interference, making it difficult to determine when user speech starts or ends within an audio input. Without accurate start-pointing and end-pointing, it can be difficult for a virtual assistant to accurately process audio input, which can lead to output errors, incorrect actions being performed, and/or burdensome requests to clarify the user's intent.
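The limitation described above can be illustrated with a minimal sketch of an energy-based end-pointer. This is not taken from any particular system: the function names (`frame_features`, `find_endpoints`, `synth`), the 8 kHz sample rate, the 160-sample frame length, the fixed energy threshold, and the use of pure tones as stand-ins for user speech and babble noise are all assumptions chosen for illustration. Zero-crossing rate is computed alongside short-time energy, but only the energy is used for the decision.

```python
import math
import random

SAMPLE_RATE = 8000   # assumed sample rate in Hz
FRAME_LEN = 160      # assumed frame length: 20 ms at 8 kHz

def frame_features(samples, frame_len=FRAME_LEN):
    """Return (short_time_energy, zero_crossing_rate) for each non-overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        feats.append((energy, crossings / (frame_len - 1)))
    return feats

def find_endpoints(samples, frame_len=FRAME_LEN, energy_threshold=0.01):
    """Return (first, last) frame index whose energy exceeds the threshold, or None."""
    active = [i for i, (energy, _zcr) in enumerate(frame_features(samples, frame_len))
              if energy > energy_threshold]
    return (active[0], active[-1]) if active else None

def synth(with_babble):
    """Build 30 frames: faint noise throughout, a 200 Hz 'speech' tone in frames 10-19,
    and optionally a 170 Hz 'babble' tone across the whole input (both hypothetical)."""
    rng = random.Random(0)
    out = []
    for n in range(30 * FRAME_LEN):
        t = n / SAMPLE_RATE
        x = rng.uniform(-0.01, 0.01)
        if 10 * FRAME_LEN <= n < 20 * FRAME_LEN:
            x += 0.5 * math.sin(2 * math.pi * 200 * t)
        if with_babble:
            x += 0.4 * math.sin(2 * math.pi * 170 * t)
        out.append(x)
    return out

clean_span = find_endpoints(synth(with_babble=False))
babble_span = find_endpoints(synth(with_babble=True))
print("quiet background:", clean_span)
print("with babble:", babble_span)
```

With a quiet background, the threshold isolates the frames containing the tone that stands in for user speech. Once the synthetic babble is added at a comparable level and in a nearby frequency range, every frame exceeds the threshold, so the detected span widens to the whole input and the true start and end points are lost.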