Intelligent automated assistants (or digital assistants) provide a beneficial interface between human users and electronic devices. Such assistants allow users to interact with devices or systems using natural language in spoken and/or text forms. For example, a user can access the services of an electronic device by providing a spoken user request to a digital assistant associated with the electronic device. The digital assistant can interpret the user's intent from the spoken user request and operationalize the user's intent into tasks. The tasks can then be performed by executing one or more services of the electronic device, and a relevant output can be returned to the user in natural language form.
Digital assistants can interpret user intent by means of natural language processing. In particular, the user's speech input can be parsed to determine the semantic intent that is most likely implicated by the speech input. To interpret the spoken user request, the digital assistant can determine the beginning and ending of user speech within the received audio input. Detecting the beginning and ending of user speech is referred to as start-pointing and endpointing, respectively. Start-pointing and endpointing can be used to identify the portion of audio input that contains the spoken user request. Endpointing can also be used to determine when to stop receiving audio input. For a digital assistant to interpret and process audio input quickly and accurately, robust endpointing is desired.
Conventional endpointing algorithms typically rely on signal features such as short-time energy and zero-crossing rate to distinguish user speech from background noise in an audio input. However, endpointing can be significantly compromised when user speech overlaps with spurious background conversation. Spurious background conversation can also be referred to as babble noise. Babble noise can share the same frequency spectrum as user speech and thus can create co-channel interference, making it difficult to determine when user speech starts or ends within an audio input. Without accurate endpointing, it can be difficult for a digital assistant to accurately process audio input, which can lead to output errors, incorrect actions, and/or burdensome requests to clarify the user's intent. Further, different users have different speech characteristics, and conventional endpointing cannot take those differences into account. Users who speak slowly, or who include long pauses at certain points in their speech, may find that these natural pauses cause an energy-based endpointer to determine prematurely that user speech has concluded.
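The premature-endpoint problem described above can be illustrated with a minimal sketch of an energy-based endpointer. This is not any particular product's algorithm. The function names, the thresholds (`energy_threshold`, `max_zcr`), the fixed silence timeout (`hangover_frames`), and the synthetic test signal are all assumptions chosen for illustration. The sketch marks a frame as speech when its short-time energy is high and its zero-crossing rate is low, and it declares the endpoint once a fixed number of consecutive non-speech frames has elapsed.

```python
import math


def frame_features(samples, frame_len):
    """Yield (short-time energy, zero-crossing rate) for each full frame."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        yield energy, crossings / (frame_len - 1)


def energy_endpointer(samples, frame_len=80, energy_threshold=0.01,
                      max_zcr=0.5, hangover_frames=30):
    """Return (start_frame, end_frame) of detected speech, or None.

    end_frame is exclusive. All thresholds are illustrative assumptions.
    """
    start = None
    last_speech = None
    for i, (energy, zcr) in enumerate(frame_features(samples, frame_len)):
        if energy > energy_threshold and zcr < max_zcr:
            if start is None:
                start = i          # start-point: first speech-like frame
            last_speech = i
        elif start is not None and i - last_speech >= hangover_frames:
            break                  # endpoint: silence outlasted the timeout
    if start is None:
        return None
    return start, last_speech + 1


def make_demo_signal(frame_len=80, sample_rate=8000):
    """Synthetic stand-in for speech: two 200 Hz bursts split by a pause."""
    def silence(frames):
        return [0.0] * (frames * frame_len)

    def tone(frames):
        return [0.5 * math.sin(2 * math.pi * 200 * k / sample_rate)
                for k in range(frames * frame_len)]

    # 20 frames silence, 50 speech, 40-frame pause, 50 speech, 20 silence
    return silence(20) + tone(50) + silence(40) + tone(50) + silence(20)


signal = make_demo_signal()
# A 30-frame timeout ends on the pause and drops the second burst.
print(energy_endpointer(signal, hangover_frames=30))  # (20, 70)
# A 60-frame timeout rides through the pause and keeps both bursts.
print(energy_endpointer(signal, hangover_frames=60))  # (20, 160)
```

With 10 ms frames, the 40-frame pause lasts 400 ms. A timeout shorter than that cuts the request off early, while a longer timeout delays the response for every user, which is why a single fixed setting cannot suit speakers with different pausing habits. Because the decision depends only on energy and zero-crossing rate, background babble with speech-like energy would also be classified as speech by this sketch.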