Intelligent automated assistants (or virtual assistants) provide an intuitive interface between users and electronic devices. These assistants can allow users to interact with devices or systems using natural language in spoken and/or text forms. For example, a user can access the services of an electronic device by providing a spoken user input to a virtual assistant associated with the electronic device. The virtual assistant can interpret the user's intent from the spoken user input and operationalize the user's intent into tasks. The tasks can then be performed by executing one or more functions of the electronic device, and a relevant output can be returned to the user in natural language form.
To properly process and respond to a spoken user input, a virtual assistant can first identify the beginning and end of the spoken user input within a stream of audio input using processes typically referred to as start-pointing and end-pointing, respectively. Conventional virtual assistants can identify these points based on energy levels and/or acoustic characteristics of the received audio stream, or based on manual identification by the user. For example, some virtual assistants can require users to input a start-point identifier, either by pressing a physical or virtual button or by uttering a specific trigger phrase, before speaking to the virtual assistant in natural language form. In response to receiving one of these start-point identifiers, the virtual assistant can interpret subsequently received audio as being the spoken user input. While these techniques can be used to clearly identify spoken user input that is directed at the virtual assistant, interacting with the virtual assistant in this way can be unnatural or difficult for the user. For example, in a back-and-forth conversation between the virtual assistant and the user, the user can be required to input the start-point identifier (e.g., by pressing a button or repeating the same trigger phrase) before each spoken user input.
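As an illustration of the conventional energy-based approach described above, the following is a minimal sketch of start-pointing and end-pointing over a buffer of audio samples. It is not taken from any particular virtual assistant; the function names (`frame_energies`, `find_endpoints`), the frame length, the energy threshold, and the `hangover_frames` parameter are all assumptions chosen for the example. The start-point is taken as the first frame whose mean squared amplitude reaches the threshold, and the end-point as the last such frame before a run of quiet frames.

```python
from typing import List, Optional, Sequence, Tuple


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Return the mean squared amplitude of each full, non-overlapping frame."""
    n_frames = len(samples) // frame_len
    return [
        sum(x * x for x in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def find_endpoints(
    samples: Sequence[float],
    frame_len: int = 160,        # assumed: 10 ms frames at a 16 kHz sample rate
    threshold: float = 0.01,     # assumed energy threshold, tuned per device
    hangover_frames: int = 3,    # assumed: quiet frames tolerated before end-pointing
) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample indices of the first utterance, or None.

    `end` is exclusive. Pauses shorter than `hangover_frames` do not end
    the utterance.
    """
    start = None        # index of the first frame at or above the threshold
    last_active = None  # index of the most recent frame at or above the threshold
    quiet = 0           # consecutive below-threshold frames since last_active

    for i, energy in enumerate(frame_energies(samples, frame_len)):
        if energy >= threshold:
            if start is None:
                start = i
            last_active = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= hangover_frames:
                break  # enough trailing silence: the utterance has ended

    if start is None:
        return None
    return start * frame_len, (last_active + 1) * frame_len


if __name__ == "__main__":
    # Synthetic input: 2 silent frames, 3 loud frames, 5 silent frames.
    audio = [0.0] * 320 + [0.5] * 480 + [0.0] * 800
    print(find_endpoints(audio))
```

A sketch like this reacts to any sufficiently loud audio, including background speech not directed at the device, which is one reason such techniques are commonly paired with a button press or trigger phrase.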