Voice-based user interfaces are increasingly being used in the control of computers and other electronic devices. One particularly useful application of a voice-based user interface is with portable electronic devices such as mobile phones, watches, tablet computers, head-mounted devices, and virtual or augmented reality devices. Another useful application is with vehicular electronic systems such as automotive systems that incorporate navigation and audio capabilities. Such applications are generally characterized by non-traditional form factors that limit the utility of more traditional keyboard or touch screen inputs, and/or by usage in situations where it is desirable to encourage a user to remain focused on other tasks, such as when the user is driving or walking.
The computing resource requirements of a voice-based user interface, e.g., in terms of processor and/or memory resources, can be substantial. As a result, some conventional voice-based user interface approaches employ a client-server architecture where voice input is received and recorded by a relatively low-power client device, the recording is transmitted over a network such as the Internet to an online service for voice-to-text conversion and semantic processing, and an appropriate response is generated by the online service and transmitted back to the client device. Online services can devote substantial computing resources to processing voice input, enabling more complex speech recognition and semantic analysis functionality to be implemented than could otherwise be implemented locally within a client device. However, a client-server approach necessarily requires that a client be online (i.e., in communication with the online service) when processing voice input. Maintaining connectivity between such clients and online services may be impracticable, particularly in mobile and automotive applications where wireless signal strength will no doubt fluctuate. Furthermore, when it is desired to convert voice input into text using an online service, a voice-to-text conversion session must be established between the client and the server. A user may experience significant latency while such a session is established, e.g., 1-2 seconds or more, which may detract from the user experience.
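By way of illustration only, the client-server flow described above might be modeled as in the following minimal sketch. It is a toy simulation under stated assumptions, not an implementation of any particular system: the names `Client`, `OnlineService`, and `OfflineError` are hypothetical, the timing constants are assumed values chosen to fall in the 1-2 second range mentioned above, and the "recording" is simply encoded text rather than audio.

```python
from dataclasses import dataclass

# Hypothetical timing values, assumed for illustration only.
SESSION_SETUP_MS = 1500  # assumed cost of establishing a voice-to-text session
TRANSCRIBE_MS = 300      # assumed cost of one online conversion round trip


class OfflineError(Exception):
    """Raised when the client cannot reach the online service."""


class OnlineService:
    """Stand-in for the online voice-to-text and semantic processing service."""

    def transcribe(self, recording: bytes) -> str:
        # A real service would run speech recognition on audio;
        # in this sketch the "recording" is just UTF-8 encoded text.
        return recording.decode("utf-8")


@dataclass
class Client:
    """Low-power client that forwards recordings to the online service."""

    service: OnlineService
    online: bool = True
    session_open: bool = False
    elapsed_ms: int = 0  # simulated time the user has spent waiting

    def handle_voice_input(self, recording: bytes) -> str:
        # The client-server approach requires the client to be online.
        if not self.online:
            raise OfflineError("no connection to the online service")
        # A session must exist before conversion; setting one up adds latency.
        if not self.session_open:
            self.elapsed_ms += SESSION_SETUP_MS
            self.session_open = True
        self.elapsed_ms += TRANSCRIBE_MS
        return self.service.transcribe(recording)
```

In this sketch, the first call to `handle_voice_input` pays the assumed session setup cost in addition to the conversion cost, later calls pay only the conversion cost, and any call made while `online` is false fails outright, mirroring the latency and connectivity concerns noted above.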