Voice-based user interfaces are increasingly being used in the control of computers and other electronic devices. One particularly useful application of a voice-based user interface is with portable electronic devices such as mobile phones, watches, tablet computers, head-mounted devices, and virtual or augmented reality devices. Another useful application is with vehicular electronic systems such as automotive systems that incorporate navigation and audio capabilities. Such applications are generally characterized by non-traditional form factors that limit the utility of more traditional keyboard or touch screen inputs and/or usage in situations where it is desirable to encourage a user to remain focused on other tasks, such as when the user is driving or walking.
Voice-based user interfaces have continued to evolve from early rudimentary interfaces that could only understand simple and direct commands to more sophisticated interfaces that respond to natural language requests and that can understand context and manage back-and-forth dialogs or conversations with users. Many voice-based user interfaces incorporate both an initial speech-to-text (or voice-to-text) conversion that converts an audio recording of a human voice to text, and a semantic analysis that analyzes the text in an attempt to determine the meaning of a user's request. Based upon a determined meaning of a user's recorded voice, an action may be undertaken such as performing a search or otherwise controlling a computer or other electronic device.
The computing resource requirements of a voice-based user interface, e.g., in terms of processor and/or memory resources, can be substantial. As a result, some conventional voice-based user interface approaches employ a client-server architecture where voice input is received and recorded by a relatively low-power client device, the recording is transmitted over a network such as the Internet to an online service for speech-to-text conversion and semantic processing, and an appropriate response is generated by the online service and transmitted back to the client device. Online services can devote substantial computing resources to processing voice input, enabling more complex speech recognition and semantic analysis functionality to be implemented than could otherwise be implemented locally within a client device. However, a client-server approach necessarily requires that a client be online (i.e., in communication with the online service) when processing voice input. Particularly in mobile and automotive applications, continuous online connectivity may not be guaranteed at all times and in all locations, so a client-server voice-based user interface may be disabled in a client device whenever that device is “offline” and thus unconnected to an online service. Furthermore, even when a device is connected to an online service, the latency associated with online processing of a voice input, given the need for bidirectional communications between the client device and the online service, may be undesirably perceptible to a user.