Modern computing devices are able to access a vast quantity of information, both via the Internet and from other sources. Functionality for such devices is increasing rapidly, as mobile computing devices are able to run software applications to perform various tasks and provide different types of information. However, modern computing devices primarily rely upon outputting content to a user via a visual screen and acknowledging user input only via that screen. As a result, users who wish to operate a computing device while concurrently performing other distracting or strenuous activities (e.g., operating a vehicle, riding a bicycle, exercising, etc.), who are visually impaired or otherwise disabled, or who simply wish to rest their eyes while interacting with the device may have difficulty interfacing effectively with their devices, due to a limited or no ability to read a display screen or to physically interact with the device using existing physical input methods.
Some modern computing devices include functionality that enables a user to interact with the device using spoken natural language, rather than employing a conventional manual user interface. Most of the popular natural language voice recognition systems for mobile computing devices and consumer products today, such as Apple Inc.'s Siri® and Amazon.com, Inc.'s Amazon Echo®, utilize command-driven automatic speech recognition (ASR) systems that allow spoken interaction to control the system on the mobile device. Existing systems do not provide a sustained interaction predicated on the first action initiated by the user, but rather respond with a single result—for example, playing a song, or providing a single fact that is the answer to a question.
Command-driven ASR systems typically rely on a limited vocabulary list of words at any given time during the course of interaction by the user and may be part of an embedded system within a mobile device that does not require a remote server to perform the speech-to-text (STT) translation used to control the system. In such embedded systems, the user is predominantly accessing a limited type of data (e.g., phone numbers, music, etc.) that is generally known to the user at the time of a voice command input.
Systems that rely on commands, however, shift to the user the burden of remembering the various commands or keywords in a dynamically changing vocabulary list, increasing the difficulty of knowing, remembering, or guessing the commands needed for useful control and interaction. For this reason, conventional embedded, command-driven ASR systems are suitable for limited applications in mobile devices (e.g., retrieving phone numbers or email addresses, selecting music, or requesting directions to a specific address) where the vocabulary list is limited, finite, and generally known by the user.
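The limitation described above can be illustrated with a minimal sketch of command-driven recognition against a fixed vocabulary list. The grammar entries and function names below are hypothetical, not drawn from any particular system: a transcript is only actionable when its content words appear in the grammar, which is why dynamic or unknown content defeats such a system.

```python
# Hypothetical fixed grammar: each command maps to the labels it can accept.
COMMAND_GRAMMAR = {
    "call": {"mom", "office", "voicemail"},    # known contact labels
    "play": {"jazz", "podcast", "favorites"},  # known media labels
    "navigate": {"home", "work"},              # known destinations
}

def interpret(transcript: str):
    """Map a transcript to (command, argument), or None if out of vocabulary."""
    words = transcript.lower().split()
    if not words or words[0] not in COMMAND_GRAMMAR:
        return None
    command, argument = words[0], " ".join(words[1:])
    return (command, argument) if argument in COMMAND_GRAMMAR[command] else None

print(interpret("play jazz"))            # recognized: ("play", "jazz")
print(interpret("play the new single"))  # None: argument outside the list
```

Note that the second utterance is perfectly meaningful to a human, yet falls outside the enumerated vocabulary and is therefore rejected, mirroring why the user must know the valid commands in advance.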
Conventional command-driven, embedded ASR systems are not suitable for more complex applications requiring a large vocabulary due to the limited computational, memory and battery resources of mobile computing devices. As the vocabulary required for responses increases or varies, the accuracy of the speech recognition decreases in embedded ASR systems. In addition, there are many applications that require large vocabularies, oftentimes without the ASR system or the user knowing in advance what vocabulary is required.
Another area that adds complexity is interaction with an ASR system through the microphone and speaker of a device. Because the microphone is typically close to the speaker on most mobile devices, the ASR system can erroneously act upon its own TTS (text-to-speech) or spoken output, or upon ambient sounds, if it is simultaneously “listening” for a voice command from the user. Additionally, it can be a challenge for the user to know when to speak while interacting with a TTS list, because the user must rely on an erratic pause delay in the TTS between items of varied length. Without such a delay, the user cannot tell that the TTS of an individual item has concluded in time to respond. The pause length between items can be set to address the time needed by the user, but a short pause still requires close attention for the user to speak quickly enough to initiate a selection, while a long pause increases the overall time it takes for the user to navigate through the list of content.
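The pause-length trade-off described above can be quantified with a short, illustrative calculation. The function and figures below are assumptions chosen for the example, not measurements from any actual device: a single fixed pause must trade the user's reaction window against total traversal time, because list items of varying length give the user no other cue when to speak.

```python
def time_to_traverse(item_durations, pause_s, items_heard):
    """Seconds to hear the first N list items when a fixed pause follows each."""
    return sum(item_durations[:items_heard]) + pause_s * items_heard

# Hypothetical TTS playback time per result description, in seconds.
durations = [2.0, 5.5, 3.0, 8.0]

# A short pause keeps the list quick but gives the user little time to react;
# a long pause is forgiving but inflates navigation time for every item.
print(time_to_traverse(durations, pause_s=0.5, items_heard=4))  # 20.5
print(time_to_traverse(durations, pause_s=3.0, items_heard=4))  # 30.5
```

Even in this four-item example, lengthening the pause from half a second to three seconds adds ten seconds to the traversal, and the cost grows linearly with list length.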
To address the spoken voice feedback loop, some digital personal assistants utilize ASR systems that are always listening but require the user to say a keyword to let the system know that the user is initiating voice interaction. This creates an awkward interaction because the user cannot continue with the system after receiving a response without repeating the keyword. It relegates these systems to a form of communication that resembles amateur radio.
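The keyword-gating behavior can be sketched in a few lines. The wake word and function below are hypothetical, used only to illustrate why each conversational turn must be re-initiated: transcripts are discarded unless they begin with the keyword, so a natural follow-up spoken without it is simply ignored.

```python
WAKE_WORD = "assistant"  # hypothetical wake word for illustration

def gate(transcript: str):
    """Return the user's request if prefixed by the wake word, else None."""
    words = transcript.lower().split()
    if words and words[0] == WAKE_WORD:
        return " ".join(words[1:])
    return None  # dropped: the system never hears the follow-up

print(gate("assistant what's the weather"))  # "what's the weather"
print(gate("and tomorrow?"))                 # None: follow-up is dropped
```

The second utterance is a natural continuation of the first exchange, yet the gate discards it, forcing the user back into the turn-taking pattern the passage above likens to amateur radio.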
Additionally, natural language systems are capable of deciphering the meaning of a user query and providing a series of result descriptions correlating with the query. However, these systems do not offer a way for the user to continue using spoken input to select one of the results from the list and initiate the presentation of the content associated with a particular result description, as well as to traverse back to the list of result descriptions and interact with another result and its associated content, all by way of spoken input.
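The missing capability, sustained spoken navigation between a result list and the content behind each result, can be sketched as a small state machine. This is a hedged illustration under assumed names, not a description of any existing or claimed system: a handful of utterances moves between a "list" state and a "content" state, including traversal back to the list.

```python
class VoiceNavigator:
    """Toy two-state navigator: a result list and the content behind each result."""

    def __init__(self, results):
        self.results = results  # list of (description, content) pairs
        self.state = "list"     # "list" or "content"
        self.index = 0

    def handle(self, utterance: str) -> str:
        """Apply one spoken utterance; return the text the TTS would speak next."""
        if self.state == "list":
            if utterance == "next":
                self.index = (self.index + 1) % len(self.results)
                return self.results[self.index][0]
            if utterance == "select":
                self.state = "content"
                return self.results[self.index][1]
            return self.results[self.index][0]  # anything else: repeat description
        if utterance == "back":                 # traverse from content to the list
            self.state = "list"
            return self.results[self.index][0]
        return self.results[self.index][1]

nav = VoiceNavigator([("First result", "Full article one"),
                      ("Second result", "Full article two")])
print(nav.handle("next"))    # "Second result"
print(nav.handle("select"))  # "Full article two"
print(nav.handle("back"))    # "Second result"
```

Note that the entire loop—browse, select, consume, and return—uses only three utterances, which anticipates the minimal-command system called for in the following paragraph.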
Accordingly, there is a need for a simple command system with a minimal number of commands or equivalent command-actions that allows the user to easily interact and control the system in a sustained, interactive manner as well as navigate dynamic, unknown content.