This disclosure relates generally to computerized databases, and more specifically, to a voice command-driven system for selecting and controlling content in a dynamic list stored in a database.
Modern computing devices are able to access a vast quantity of information, both via the Internet and from other sources. Functionality for such devices is increasing rapidly, as mobile computing devices are able to run software applications to perform various tasks and provide different types of information. However, users who wish to operate a computing device while concurrently performing other activities (e.g., operating a vehicle, riding a bicycle, exercising, etc.), are visually impaired or disabled in some manner, or simply wish to rest their eyes while interacting with the device, may have difficulty interfacing effectively with their devices due to limited or no ability to read a display screen or physically interact with the device (e.g., using a keyboard, mouse, touch screen, etc.).
Many modern computing devices include functionality that enables a user to interact with the device using natural language, rather than employing a conventional manual user interface (e.g., menus or programmed commands). Most of the popular natural language voice recognition systems for mobile computing devices and consumer products today, such as Apple Inc.'s Siri® and Amazon.com, Inc.'s Amazon Echo®, rely on passing phonemes of speech over the Internet to a cloud-based automated speech recognition (ASR) system to decipher the phonemes as words (commonly known as “speech-to-text” (STT)). Powerful servers use natural language recognition (NLR) to then decipher meaning from the spoken utterances. However, these ASR and NLR systems do not function without a live Internet connection to pass the STT input from the user's device to the ASR/NLR server(s) and then back to the user's device for the intended results/actions.
Some mobile computing devices utilize command-driven ASR systems that allow spoken interaction to control the system on the mobile device without requiring a connection to the Internet. Command-driven ASR systems typically rely on a limited vocabulary list of words at any given time during the course of interaction by the user and may be part of an embedded system within a mobile device that does not require a remote server to translate the STT to control the system. In such embedded systems, the user is predominantly accessing a limited type of data (e.g., phone numbers, music, etc.) that is generally known to the user at the time of a voice command input. Systems that rely on commands, however, shift the burden to the user to remember different commands or keywords in a dynamic implementation of the vocabulary list, thus making it more difficult for the user to know, remember or guess the commands needed for useful control and interaction. For this reason, conventional embedded, command-driven ASR systems are suitable only for limited applications in mobile devices (e.g., retrieving phone numbers or email addresses, selecting music, or requesting directions to a specific address) where the vocabulary list is limited, finite, and generally known by the user.
Conventional command-driven, embedded ASR systems are not suitable for more complex applications requiring a large vocabulary due to the limited computational, memory and battery resources of mobile computing devices. As the vocabulary required for responses increases or varies, the accuracy of the speech recognition decreases in embedded ASR systems. Therefore, it is desirable to reduce the number of commands to increase the accuracy of the embedded ASR system.
In addition, there are many applications that require large vocabularies, oftentimes without the ASR system or the user knowing in advance what vocabulary is required. For instance, in the context of news feeds, such as Atom and Really Simple Syndication (RSS) feeds, a list of current headlines for news content is dynamic, including vocabulary that is essentially limitless and not readily known to the system or user in advance. Because certain words are harder for an embedded ASR to recognize, interpretation of STT would typically be offloaded over the Internet to an external server having greater processing power.
Another area that adds complexity is the interaction with an ASR system using the microphone and speaker of a mobile device. Because the microphone is typically close to the speaker on most mobile devices, the ASR system can erroneously act upon its own text-to-speech (TTS) output or ambient sounds when “listening” for a voice command from the user. Additionally, it can be a challenge for the user to know when to speak while interacting with a TTS list and relying on an erratic pause delay in the TTS between items of varied-length content, such as, for example, a list of news headlines. The user does not know that the TTS of an individual item of content has concluded until the pause occurs, which delays the user's response. The pause length between the TTS of content can be set to address the time needed for the user, but this either still requires a lot of attention from the user to respond quickly enough to speak to initiate a selection, or increases the overall time it takes for the user to navigate through the list of content.
Accordingly, there is a need for a voice command-driven, embedded ASR system that does not depend on an Internet connection and that allows a user to control the TTS playback of a dynamic list, including pause length, using a limited number of simple voice commands to navigate dynamic, unknown content stored in a database.