The disclosure relates to voice-controlled devices. More particularly, the methods and systems described herein relate to functionality for voice-based programming of a voice-controlled device.
There has been an increase in both the adoption and the feature sets of voice-activated devices in consumer and commercial markets. Devices such as AMAZON ECHO and GOOGLE HOME currently support network-connected applications and interfaces that can be used to control home lighting, entertainment systems, and specially-programmed appliances. Other devices allow for simple question and answer style conversations between the user and the device, with the device programmed to provide voice responses to user utterances; conventional applications of such devices include playing media, searching for data, and simple conversation (e.g., the device allows the user to ask the device to tell a joke or provide a weather forecast, and the device complies).
However, such voice-activated device applications can be personalized only to a limited extent by the end user, based on a fixed and pre-defined set of vendor-supported features. Adding new applications or modifying the functionality of existing applications can typically only be accomplished in conventional systems through use of vendor-supported, text-based programming languages. Although many conventional systems provide functionality for improving the accuracy with which audio input is interpreted (e.g., via expanded or customized vocabulary sets), conventional systems do not typically provide for creation of new programs for execution by the voice-activated and voice-response devices, much less via a verbal dialog with the device, in spite of the device's conventional capability to receive and respond to verbal commands. As a result, end users and organizational adopters of conventional voice-activated devices who are not skilled in conventional programming languages face significant barriers in extending and adding functionality to the devices.
Furthermore, conventional voice-activated and voice-response devices typically require a network connection in order to perform natural language processing of user utterances. For example, some such devices constantly monitor all human utterances made within range of a microphone in the device and, upon determining that a particular utterance includes a particular word or phrase, begin transmitting subsequent utterances over a network to a remotely located machine providing natural language processing services. Such devices typically rely upon or require word-level, speech-to-text translations of audio input, which demand a level of speed, accuracy, and processing power in excess of the limited natural language processing available in the voice-activated and voice-response devices themselves; therefore, this use of network connectivity provides improved natural language processing and improves the utility of the device. However, leveraging remote processing over a network raises additional concerns, such as transmission reliability and the utility of the device without a network connection, as well as privacy and security concerns regarding the transmission of non-public, conversational utterances to a remote computing service.
In some vendor-supplied application development environments, the application programmer must create and maintain multiple parts of an application, which may in turn have to be written in different programming languages; furthermore, a different development environment may have to be used to create and maintain each part. Using different programming languages and different development environments is not only inefficient but creates the problem of keeping all the parts in synchronization. Thus, using these vendor-supplied application development environments may require knowledge and expertise not typically possessed by end users who are not skilled in computer programming. Furthermore, the application programmer is required to acquire and become skilled in the use of a network-connected computer in order to communicate with the vendor's backend program development services. As a result of these and other such barriers, non-technical users of voice-controlled devices are effectively prohibited from adapting and extending the functionality of these devices.
Although techniques are known that minimize or eliminate the need to perform word-level natural language processing of audio signals, such techniques are not conventionally used to allow speakers to create new programs executable by voice-controlled devices, modify existing programs executable by voice-controlled devices, modify the data structures stored by voice-controlled devices, or otherwise interact with the voice-controlled device using audio input to generate and execute computer programs.
Historically, voice-controlled devices formed or were part of systems such as interactive phone systems in which a non-programmer user neither owns the device nor wishes to speak with the device, much less possesses the skills or permissions necessary to modify or extend the capabilities of the systems. For instance, a typical user trying to reach a customer service representative by calling into an interactive phone system does not wish to ask the phone system what the weather is or if it can play a certain song or share a knock-knock joke; the typical user of such a device wishes to keep the interaction with the device as short as possible and limited to a specific, structured, and pre-defined interaction. This is in stark contrast to a home robot, industrial control panel, self-driving vehicle, or other voice-controlled and voice-response device, where the typical user engages in a more free-form, conversational interaction with the device. In these cases, it is natural and compelling for the user to wish to personalize and adapt the device to their own needs, desires, and modes of utilization. However, manufacturers of such newer devices have not typically provided the capability to modify or extend the built-in conversational scenarios. Users who wish to engage in more than canned dialog, and who do not wish to keep the conversation artificially short, therefore typically lack a way to develop their own functionality and programs to extend the utility of conversing with the device, including through the development of wholly new applications.
Thus, there is a need for improved functionality and ease of use for programming voice-controlled, voice-response devices by users via spoken dialog with the devices being programmed.