Providing devices with the ability to understand spoken language has been a goal since the early days of user interface (UI) development. The accuracy of spoken word recognition has improved dramatically in recent years, and speech recognition is now being practically implemented on many devices.
In applications (apps) that can be controlled through graphical user interfaces (GUIs) (e.g., the touch screen, mouse, keyboard/keypad, etc.), developers may use GUI event handling mechanisms to define the control over the GUIs. GUI events are typically generated by GUI components (e.g., buttons) and received by event listeners/handlers (code that contains the business logic, written by the developer) for processing.
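The event handling mechanism described above can be sketched as follows. This is a minimal illustration only, not taken from any real GUI toolkit: the `Button` class, its `add_click_listener` method, and the `on_email_clicked` handler are hypothetical names chosen for the example.

```python
from typing import Callable, List


class Button:
    """A toy GUI component that generates click events (hypothetical, not a real toolkit)."""

    def __init__(self, label: str) -> None:
        self.label = label
        self._listeners: List[Callable[[str], None]] = []

    def add_click_listener(self, listener: Callable[[str], None]) -> None:
        # The developer registers a handler containing the business logic.
        self._listeners.append(listener)

    def click(self) -> None:
        # A real toolkit would call this when the touch panel reports a press;
        # here it is called directly to simulate the event.
        for listener in self._listeners:
            listener(self.label)


# Records which sharing method was chosen (stands in for real business logic).
shared_via: List[str] = []


def on_email_clicked(label: str) -> None:
    """Event handler written by the developer."""
    shared_via.append(label)


email_button = Button("Email")
email_button.add_click_listener(on_email_clicked)
email_button.click()  # simulate the user tapping the "Email" menu item
```

The component only reports that an event occurred; what the event means is decided entirely by the handler the developer attached to that specific component.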
FIG. 1 illustrates an exemplary UI menu tree structure for the Waze application, which is a popular social navigation app that allows users to share traffic conditions. As with most similar applications, this application utilizes a hierarchical tree menu structure having a main menu that includes a number of menu items, such as “Share” illustrated in the Figure. In a touch screen embodiment of the UI, the user might select the “Share” command and then be presented with further sub-menu options, such as “Email”. Selection of this item allows the user to enter an e-mail utility to communicate with another person. However, these hierarchical tree structures are limited by screen assets, since one can only put so much on a typical small screen; even on big screens, the GUI can still be difficult to use if there are too many GUI components. Thus the hierarchical trees can be difficult for users to navigate.
The difference between touch control and voice control is that, with touch control, a touch input is of a specified and controlled type (e.g., specific button press, slider drag, etc.) and can be easily recognized deterministically through the touch panel. For example, if an application is expecting a user to select one of three options for a purchase, it can present three mutually exclusive radio buttons to the user. When the user selects one, it is clear which of the three expected buttons the user selected. A spoken input, by contrast, is open-ended: the same request can be phrased in many ways, so the application cannot rely on a fixed set of expected inputs.
Voice control provides many benefits that are not present in the traditional touch-screen or keyboard device user-interface, in addition to generally being hands free. With voice control, it is easier for the user to discover and use various functionalities of the application—one does not have to have a detailed knowledge of a command/menu tree structure to know where to locate various functions. A voice UI is not limited by the screen assets available on the device and can help avoid an overload of control elements on a display. Also, a voice UI may be easier to learn how to use, since commands can be issued in natural language, and simple tutorials can be very effective.
For example, a user can rate a particular application without having to know where the “rate” function is within a menu structure, and can set application parameters and values without knowing, e.g., whether a particular setting is in a “general” or “advanced” setting area. It is also easier to navigate functionalities within the application. For example, a user may need only say, “Change language to French”, in order to change the operative language of the application or device—the user does not have to wander through a menu hierarchy to locate this functionality.
Furthermore, voice control allows including multiple parameters in one spoken phrase (e.g., “Rate this application as 5 stars—best app ever!”), or even multiple actions in one spoken phrase: “Report traffic jam and find new route”; or “Find San Francisco Zoo and make it a favorite”. This is simple to do even if the multiple actions would normally be located in two separate branches of a touch UI menu tree.
Take the Waze application as an example. An application programmer may wish to add a voice UI for a common action such as geographic navigation. Once added, the navigation could then be performed by the user with respect to an address (e.g., “drive to 123 N. Main St.”) or to a favorite, such as a friend's home (“drive to Joe Schmoe's house”) (presuming the address is in the user's contact book or otherwise accessible). In addition, an application programmer may wish to add a voice UI for a special action that is defined by the programmer—such as a report action (e.g., “the traffic at my location is heavy”).
Developers of the Waze application have recently added support for a few simple voice commands (e.g., “drive to”, “report traffic”) into Waze. Ideally, an app like Waze should interact with users using natural language (as opposed to strict commands) for most functionalities, if not all, to be fully hands-free. Building such a complete voice UI, however, is a daunting task. Whereas developers have access to some speech recognition engines of fair quality, they have less access to technologies for natural language understanding.
In the prior art, there is little framework or tool support to help developers add a natural-language voice UI to apps. Specifically, existing solutions for voice-enabling applications include speech-to-text application program interfaces (APIs), grammar-based speech engines, and invocation of apps by speech.
Speech-to-text API solutions (e.g., Dragon Mobile, MaCaption) help developers to translate speech input into text. Developers use an API to feed speech input to the speech engine behind the solution, which in turn returns a piece of text (e.g., “How do I get home?”). The developer may then parse the text to figure out the intention of the speech (i.e., what program functionality the user intends to execute, such as finding directions to the home address) and turn the text into actionable commands. Clearly, the developer must still do much work on natural language understanding to process natural language commands in plain text format.
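The parsing burden left to the developer can be illustrated with a short sketch. It assumes the speech engine has already returned plain text; the intent names, the regular-expression patterns, and the `parse_intent` function are hypothetical and are not part of any of the products named above.

```python
import re
from typing import Optional, Tuple

# Hand-written patterns mapping transcribed text to an intent (assumed names).
INTENT_PATTERNS = [
    ("navigate",
     re.compile(r"^(?:drive to|take me to|how do i get(?: to)?)\s+(?P<arg>.+?)\??$")),
    ("report_traffic",
     re.compile(r"^report (?P<arg>traffic.*)$")),
]


def parse_intent(text: str) -> Optional[Tuple[str, str]]:
    """Return (intent, argument) for recognized text, or None if nothing matches."""
    normalized = text.strip().lower()
    for intent, pattern in INTENT_PATTERNS:
        match = pattern.match(normalized)
        if match:
            return intent, match.group("arg")
    return None


# Text as it might be returned by a speech-to-text API.
result = parse_intent("How do I get home?")
unmatched = parse_intent("I need directions home")
```

Here `result` is `("navigate", "home")`, while the equally natural phrasing in `unmatched` yields `None`. Every additional way of phrasing a request needs another hand-written pattern, which is the natural language understanding work the speech-to-text API does not do for the developer.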
Grammar-based speech engine solutions (e.g., iSpeech, LumenVox, Tellme, OpenEars, WAMI) require that developers provide a grammar (in either a standard format such as the Speech Recognition Grammar Specification (SRGS) or a vendor-specific format) characterizing all allowed speech input. Take the following grammar in pseudo code as an example:
 $place = home | office
 $command = drive to ($place)
The above example allows two specific commands only: “drive to home” and “drive to office”. The complexity of the grammar increases dramatically if the developer would like to support a larger set of commands and/or natural language commands (commands in a more flexible format, such as in verbal languages).
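The restrictiveness of such a grammar can be shown by enumerating everything it accepts. The sketch below encodes the two-rule pseudo code grammar as a Python dictionary; this representation and the `expand` and `is_allowed` helpers are assumptions made for illustration and do not follow SRGS or any vendor format.

```python
from itertools import product
from typing import Dict, List

# Each rule maps to a list of alternatives; each alternative is a token sequence.
GRAMMAR: Dict[str, List[List[str]]] = {
    "$place": [["home"], ["office"]],
    "$command": [["drive", "to", "$place"]],
}


def expand(symbol: str) -> List[str]:
    """Return every phrase that a rule (or a literal word) can produce."""
    if symbol not in GRAMMAR:
        return [symbol]  # a literal word expands to itself
    phrases: List[str] = []
    for alternative in GRAMMAR[symbol]:
        # Expand each token, then combine the choices in order.
        parts = [expand(token) for token in alternative]
        phrases.extend(" ".join(combo) for combo in product(*parts))
    return phrases


ALLOWED = expand("$command")


def is_allowed(utterance: str) -> bool:
    """Accept only utterances that the grammar spells out exactly."""
    return utterance.strip().lower() in ALLOWED
```

`ALLOWED` contains exactly "drive to home" and "drive to office", so an utterance such as "take me home" is rejected. Each additional phrasing or destination has to be written into the grammar, and the number of accepted phrases grows as the product of the alternatives at each position.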
Solutions for invoking apps by speech (e.g., Siri, Dragon Go!, Vlingo) provide a means for developers to link their apps with a speech-enabled dispatch center. For example, if an app is linked with Siri, then users can launch the app from Siri using voice commands. However, once the app is launched, no voice commands are supported within the app itself.