Users have become accustomed to interacting with software applications through a keyboard and pointing devices. Software application developers, in turn, have become accustomed to providing screen devices (e.g., HTML widgets such as text boxes, dropdown menus, and buttons) to suggest possible actions a user can take and/or commands a user can issue when interacting with the application using such devices as a keyboard and mouse.
In many cases, it might be more convenient for a user to interact using voice input, such as by using spoken application commands (e.g., “SAVE”) and/or spoken selections (e.g., “CHOOSE MENU ITEM THREE”) and/or spoken navigation commands (e.g., “NEXT PAGE”). It is desirable to offer a user the flexibility to issue commands using a keyboard and/or a pointing device, and/or to issue voice commands. Such flexibility would improve the user interface and would provide a more user-friendly experience. This is especially true in an enterprise setting or similar contexts where, for example, a user can navigate between different work areas comprising a suite of enterprise applications by merely uttering a navigation command (e.g., “GO BACK” or “CANCEL”).
Unfortunately, automatic speech recognition has long been plagued with problems such as recognition failures resulting from (1) speaker dependence, (2) difficulties in disambiguating between similar-sounding words, (3) recognition of not-so-common or domain-specific terms, and (4) myriad real-time issues when performing recognition and disambiguation.
Legacy approaches have attempted to use word and phrase dictionaries in an effort to reduce the severity of such problems. For example, if a speech recognition system were to be used in a hospital, clinic, or other medical setting, the speech recognition system might include a dictionary of medical terms (e.g., terms of anatomy, names of therapies, names of prescription drugs, etc.). This legacy approach can improve over speech recognition systems that do not use a dictionary. Nevertheless, domain-specific dictionaries can comprise tens of thousands of words (or more), and often the extent (e.g., number of words) of the dictionary works against the desire to provide real-time speech recognition. If it could be predicted what a user is going to say (e.g., what words or phrases the user is likely to utter), then it might be possible to provide a smaller dictionary.
Other legacy approaches rely on pre-coding indications and corresponding aspects of voice-enabled commands into the user interface code (e.g., using V-HTML) in a timeframe prior to delivery of the interface page(s) to a user terminal (e.g., during development of the interface page). Such reliance on pre-coding voice commands has several drawbacks that need to be overcome. For example, pre-coding voice commands forces the developer to pre-determine which commands are to be enabled for voice control (e.g., when using a browser), and how they are to be enabled. This restriction relies too heavily on the developer, and fails in many practical situations, such as when a user interface page is dynamically created (e.g., by a content management system). A further drawback of legacy approaches is that voice-enabled browsers require the user to utter a keyword prior to a command utterance in order for the browser to distinguish pre-coded, page-specific voice commands (e.g., as may be present in the currently-displayed web page) from built-in browser-specific commands such as “BACK” or “RELOAD”. Verbal commands such as “COMPUTER, RELOAD” are cumbersome to users.
What is needed is an efficient and effective way to create a dictionary “on the fly” for enabling voice control of user interface pages of an application, wherein a text form of the command is rendered in the displayed portion of the interface page. None of the aforementioned legacy approaches achieves the capabilities of the herein-disclosed techniques for voice recognition of commands extracted on the fly (e.g., from a user interface description). Therefore, there is a need for improvements.
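Strictly for illustration, the notion of a small, page-specific dictionary built on the fly can be sketched as follows. The sketch assumes that the user interface description is an HTML page and that the command text is the visible label of a button, link, or menu option. The names `CommandExtractor`, `build_dictionary`, and `match_utterance`, as well as the sample page, are hypothetical and are not taken from any disclosed implementation; speech-to-text conversion is assumed to occur elsewhere.

```python
from html.parser import HTMLParser


class CommandExtractor(HTMLParser):
    """Collects the visible labels of actionable elements (hypothetical helper)."""

    # Assumption: these tags are the ones that carry user-visible commands.
    COMMAND_TAGS = {"button", "a", "option"}

    def __init__(self):
        super().__init__()
        self._open_tag = None   # actionable tag currently being read, if any
        self._attrs = {}        # attributes of that tag
        self._parts = []        # text fragments seen inside that tag
        self.commands = {}      # upper-cased label -> (tag, element id)

    def handle_starttag(self, tag, attrs):
        if tag in self.COMMAND_TAGS:
            self._open_tag = tag
            self._attrs = dict(attrs)
            self._parts = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            # Collapse whitespace so "Next   Page" and "Next Page" match.
            label = " ".join("".join(self._parts).split())
            if label:
                self.commands[label.upper()] = (tag, self._attrs.get("id"))
            self._open_tag = None


def build_dictionary(page_html):
    """Build a page-specific command dictionary from the page's own text."""
    extractor = CommandExtractor()
    extractor.feed(page_html)
    extractor.close()
    return extractor.commands


def match_utterance(dictionary, recognized_text):
    """Look up already-recognized text; returns None when no command matches."""
    key = " ".join(recognized_text.split()).upper()
    return dictionary.get(key)


# Hypothetical page fragment, used only to exercise the sketch.
PAGE = """
<form>
  <button id="save-btn">Save</button>
  <button id="cancel-btn">Cancel</button>
  <a id="next-link" href="/page/2">Next Page</a>
</form>
"""

dictionary = build_dictionary(PAGE)
target = match_utterance(dictionary, "next page")
```

In this sketch the dictionary holds only the three commands rendered on the page, rather than tens of thousands of domain terms, and no command has to be pre-coded for voice control by the developer.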