The present invention relates generally to control of service applications, more particularly to voice control of service applications, and still more particularly to voice control of service applications from a remote terminal.
The most common type of terminal for Internet access is a conventional personal computer (PC) terminal with a large, high-resolution display and a relatively high data transmission bandwidth. When a user wishes to use an Internet connection to control a service application located at a remote location, he or she typically types commands on the keyboard associated with the PC terminal. The commands are communicated via the Internet to the service application, which can then respond accordingly. The user's PC terminal display permits response information to be displayed in the form of text and/or graphics which can easily be viewed by the user.
The recent standardization of a Wireless Application Protocol (WAP), using the Wireless Markup Language (WML), has enabled terminals with small displays, limited processing power and a low data transmission bandwidth (e.g., digital cellular phones and terminals) to access and control services and content in a service network such as the Internet. WAP is a layered communication protocol that includes network layers (e.g., transport and session layers) as well as an application environment including a microbrowser, scripting, telephony value-added services and content formats. The simple syntax and limited vocabulary of WML make WAP suitable for controlling the service and for interacting with the content from a client terminal with low processing and display capabilities.
While the ability to use these smaller terminals is a major convenience to the user (who can more readily carry them along on various journeys), reading selection menus and other large amounts of text (e.g., e-mail and help text) from a small display and typing in responses on a small keyboard with multi-function keys have some disadvantages. These disadvantages may be largely overcome by the substitution of a voice-controlled interface to the service application. A voice-controlled interface is also useful for providing "hands-free" operation of a service application, such as would be required when the user is driving a car.
Automatic speech recognition (ASR) systems are known. An ASR for supporting a voice-controlled application may be a resource shared among users in a central server, or a resource in the client terminal. The simpler ASR recognizes isolated words separated by pauses, whereas the more advanced ASR is capable of recognizing connected words. The complexity of the ASR increases with the size of the vocabulary that has to be recognized in any particular instance of the dialog with the application.
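The dependence of recognizer complexity on vocabulary size, and the gap between isolated-word and connected-word recognition, can be illustrated with a rough count of candidate hypotheses. The sketch below is only an illustration under simplifying assumptions: it counts every possible word sequence with no pruning or language model, and the function names `isolated_word_hypotheses` and `connected_word_hypotheses` are hypothetical. Practical recognizers prune this search space heavily.

```python
def isolated_word_hypotheses(vocabulary_size):
    """Candidates for one isolated word: one per vocabulary entry."""
    return vocabulary_size


def connected_word_hypotheses(vocabulary_size, max_words):
    """Naive upper bound on candidate word sequences for a connected
    utterance of 1..max_words words (assumption: no pruning at all)."""
    return sum(vocabulary_size ** n for n in range(1, max_words + 1))


if __name__ == "__main__":
    for size in (10, 100, 1000):
        print(
            size,
            isolated_word_hypotheses(size),
            connected_word_hypotheses(size, max_words=3),
        )
```

Even under this crude count, the isolated-word case grows linearly with the vocabulary while the connected-word case grows with the vocabulary size raised to the utterance length, which is consistent with restricting terminal-based recognizers to small vocabularies.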
If the ASR is implemented at a central server, it must be capable of recognizing speech from many users with different languages, dialects and accents. Conventional speaker-independent speech recognition systems normally use isolated-word ASR with a very limited vocabulary (e.g., "yes", "no", "one", "two", etc.) to reduce the amount of required processing and to keep the failure rate low. An alternative for improving the accuracy of recognition is to make the speech recognition adaptive to the user by training the recognizer on each user's individual voice, and asking the user to repeat or spell a misunderstood word. In a multi-user environment, each user's profile must be stored.
A speech recognizer implemented in the terminal has to recognize only one user (or a very small number of users), so adaptive training may be used. The processing required for a connected-word ASR may still be too great to be implemented in the terminal. For example, the processing power of today's mobile terminals (such as those employed in cellular telephone systems, personal digital assistants, and special purpose wireless terminals) is sufficient for implementing an isolated-word ASR with a small vocabulary (e.g., for dialing and for accessing a personal telephone book stored in the terminal). Training may be required for adding new words to the vocabulary.
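By way of illustration only, a trainable isolated-word recognizer with a small vocabulary can be sketched with template matching and dynamic time warping (DTW), a classic technique for speaker-dependent recognition. The text above does not state which technique such terminals use, so the choice of DTW is an assumption, as are the class name `IsolatedWordRecognizer`, the function `dtw_distance`, and the use of plain one-dimensional number sequences in place of real acoustic features.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic-time-warping distance between two 1-D feature sequences."""
    n, m = len(a), len(b)
    # cost[i][j] holds the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch sequence b
                cost[i][j - 1],      # stretch sequence a
                cost[i - 1][j - 1],  # advance both sequences
            )
    return cost[n][m]


class IsolatedWordRecognizer:
    """Hypothetical small-vocabulary recognizer trained by one user."""

    def __init__(self):
        self.templates = {}  # word -> list of stored feature sequences

    def train(self, word, features):
        """Add a new word, or another example of a known word."""
        self.templates.setdefault(word, []).append(list(features))

    def recognize(self, features):
        """Return the word whose stored template is nearest, or None."""
        best_word, best_dist = None, inf
        for word, examples in self.templates.items():
            for example in examples:
                dist = dtw_distance(features, example)
                if dist < best_dist:
                    best_word, best_dist = word, dist
        return best_word


if __name__ == "__main__":
    recognizer = IsolatedWordRecognizer()
    recognizer.train("home", [1, 2, 3, 2, 1])
    recognizer.train("office", [5, 5, 6, 6, 5])
    # A slower utterance of "home" still aligns with its template.
    print(recognizer.recognize([1, 1, 2, 3, 3, 2, 1]))
```

In this sketch, adding a word to the vocabulary requires the user to supply at least one spoken example through `train`, and the recognition cost grows with the number of stored templates, which mirrors the training and vocabulary-size constraints described above.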
A problem that exists in present-day centralized server ASRs is that a voice channel (voice call) has to be established between the terminal and a gateway or server that performs voice recognition. Such a voice channel may introduce distortion, echoes and noise that will degrade the recognition performance.
A centralized ASR is also an expensive and limited network resource that will require a high processing capability, a large database and an adaptive training capability for the individual voices and dialects in order to bring down the failure rate in the recognition process. Because it is a limited resource, the central server or gateway may need to implement a dial-up voice channel access capability.
The new generation of WAP-supported mobile terminals will be able to control and interact with a large variety of services and content. However, the terminal display and keyboard typically have very limited input/output (I/O) capability, thereby making a voice-controlled interface desirable. As explained above, today's low-cost terminals can support some ASR capability, but this is inadequate to support voice access to a multi-user Application Server that will require a large vocabulary or time-consuming training of the recognizer for each application.