The present invention relates to accessing information over a wide area network such as the Internet. More particularly, the present invention relates to web-enabled recognition that allows information and control on a client-side device to be entered using a variety of methods.
Small computing devices such as personal information managers (PIMs) and portable phones are used with ever-increasing frequency by people in their day-to-day activities. With the increase in processing power now available in the microprocessors used to run these devices, the functionality of these devices is increasing, and in some cases merging. For instance, many portable phones can now be used to access and browse the Internet as well as to store personal information such as addresses, phone numbers and the like.
Because these computing devices are used for browsing the Internet, or are used in other server/client architectures, it is necessary to enter information into the computing device. Unfortunately, due to the desire to keep these devices as small as possible so that they are easily carried, conventional keyboards having all the letters of the alphabet as isolated buttons are usually not possible given the limited surface area available on the housings of the computing devices.
Recently, voice portals, such as those using VoiceXML (Voice Extensible Markup Language), have been advanced to allow Internet content to be accessed using only a telephone. In this architecture, a document server (for example, a web server) processes requests from a client through an interpreter. The web server can produce documents in reply, which are processed by the interpreter and rendered audibly to the user. Using voice commands through voice recognition, the user can navigate the web. A disadvantage of these systems is that once the page has been provided to the client device, the user has little, if any, control over which portions of the dialog formed with the user will be executed.
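By way of illustration, the server/interpreter exchange described above might be carried by a VoiceXML document such as the following minimal sketch (the grammar file name and submission URL are hypothetical placeholders, not part of any particular system):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- The interpreter renders the prompt audibly to the caller. -->
  <form id="order">
    <field name="drink">
      <prompt>Would you like coffee or tea?</prompt>
      <grammar src="drink.grxml" type="application/srgs+xml"/>
      <filled>
        <!-- The recognized value is submitted back to the document server. -->
        <submit next="http://example.com/order" namelist="drink"/>
      </filled>
    </field>
  </form>
</vxml>
```

Note that the dialog flow (prompt, recognize, submit) is fixed in the document itself, which is the source of the limited user control noted above.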
In another form of client-side device, speech recognition is used in conjunction with a display. For example, a user can complete a form or otherwise provide information by indicating the fields on the display for which subsequent spoken words are to be provided as input. In this mode of data entry, the user generally controls when to select a field and provide corresponding information. After selecting a field, the user provides input for the field as speech. This form of entry, which uses a screen display, allows free-form selection of fields, and employs voice recognition, is called “multi-modal”.
Although the multi-modal technique of data entry is convenient, there are some limitations or difficulties with this type of input. For example, after selecting a field and providing speech, if nothing appears in the field, the user may not know the cause of the problem: was the user not heard, or was the input not recognized? Other difficulties arise when the user selects a field but does not know what to say. In yet another situation, problems can occur if the user provides speech input but an error occurs in part of the recognized response, for example, when the user has spoken the digits of a credit card number and only one or some of the digits are recognized in error.
In a voice-only environment as described above, techniques are provided to handle these situations. For instance, if the user is not heard, a prompt can be replayed. Also, confirmation is often used to ensure the accuracy of recognized input speech: after a user provides speech input, the system audibly returns the recognized result and asks if it is correct. In a situation where the recognized result is partly in error, the system may break up the request for the information into smaller portions and use confirmation for each portion. However, simply applying the dialog used in the voice-only environment would not solve these problems in the multi-modal environment, due to the free-form navigation that the multi-modal environment provides.
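The voice-only confirmation technique just described can be sketched in VoiceXML form markup as follows (the field names and grammar files are illustrative assumptions only):

```xml
<form id="confirm_card">
  <field name="card_digits">
    <prompt>Please say your card number.</prompt>
    <grammar src="digits.grxml" type="application/srgs+xml"/>
  </field>
  <field name="ok">
    <prompt>I heard <value expr="card_digits"/>. Is that correct?</prompt>
    <grammar src="yesno.grxml" type="application/srgs+xml"/>
    <filled>
      <if cond="ok == 'no'">
        <!-- On rejection, clear both fields so the dialog re-prompts,
             or branch to a dialog that collects smaller portions. -->
        <clear namelist="card_digits ok"/>
      </if>
    </filled>
  </field>
</form>
```

Because such a dialog executes strictly in document order, it cannot accommodate a user who freely selects fields on a display, which is the mismatch with the multi-modal environment noted above.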
Thus, there is an ongoing need to improve the methods used to provide input in multi-modal or voice-only environments. A system and method that addresses one, several or all of the aforementioned problems would be particularly helpful.