This invention relates to speech encoding in a client server interactive voice response (IVR) system and especially in a voice enabled hand held, thin client device, such as a laptop, personal data assistant or mobile phone.
For voice enabled hand held devices, such as personal data assistants and mobile telephones, access to back end services providing typical e-business and traditional telephony services is becoming more pervasive. For instance, a mobile telephone user may want to access bank account details through an automated telephone service that uses DTMF key recognition. The user dials the telephone number of the bank's computer server and is asked to press certain keys in response to various prompts including menu options, account details and passwords. This approach relies on text or DTMF key input, forcing the user away from the benefits of hands free operation, which is particularly necessary for mobile users in cars. Furthermore, on the many types of mobile phone where the keypad is located by the speaker, it can be disruptive to move the phone away from one's ear to enter information.
One solution is to provide speech recognition as part of the telephony application back end service; for example, three different telephony applications with speech recognition capability are shown in FIG. 1. First, a text application is a messaging service which requires a user to enter a text message using the keys of the mobile device; the user speaks to the application and an IVR converts the speech to ASCII text for input to the application. A second application uses DTMF tones for a telephone banking service and requires the user to press certain keys in response to prompts to progress through the application. The user speaks to the application and the speech is converted into DTMF tones for the application. A third application uses Wireless Application Protocol (WAP) and uses a local speech to WAP converter.
A user can connect to the network using a mobile phone over a wireless link to a base station physically connected to the network or using a land phone connected to the network. A user may connect to a telephony server using a voice enabled personal data assistant using a wireless link to a base station connected to the network or by a voice enabled computer Internet telephony connection via an Internet gateway to the telephony network. Alternatively, the telephony applications may themselves be Internet enabled and accessed using Internet telephony.
Server voice recognition applications have a voice recognition function at the server end and suffer from the disadvantage that the speech signals deteriorate over transmission channels especially for analogue transmission. The capacity of a speech recognition server will limit the number of users. Also the server will be limited in the types of mobile device formats communicating with telephony applications.
According to one aspect of the invention there is provided a method of communication for a speech enabled remote telephony device comprising the following steps: receiving user speech input into the remote device; performing speech recognition to convert the speech into text; providing a set of tone responses expected by the application; mapping the text onto an allowed tone response; and transmitting the allowed tone response over the voice channel to an interactive voice response (IVR) telephony application, said application requiring a tone input format as part of an interactive dialogue with the user whereby the application receives an allowable tone response from the user speech input.
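By way of illustration only, the mapping step of this aspect might be sketched as follows. The vocabulary table WORD_TO_KEY and the function map_text_to_allowed_response are hypothetical names introduced for this sketch and do not appear in the specification; the recognised text is assumed to arrive as space-separated words from the speech recogniser.

```python
# Hypothetical vocabulary (an assumption for this sketch): spoken words
# mapped to the telephone keys whose tones the application expects.
WORD_TO_KEY = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "star": "*", "hash": "#",
}


def map_text_to_allowed_response(recognised_text, allowed_responses):
    """Map recognised text onto a tone response the application allows.

    Returns the key sequence (e.g. "2") when every word is in the
    vocabulary and the resulting sequence is one of the responses the
    application expects at this point in the dialogue; otherwise None.
    """
    keys = []
    for word in recognised_text.lower().split():
        key = WORD_TO_KEY.get(word)
        if key is None:
            return None  # word outside the vocabulary
        keys.append(key)
    candidate = "".join(keys)
    return candidate if candidate in allowed_responses else None
```

For a menu prompt that expects one of the keys 1, 2 or 3, the spoken word "two" would map to the key "2", whereas "nine" would be rejected because it is not an allowed response for that prompt.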
According to another aspect of the invention there is provided a method of communication for a speech enabled remote telephony device comprising the following steps: receiving user speech input into the remote device; performing speech recognition to convert the speech into text; converting the text into tones; and transmitting the tones over the voice channel to an interactive voice response (IVR) telephony application, said application requiring a tone input format as part of an interactive dialogue with the user.
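The conversion of text into tones may, for example, be sketched as below, assuming the tones are standard DTMF signals, in which each key is signalled as the sum of one low-group and one high-group sine wave. The function names, the 8000 Hz sample rate and the 0.1 second tone duration are assumptions made for this sketch, not values taken from the specification.

```python
import math

# Standard DTMF frequency groups in Hz.
ROW_HZ = (697, 770, 852, 941)
COL_HZ = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")


def dtmf_frequencies(key):
    """Return the (low, high) frequency pair in Hz for a single key."""
    if len(key) != 1:
        raise ValueError("expected a single key, got %r" % (key,))
    for row_index, row in enumerate(KEYPAD):
        col_index = row.find(key)
        if col_index != -1:
            return ROW_HZ[row_index], COL_HZ[col_index]
    raise ValueError("not a DTMF key: %r" % (key,))


def dtmf_samples(key, duration_s=0.1, sample_rate=8000):
    """Return audio samples in the range [-1.0, 1.0] for one key.

    duration_s and sample_rate are illustrative defaults only.
    """
    low, high = dtmf_frequencies(key)
    count = round(duration_s * sample_rate)
    return [
        0.5 * (math.sin(2 * math.pi * low * i / sample_rate)
               + math.sin(2 * math.pi * high * i / sample_rate))
        for i in range(count)
    ]
```

The samples for each key in the converted text would then be played in sequence, with short silences between them, into the voice channel.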
This allows a remote device to communicate over a transmission channel with a destination server application and to convert the device input format, be it text, speech or video, into the server application format for a more robust transmission. Furthermore, the interactive voice application is enabled with voice recognition without voice recognition functionality in the application. The remote device may be a standard telephone over a fixed land telephony link, a mobile cellular phone or a mobile computer such as a laptop or personal data assistant over a wireless communication link, or any client device remote from a server. The remote device has a microphone for speech capture, but alternatively the mobile device may have a video camera for image capture whereby an interpretation of the image is converted into tones and the tones are transmitted to a telephony server.
The input format may be speech and the mobile device has speech recognition capacity that converts speech into ASCII text or DTMF tones or WAP format. The user may select the output format to be transmitted to the application. Alternatively, the mobile device may select the output format. The device may select the output format depending on a user command, on interaction with a server application, or on a setting stored on a mobile device associated with a server application.
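One possible way of selecting the output format is sketched below. The specification does not state an order of precedence between the three sources, so the order used here (user command, then server interaction, then stored setting) is an assumption, as are the names OutputFormat and select_output_format and the fallback to DTMF.

```python
from enum import Enum


class OutputFormat(Enum):
    ASCII_TEXT = "ascii"
    DTMF = "dtmf"
    WAP = "wap"


def select_output_format(user_command=None, server_hint=None,
                         stored_settings=None, application_id=None,
                         default=OutputFormat.DTMF):
    """Choose the format in which the recognition result is transmitted.

    Assumed precedence: an explicit user command, then a format learned
    from interaction with the server application, then a setting stored
    on the device for that application, then a default.
    """
    if user_command is not None:
        return user_command
    if server_hint is not None:
        return server_hint
    if stored_settings and application_id in stored_settings:
        return stored_settings[application_id]
    return default
```

For instance, a device holding a stored setting that associates a banking application with DTMF would transmit tones to that application unless the user or the server indicated otherwise.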
The application may be a financial service or a voice messaging service whereby the caller interacts with the automated service via limited option menus to decide what information is required and what service is wanted. The speech capability may be low function discrete recognition (individual digits 0-9, yes, no, help, etc.). The speech capability may be word spotting from a vocabulary. The speech capability may be continuous recognition. The speech recognition capability is determined by the processing and storage capacity of the device.
This specification relates to the enhancement of a local device as described with speech input capabilities, allowing the user to interact in hands free mode, rapidly and naturally with any remote service. By converting the initial recognition result on the local device to the appropriate format (e.g. WAP commands or DTMF signals), the client-to-server transmission of user requests is rendered more robust over the transmission channel, and additionally, there is no specific requirement for the server to be speech enabled per se since the speech-to-text conversion is effected in the local client device.
Several advantages follow from the above approach. Remote services are speech-enabled without modification to the remote server. Command and control information passed from the client to the server benefits from robust and industry standard transmission. Personalisation is effected at the client end without the need for additional processing at the server end. Users are offered natural and easy to use hands free operation without loss of quality of service. Users are offered transparent multi-mode accessibility without the service provider needing to make expensive or mode specific changes to their services.