The present invention relates generally to systems and methods for conversational computing and, in particular, to systems and methods for building distributed conversational applications using a Web services-based model wherein speech engines (e.g., speech recognition) and audio I/O systems are implemented as programmable services that can be asynchronously programmed by an application using a standard, extensible SERCP (speech engine remote control protocol), to thereby provide scalable and flexible IP-based architectures that enable deployment of the same application or application development environment across a wide range of voice processing platforms and networks/gateways (e.g., PSTN (public switched telephone network), Wireless, Internet, and VoIP (voice over IP)). The invention is further directed to systems and methods for dynamically allocating, assigning, configuring and controlling speech resources such as speech engines, speech pre/post processing systems, audio subsystems, and exchanges between speech engines using SERCP in a web service-based framework.
Telephony generally refers to any telecommunications system involving the transmission of speech information in either wired or wireless environments. Telephony applications include, for example, IP telephony, Interactive Voice Response (IVR), and other voice processing platforms. IP telephony allows voice, data and video collaboration through existing IP telephony-based networks such as LANs, WANs and the Internet as well as IMS (IP multimedia services) over wireless networks. Previously, separate networks were required to handle traditional voice, data and video traffic, which limited their usefulness. Voice and data connections were typically not available simultaneously. Each required separate transport protocols/mechanisms and infrastructures, which made them costly to install, maintain and reconfigure and unable to interoperate. Currently, various applications and APIs are commercially available that enable convergence of PSTN telephony and telephony over Internet Protocol networks and 2.5G/3G wireless networks. There is a convergence among fixed, mobile and nomadic wireless networks as well as with the Internet and voice networks, as exemplified by 2.5G, 3G and 4G.
IVR is a technology that allows a telephone-based user to input or receive information remotely to or from a database. Currently, there is widespread use of IVR services for telephony access to information and transactions. An IVR system typically (but not exclusively) uses spoken directed dialog and generally operates as follows. A user will dial into an IVR system and then listen to audio prompts that provide choices for accessing certain menus and particular information. Each choice is either assigned to one number on the phone keypad or associated with a word to be uttered by the user (in voice-enabled IVRs), and the user will make a desired selection by pushing the appropriate button or uttering the proper word.
By way of example, a typical banking ATM transaction allows a customer to perform money transfers between savings, checking and credit card accounts and to check account balances using IVR over the telephone, wherein information is presented via audio menus. With the IVR application, a menu can be played to the user over the telephone, whereby the menu messages are followed by the number or button the user should press to select the desired option:
a. "for instant account information, press one;"
b. "for transfer and money payment, press two;"
c. "for fund information, press three;"
d. "for check information, press four;"
e. "for stock quotes, press five;"
f. "for help, press seven;" etc.
To continue, the user may be prompted to provide identification information. Over the telephone, the IVR system may play back an audio prompt requesting the user to enter his/her account number (via DTMF or speech), and the information is received from the user by processing the DTMF signaling or recognizing the speech. The user may then be prompted to input his/her SSN and the reply is processed in a similar way. When the processing is complete, the information is sent to a server, wherein the account information is accessed, formatted for audio replay, and then played back to the user over the telephone.
An IVR system may implement speech recognition in lieu of, or in addition to, DTMF keys. Conventional IVR applications use specialized telephony hardware, and they use different software layers for accessing legacy database servers. These layers must be specifically designed for each application. Typically, IVR application developers offer their own proprietary speech engines and APIs (application program interfaces). The dialog development requires complex scripting and expert programmers, and these proprietary applications are typically not portable from vendor to vendor (i.e., each application is painstakingly crafted and designed for specific business logic). Conventional IVR applications are typically written in specialized script languages that are offered by manufacturers in various incarnations and for different hardware platforms. The development and maintenance of such IVR applications requires qualified staff. Thus, current telephony systems typically do not provide interoperability, i.e., the ability of software and hardware on multiple machines from multiple vendors to communicate meaningfully.
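By way of illustration only, the directed dialog described above may be sketched as follows. The sketch assumes a key-to-option table taken from the example prompts and a "#" key that terminates digit entry; the function names and the terminator are hypothetical and are not part of any particular IVR product.

```python
from typing import Iterable, Optional

# Key-to-option table mirroring the example prompts (the example skips key 6).
MENU = {
    "1": "instant account information",
    "2": "transfer and money payment",
    "3": "fund information",
    "4": "check information",
    "5": "stock quotes",
    "7": "help",
}


def select_option(key: str) -> Optional[str]:
    """Map one DTMF key press to a menu option; None means the menu is replayed."""
    return MENU.get(key)


def collect_digits(keys: Iterable[str], terminator: str = "#") -> str:
    """Accumulate DTMF digits (e.g., an account number) until the terminator key.

    The "#" terminator is an assumption of this sketch, not taken from the text.
    Non-digit keys other than the terminator are ignored.
    """
    digits = []
    for key in keys:
        if key == terminator:
            break
        if key.isdigit():
            digits.append(key)
    return "".join(digits)
```

A voice-enabled IVR would replace the key lookup with the output of a speech recognizer, while the dialog structure would remain the same.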
VoiceXML is a markup language that has been designed to facilitate the creation of speech applications such as IVR applications. Compared to conventional IVR programming frameworks that employ proprietary scripts and programming languages over proprietary/closed platforms, the VoiceXML standard provides a declarative programming framework based on XML (eXtensible Markup Language) and ECMAScript (see, e.g., the W3C XML specifications (www.w3.org/XML) and VoiceXML forum (www.voicexml.org)). VoiceXML is designed to run on web-like infrastructures of web servers and web application servers (i.e. the Voice browser). VoiceXML allows information to be accessed by voice through a regular phone or a mobile phone whenever it is difficult or not optimal to interact through a wireless GUI micro-browser.
More importantly, VoiceXML is a key component to building multi-modal systems such as multi-modal and conversational user interfaces or mobile multi-modal browsers. Multi-modal solutions exploit the fact that different interaction modes are more efficient for different user interactions. For example, depending on the interaction, talking may be easier than typing, whereas reading may be faster than listening. Multi-modal interfaces combine the use of multiple interaction modes, such as voice, keypad and display to improve the user interface to e-business. Advantageously, multi-modal browsers can rely on VoiceXML browsers and authoring to describe and render the voice interface.
There are still key inhibitors to the deployment of compelling multi-modal applications. Most arise out of the current infrastructure and device platforms. Indeed, the current networking infrastructure is not configured for providing seamless, multi-modal access to information. Although a plethora of information can be accessed from servers over a communications network using an access device (e.g., personal information and corporate information available on private networks and public information accessible via a global computer network such as the Internet), the availability of such information may be limited by the modality of the client/access device or the platform-specific software applications with which the user is interacting to obtain such information. For instance, current wireless network infrastructure and handsets do not provide simultaneous voice and data access. Middleware, interfaces and protocols are needed to synchronize and manage the different channels. In light of the ubiquity of IP-based networks such as the Internet, and the availability of a plethora of services and resources on the Internet, the advantages of open and interoperable telephony systems are particularly compelling for voice processing applications such as IP telephony systems and IVR.
Another hurdle is that development of multi-modal/conversational applications using current technologies requires not only knowledge of the goal of the application and how the interaction with the users should be defined, but also knowledge of a wide variety of other interfaces and modules external to the application at hand, such as (i) connection to input and output devices (telephone interfaces, microphones, web browsers, palm pilot display); (ii) connection to a variety of engines (speech recognition, natural language understanding, speech synthesis and possibly language generation); (iii) resource and network management; and (iv) synchronization between various modalities for multi-modal or conversational applications.
Accordingly, there is a strong desire for development of distributed conversational systems having scalable and flexible architectures, which enable implementation of such systems over a wide range of application environments and voice processing platforms.
In one preferred embodiment, a SERCP framework, which is used for speech engine remote control and network and system load management, is implemented using an XML-based web service framework wherein speech engines and resources comprise programmable services, wherein (i) XML is used to represent data (and XML Schemas to describe data types); (ii) an extensible messaging format is based on SOAP; (iii) an extensible service description language is based on WSDL, or an extension thereof, as a mechanism to describe the commands/interface supported by a given service; (iv) UDDI (Universal Description, Discovery, and Integration) is used to advertise and locate the service; and wherein (v) WSFL (Web Service Flow Language) is used to provide a generic mechanism for combining speech processing services through flow composition.
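By way of illustration only, a control message of the kind contemplated by items (i) and (ii) may be sketched as an XML command wrapped in a SOAP envelope. The sketch assumes the SOAP 1.1 envelope namespace; the urn:example:sercp namespace, the command name and the parameter names are hypothetical and are not defined by SERCP or by the present description.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERCP_NS = "urn:example:sercp"  # hypothetical namespace, for illustration only


def _local(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on qualified tags."""
    return tag.split("}", 1)[1]


def build_control_message(command: str, params: dict) -> bytes:
    """Wrap a hypothetical engine command and its parameters in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    cmd = ET.SubElement(body, f"{{{SERCP_NS}}}{command}")
    for name, value in params.items():
        ET.SubElement(cmd, f"{{{SERCP_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def parse_control_message(data: bytes):
    """Recover (command, params) from a message produced by build_control_message."""
    envelope = ET.fromstring(data)
    body = envelope.find(f"{{{SOAP_NS}}}Body")
    cmd = body[0]  # the single command element inside the SOAP body
    return _local(cmd.tag), {_local(child.tag): child.text for child in cmd}
```

In such a sketch the application would send the serialized message to an engine over any transport, and all parameter values arrive at the engine as text, to be typed according to the applicable XML Schema.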
A conversational system according to an embodiment of the present invention assumes an application environment in which a conversational application comprises a collection of audio processing engines (e.g., audio I/O system, speech processing engines, etc.) that are dynamically associated with an application, wherein the exchange of audio between the audio processing engines is decoupled from control and application level exchanges and wherein the application generates control messages that configure and control the audio processing engines in a manner that renders the exchange of control messages independent of the application model and location of the engines. The speech processing engines can be dynamically allocated to the application on either a call, session, utterance or persistent basis.
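By way of illustration only, allocation of engines on a call, session, utterance or persistent basis may be sketched as a pool that records the basis on which each engine is held and releases it when that scope ends. The class and method names are hypothetical, and the sketch assumes that engines are identified by simple string names.

```python
from enum import Enum


class Scope(Enum):
    """Basis on which an engine is held, following the four cases named above."""
    UTTERANCE = "utterance"
    CALL = "call"
    SESSION = "session"
    PERSISTENT = "persistent"


class EnginePool:
    """Hypothetical pool that tracks which application holds which engine."""

    def __init__(self, engines):
        self.free = list(engines)
        self.held = {}  # engine name -> (application, scope)

    def allocate(self, app, scope):
        """Assign the first free engine to the application for the given scope."""
        if not self.free:
            raise RuntimeError("no engine available")
        engine = self.free.pop(0)
        self.held[engine] = (app, scope)
        return engine

    def end_of(self, app, scope):
        """Release every engine the application holds for the scope that just ended."""
        for engine, (owner, held_scope) in list(self.held.items()):
            if owner == app and held_scope is scope:
                del self.held[engine]
                self.free.append(engine)
```

In this sketch an engine held on a persistent basis is simply never passed to end_of by the application, so it remains associated with the application across calls and sessions.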
Preferably, the audio processing engines comprise web services that are described and accessed using WSDL (Web Services Description Language), or an extension thereof.
In yet another aspect, a conversational system comprises a task manager, which is used to abstract from the application the discovery of the audio processing engines and the remote control of the engines.
The systems and methods described herein may be used in various frameworks. One framework comprises a terminal-based application (located on the client or local to the audio subsystem) that remotely controls speech engine resources. One example of a terminal-based application includes a wireless handset-based application that uses remote speech engines, e.g., a multimodal application in a "fat client configuration" with a voice browser embedded on the client that uses remote speech engines. Another example of a terminal-based application comprises a voice application that operates on a client having local embedded engines that are used for some speech processing tasks, and wherein the voice application uses remote speech engines when (i) the task is too complex for the local engine, (ii) the task requires a specialized engine, (iii) it would not be possible to download speech data files (grammars, etc.) without introducing significant delays, or (iv) for IP, security or privacy reasons, it would not be appropriate to download such data files on the client or to perform the processing on the client or to send results from the client.
Another usage framework for the invention is to enable an application located in a network to remotely control different speech engines located in the network. For example, the invention may be used to (i) distribute the processing and perform load balancing, (ii) allow the use of engines optimized for specific tasks, and/or (iii) enable access and control of third party services specialized in providing speech engine capabilities.
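By way of illustration only, the four conditions under which a client with local embedded engines would use a remote speech engine may be sketched as a single predicate. The field names, the numeric complexity score and the default thresholds are hypothetical and serve only to make the conditions concrete.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical description of one speech processing task on the client."""
    complexity: int               # assumed score; higher means harder for the local engine
    needs_specialized: bool       # (ii) task requires a specialized engine
    data_download_seconds: float  # (iii) time to fetch grammars and other data files
    data_restricted: bool         # (iv) IP, security or privacy reasons forbid client-side use


def use_remote_engine(task: Task,
                      local_max_complexity: int = 5,
                      max_delay_seconds: float = 1.0) -> bool:
    """Return True when any of conditions (i) through (iv) applies."""
    return (
        task.complexity > local_max_complexity            # (i) too complex locally
        or task.needs_specialized                         # (ii)
        or task.data_download_seconds > max_delay_seconds # (iii) download too slow
        or task.data_restricted                           # (iv)
    )
```

A task that satisfies none of the conditions would be processed by the local embedded engine.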
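By way of illustration only, uses (i) and (ii) may be sketched as a selection step that filters the available engines by the capability required for a task and then chooses the least-loaded candidate. The Engine record, the capability labels and the least-loaded policy are assumptions of this sketch; the description above does not prescribe any particular load balancing policy.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    """Hypothetical record for one network-resident speech engine."""
    name: str
    capability: str          # e.g., "asr" or "tts"; labels are illustrative
    active_requests: int = 0


def pick_engine(engines, capability):
    """Choose the least-loaded engine offering the capability and count the new request."""
    candidates = [e for e in engines if e.capability == capability]
    if not candidates:
        raise LookupError(f"no engine offers {capability!r}")
    # min() returns the first of several equally loaded candidates.
    chosen = min(candidates, key=lambda e: e.active_requests)
    chosen.active_requests += 1
    return chosen
```

Third party services of the kind noted in use (iii) could be entered in the same list as additional Engine records.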
These and other aspects, features, and advantages of the present invention will become apparent from the following detailed description of the preferred embodiments, which is to be read in connection with the accompanying drawings.