A speech application is one of the most challenging applications to develop, deploy and maintain in a communications (typically telephony) environment. Expertise required for developing and deploying a viable application includes expertise in computer telephony integration (CTI) hardware and software, voice recognition software, text-to-speech software, and speech application logic.
With the relatively recent advent of voice extensible markup language (VXML), the expertise required to develop a speech solution has been reduced somewhat. VXML is a language that enables a software developer to focus on the application logic of the voice application without being required to configure underlying telephony components. Typically, the developed voice application is run on a VXML interpreter that resides on and executes on the associated telephony system to deliver the solution.
As is shown in FIG. 1A (prior art), a typical architecture of a VXML-compliant telephony system comprises a voice application server (110) and a VXML-compliant telephony server (130). Typical steps for development and deployment of a VXML-enabled IVR solution are briefly described below using the elements of FIG. 1A.
Firstly, a new application database (113) is created or an existing one is modified to support VXML. Application logic 112 is designed in terms of workflow and adapted to handle the routing operations of the IVR system. VXML pages, which are results of functioning application logic, are rendered by a VXML rendering engine (111) based on a specified generation sequence.
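By way of illustration only, the rendering step described above may be sketched in Python as follows. The function name render_prompt_page, the field name, and the URL are hypothetical and do not appear in FIG. 1A; only the element names (vxml, form, field, prompt, filled, submit) are taken from the W3C VoiceXML 2.0 standard. The page is skeletal and omits the grammar a real field would need.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"


def render_prompt_page(field_name: str, prompt_text: str, next_url: str) -> str:
    """Render one dialog step of the application logic as a minimal VXML page."""
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": "step"})
    field = ET.SubElement(form, "field", {"name": field_name})
    prompt = ET.SubElement(field, "prompt")
    prompt.text = prompt_text
    # When the caller fills the field, hand the value back to the application server.
    filled = ET.SubElement(field, "filled")
    ET.SubElement(filled, "submit", {"next": next_url, "namelist": field_name})
    return ET.tostring(vxml, encoding="unicode")


# Hypothetical workflow step: ask for an account number.
page = render_prompt_page(
    "account", "Please say your account number.", "http://example.invalid/next"
)
```

In such a sketch the application logic decides which step comes next, and the rendering engine is responsible only for expressing that step as markup.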
Secondly, an object facade to server 130 is created comprising the corresponding VXML pages and is sent to server 130 over a network (120), which can be the Internet, an Intranet, or an Ethernet network. The VXML pages are integrated into rendering engine 111 such that they can be displayed according to set workflow at server 110.
Thirdly, the VXML-telephony server 130 is configured to enable proper retrieval of specific VXML pages from rendering engine 111 within server 110. A triggering mechanism is provided to server 110 so that when a triggering event occurs, an appropriate outbound call is placed from server 110.
A VXML interpreter (131), a voice recognition text-to-speech engine (132), and the telephony hardware/software (133) are provided within server 130 and comprise server function. In prior art, the telephony hardware/software 133 along with the VXML interpreter 131 are packaged as an off-the-shelf IVR-enabling technology. Arguably the most important feature, however, of the entire system is the application server 110. The application logic (112) is typically written in a programming language such as Java and packaged as an enterprise Java Bean archive. The presentation logic required is handled by rendering engine 111 and is written in JSP or PERL.
An enhanced voice application system is known to the inventor and disclosed in the U.S. patent application entitled “Method and Apparatus for Development and Deployment of a Voice Software Application for Distribution to one or more Application Consumers” to which this application claims priority. That system uses a voice application server that is connected to a data network for storing and serving voice applications. The voice application server has a data connection to a network communications server connected to a communications network such as the well-known PSTN network. The communication server routes the created voice applications to their intended recipients.
A computer station is provided as part of the system and is connected to the data network and has access to the voice application server. A client software application is hosted on the computer station for the purpose of enabling users to create applications and manage their states. In this system, the user operates the client software hosted on the computer station in order to create voice applications through object modeling and linking. The applications, once created, are then stored in the application server for deployment. The user can control and manage deployment and state of deployed applications including scheduled deployment and repeat deployments in terms of intended recipients.
In one embodiment, the system is adapted for developing and deploying a voice application using Web-based data as source data over a communications network to one or more recipients. The enhanced system has a voice application server capable through software and network connection of accessing a network server and Web site hosted therein and for pulling data from the site. The computer station running a voice application software has control access to at least the voice application server and is also capable of accessing the network server and Web site. An operator of the computer station creates and provides templates for the voice application server to use in data-to-voice rendering. In this aspect, Web data can be harvested from a Web-based data source and converted to voice for delivery as dialogue in a voice application.
In another embodiment, a method is available in the system described above for organizing, editing, and prioritizing the Web-based data before dialog creation is performed. The method includes harvesting the Web-based data source in the form of its original structure; generating an object tree representing the logical structure and content type of the harvested, Web-based data source; manipulating the object tree generated to a desired hierarchical structure and content; creating a voice application template in VXML and populating the template with the manipulated object tree; and creating a voice application capable of accessing the Web-based data source according to the constraints of the template. The method streamlines deployment and execution of the voice application and simplifies its development process.
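The steps of the method described above may be sketched, under simplifying assumptions, as follows. The class and function names (Node, TreeBuilder, select, populate_template) and the sample markup are hypothetical. The sketch assumes well-formed markup, and a simple flattening by tag stands in for the manipulation step.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from xml.sax.saxutils import escape


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


class TreeBuilder(HTMLParser):
    """Build an object tree mirroring the structure of the harvested markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self._stack[-1].children.append(node)
        self._stack.append(node)

    def handle_endtag(self, tag):
        if len(self._stack) > 1:  # assumes every start tag has a matching end tag
            self._stack.pop()

    def handle_data(self, data):
        self._stack[-1].text += data.strip()


def select(node, wanted):
    """Keep, in document order, the text of nodes whose tag is in `wanted`."""
    found = [node.text] if node.tag in wanted and node.text else []
    for child in node.children:
        found.extend(select(child, wanted))
    return found


def populate_template(items):
    """Populate a skeletal VXML template with the selected content."""
    prompts = "".join(f"<prompt>{escape(item)}</prompt>" for item in items)
    return f'<vxml version="2.0"><form><block>{prompts}</block></form></vxml>'


# Hypothetical harvested page: the advertisement paragraph is pruned.
source = "<html><body><h1>Balance</h1><p>Ad banner</p><h2>Checking: 120 dollars</h2></body></html>"
builder = TreeBuilder()
builder.feed(source)
items = select(builder.root, {"h1", "h2"})
page = populate_template(items)
```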
A security regimen is provided for the above-described system. The regimen provides transaction security between a Web server and data and a voice portal system accessible through a telephony network on the user end and through an XML gateway on the data source end. The regimen includes one of a private connection, a virtual private network, or a secure socket layer, set up between the Web server and the voice portal system through the XML gateway. Transactions carried on between the portal and the server or servers enjoy the same security that is available between secure nodes on the data network. In one embodiment, the regimen further includes a voice translation system distributed at the outlet of the portal and at the telephone of the end user, wherein the voice dialog is translated to an obscure language that is not the user's language and then retranslated to the user's language at the telephone of the user.
In such a system, where templates are used to enable voice application dialog transactions, voice application rules and voice recognition data are consulted for the appropriate content interpretation and response protocol so that the synthesized voice presented as response dialog through the voice portal to the user is both appropriate in content and, ideally, error-free in expression. The database is therefore optimized with vocabulary words that enable a very wide range of speech covering many different vocabulary words akin to many differing business scenarios.
According to yet another aspect of the invention, vocabulary recognition is tailored for active voice applications according to client parameters. This is accomplished through a vocabulary management system adapted to constrain voice recognition processing associated with text-to-speech and speech-to-text rendering associated with use of an active voice application in progress between a user accessing a data source through a voice portal. The enhancement includes a vocabulary management server connected to a voice application server and to a telephony server, and an instance of vocabulary management software running on the management server for enabling vocabulary establishment and management for voice recognition software. In practice of the enhanced vocabulary management capability, an administrator accessing the vocabulary management server uses the vocabulary management software to create unique vocabulary sets or lists that are specific to selected portions of vocabulary associated with target data sources, the vocabulary sets differing in content according to administrator direction.
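As a minimal sketch of how such administrator-defined vocabulary sets might constrain recognition, consider the following. The set names, the words in each set, and the function constrain are hypothetical, and a real recognizer would apply the constraint inside its search rather than as a filter on finished hypotheses.

```python
# Hypothetical vocabulary sets, one per target data source.
VOCABULARY_SETS = {
    "banking": {"balance", "transfer", "checking", "savings"},
    "travel": {"flight", "departure", "arrival", "gate"},
}


def constrain(hypotheses, source):
    """Keep only recognizer hypotheses whose words all belong to the active set."""
    active = VOCABULARY_SETS[source]
    return [h for h in hypotheses if all(w in active for w in h.lower().split())]


# "check in balance" is dropped because "check" and "in" are outside the banking set.
kept = constrain(["checking balance", "check in balance", "savings"], "banking")
```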
It will be appreciated by one with skill in the art of voice application deployment architecture that many users vying to connect and interact with a voice portal may in some cases create a bottleneck wherein data lines connecting voice application components to Web-sources and other data sources become taxed to their capacities. This problem may occur especially at peak use periods as is common for many normal telephony environments. It has occurred to the inventor that still more streamlining in terms of traffic optimization is required to alleviate potential line-use issues described above.
A particular enhancement to the voice application distribution system known to the inventor addresses the traffic challenges described in the above paragraph. Application logics are provided for determining which portions (dialogs) of a voice application for deployment are cached at an application-receiving end system based on static and dynamic rules and in some cases (dynamic caching), statistical analysis results are used in the determination. The application logic utilizes a processor for processing the voice application according to sequential dialog files and rules of the application. Logic components include a static content optimizer connected to the processor for identifying files containing static content; and a dynamic content optimizer connected to the processor for identifying files containing dynamic content. The optimizers determine which files should be cached at which end-system facilities, tag the files accordingly, and prepare those files for distribution to selected end-system cache facilities for local retrieval during consumer interaction with the deployed application.
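The tagging decision described above can be illustrated with the following sketch. The Dialog record, the threshold values, and the tag names "cache" and "fetch" are assumptions made for illustration; the statistical analysis is reduced here to an observed change rate.

```python
from dataclasses import dataclass


@dataclass
class Dialog:
    name: str
    is_static: bool   # content never changes between calls
    hits: int = 0     # observed requests (input to the dynamic rule)
    changes: int = 0  # observed content changes across those requests


def tag_for_cache(dialogs, min_hits=100, max_change_rate=0.05):
    """Tag each dialog file for local caching or for retrieval from the server."""
    tags = {}
    for d in dialogs:
        if d.is_static:
            # Static rule: static content is always cached at the end system.
            tags[d.name] = "cache"
        elif d.hits >= min_hits and d.changes / d.hits <= max_change_rate:
            # Dynamic rule: popular content that rarely changes is also cached.
            tags[d.name] = "cache"
        else:
            tags[d.name] = "fetch"
    return tags


tags = tag_for_cache([
    Dialog("greeting", is_static=True),
    Dialog("rates", is_static=False, hits=1000, changes=10),
    Dialog("balance", is_static=False, hits=500, changes=450),
    Dialog("new_offer", is_static=False),
])
```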
Being able to retrieve dialog portions of a voice application from a local cache facility improves response time at the voice portal by decreasing the load on the network connection to the voice application server. However, in addition to reduced traffic requirements, it is also important that text-to-speech and speech-to-text renderings are clear and accurate. Accuracy of synthesized speech delivered to a caller is key to creating a successful voice application that can be interacted with in a dynamic fashion at both ends.
As voice application distribution architectures expand to cross regional boundaries and even cultural boundaries, the prospect of standardizing speech recognition rules dealing with terms and phrases that are commonly spoken becomes increasingly difficult. For example, pronunciations of certain terms in the same language will vary significantly according to region. Common labels, such as the way major roads and highways are written and spoken, can also vary significantly. There are many examples of phrase and term variations that need to be addressed if voice application interaction is practiced on larger architectures spanning large geographic regions.
In yet another system enhancement known to the inventor, text-to-speech preprocessing is used to render synthesized voice that is somewhat personalized to a caller according to pre-set constraints. The enhanced system is capable of preprocessing text strings for VXML view generation and subsequent voice rendering. The system has a text-to-speech preprocessing logic and a software table accessible to the preprocessing logic, the table adapted to serve text dialog options related to one or more text entities.
A rules base is provided and accessible to the preprocessing logic. The rules base is adapted to serve dialog selection constraints used to match specific dialog portions that are then used to annotate a text string. Dialog options and text entities are stored in an accessible data store. In a preferred embodiment, the preprocessing logic accesses the software table during client interaction with a deployed voice application, selects a specific dialog option from more than one dialog option related to a single text entity, and inserts the selected option into the VXML page rendering process. The selection is made according to return of one or more of the served constraints.
While the enhanced system provides personalization of voice dialog to specific groups of callers depending upon pre-set constraints, which may cover a wide variety of industry-specific, social, geographic, and cultural considerations, the system is still largely robotic and does not respond to individual attitudes and behaviors. It has occurred to the inventor that instant attitudes, moods, and behaviors of callers interacting with a voice application, if understood at the time of interaction, could be leveraged to increase customer satisfaction, enterprise sales figures, and efficiency of the interaction process in general.
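The selection of a dialog option for a text entity may be sketched as follows. The table contents, the rule format, and the spoken forms are hypothetical examples chosen for illustration and are not taken from the system itself.

```python
# Software table (hypothetical): text entity -> named dialog options.
DIALOG_OPTIONS = {
    "HWY 101": {
        "default": "Highway one oh one",
        "los_angeles": "the one oh one",
    },
}

# Rules base (hypothetical): (constraint key, required value, option to select).
RULES = [
    ("region", "los_angeles", "los_angeles"),
]


def select_option(entity, caller):
    """Pick the dialog option for an entity according to the served constraints."""
    options = DIALOG_OPTIONS[entity]
    for key, value, option in RULES:
        if caller.get(key) == value and option in options:
            return options[option]
    return options["default"]


def preprocess(text, caller):
    """Replace each known text entity before the string is rendered to VXML."""
    for entity in DIALOG_OPTIONS:
        text = text.replace(entity, select_option(entity, caller))
    return text


regional = preprocess("Take HWY 101 north.", {"region": "los_angeles"})
generic = preprocess("Take HWY 101 north.", {})
```

The same written entity thus yields different spoken text for different caller groups, while the application logic that produced the original string is unchanged.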
A behavioral adaptation engine is known to the inventor and is integrated with a voice application creation and deployment system. The adaptation engine has at least one data input port for receiving XML-based client interaction data including audio files attached to the data; at least one data port for sending data to and receiving data from external data systems and modules; a logic processing component including an XML reader, voice player, and analyzer for processing received data; and a decision logic component for processing result data against one or more constraints. The engine intercepts client data including dialog from client interaction with a served voice application in real time and processes the received data for behavioral patterns and if attached, voice characteristics of the audio files whereupon the engine according to the results and one or more valid constraints identifies one or a set of possible enterprise responses for return to the client during interaction.
The enhanced system described in the above paragraph can dynamically select responses based on detection of a particular mood state and re-arrange a menu or response-options accordingly. The behavioral adaptation engine has the capability of determining what appropriate response dialog from a pool of possible dialogs will be executed during a session based on voice and selection analysis performed by the client during the session.
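A toy sketch of the decision logic described above follows. The interaction fields (barge_ins, volume), the thresholds, the mood labels, and the menu entries are all assumptions; the actual analysis performed by the engine is not specified here and would be considerably richer.

```python
def detect_mood(interaction):
    """Toy heuristic standing in for the engine's behavioral and voice analysis."""
    if interaction.get("barge_ins", 0) >= 2 or interaction.get("volume", 0.0) > 0.8:
        return "frustrated"
    return "neutral"


def arrange_menu(menu, mood, escalation="speak to an agent"):
    """Re-arrange response options according to the detected mood state."""
    if mood == "frustrated" and escalation in menu:
        # Offer the escalation option first to a caller who appears frustrated.
        return [escalation] + [m for m in menu if m != escalation]
    return list(menu)


menu = ["check balance", "transfer funds", "speak to an agent"]
calm_menu = arrange_menu(menu, detect_mood({"barge_ins": 0, "volume": 0.4}))
upset_menu = arrange_menu(menu, detect_mood({"barge_ins": 3, "volume": 0.5}))
```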
In addition to the much-enhanced voice application system known to the inventor, there are several prior-art VXML-compliant voice application deployment systems that use various proprietary grammar mark-up languages or script languages for creating voice applications that can be used only with certain voice systems, which may then render the script as standard VXML or CCXML and distribute it to portal (client access) systems. A proprietary grammar language may be used on the application side (application language) as input to a VXML rendering engine wherein the output is the W3C standard VXML, which is useable at the interaction point of the caller. Caller responses then may be transported as VXML back to the proprietary system and translated back into the application language for interpretation and dialog service at the application server site.
To give one example of the above-described interactivity, a speech synthesis engine capable of text-to-speech and speech-to-text conversion owned by Nuance™ Corporation is integrated into a voice application deployment system known as the Tellme™ system. Nuance™ provides a proprietary scripting language known as Grammar Specification Language (GSL) for creating voice applications. The GSL is converted to VXML that is interpreted by the speech synthesis engine interacting with the caller at a VXML-enabled Web-based portal or telephony IVR system.
More recently, the World Wide Web Consortium (W3C), referenced herein by the address http://www.w3c.org/, has been developing a grammar extensible mark-up language (GRXML) that can be used with Nuance™, SpeechWorks™, and other speech engine technologies that support VXML and, in some cases, CCXML, the latter of which provides more integrated telephony call-control functionality, such as outbound calling, than is supported by VXML.
FIG. 33 illustrates an overview 3300 of a prior-art relationship between various script languages input into different core VXML rendering engines. GSL 3301 is used, as was described above, as input into a Nuance™ engine 3302. A GRXML language 3303 is supported by the Nuance™ engine and by a SpeechWorks™ engine 3304. Other existing or newly developed XML-based script languages 3306 are used as input into other proprietary engines 3305, and so on.
Although there is some interoperability using a semi-standard like GRXML with respect to different application languages used by proprietary VXML-compliant systems, GRXML is not useable in many systems. GRXML may only be compatible with the larger and most popular systems that are widely recognized. A customer site might have more than one proprietary system deployed wherein GRXML is not supported, and might have to move from one system to another during interaction. An example would be that of an enterprise contracting with more than one speech application deployment service and architecture. In this respect there would be some difficulty in that new scripts would have to be written that support the particular engine the customer is using.
In addition to the above-described problem, there are still many limitations apparent with client-to-system voice application-driven sessions. Voice synthesis tends to rely on single speech components representing parts of a subject matter that must be communicated in order to complete a transaction. For example, city, state, and country represent three components of location information that must be provided in order to complete some transactions. Prior-art application systems typically deal with these components separately by using three separate prompt/response actions. While the behavioral adaptation engine known to the inventor may offer some streamlining by allowing a client to skip certain standard prompts of a voice application, more enhancement is required to further streamline interaction between clients and a voice application.
What is clearly needed are methods and apparatus for enabling inference of client objectives when interacting with a voice application, and a platform-independent and system-independent script language that can bridge multiple end systems to an application server system. A system of such capability could reduce or eliminate the above-stated limitations.