A speech application is one of the most challenging applications to develop, deploy and maintain in a communications (typically telephony) environment. Expertise required for developing and deploying a viable application includes expertise in computer telephony integration (CTI) hardware and software, voice recognition software, text-to-speech software, and speech application logic.
With the relatively recent advent of voice extensible markup language (VXML), the expertise required to develop a speech solution has been reduced somewhat. VXML is a language that enables a software developer to focus on the application logic of the voice application without being required to configure underlying telephony components. Typically, the developed voice application is run on a VXML interpreter that resides on and executes on the associated telephony system to deliver the solution.
As is shown in FIG. 1A (prior art), a typical architecture of a VXML-compliant telephony system comprises a voice application server (110) and a VXML-compliant telephony server (130). Typical steps for development and deployment of a VXML-enabled IVR solution are briefly described below using the elements of FIG. 1A.
Firstly, a new application database (113) is created or an existing one is modified to support VXML. Application logic 112 is designed in terms of workflow and adapted to handle the routing operations of the IVR system. VXML pages, which are results of functioning application logic, are rendered by a VXML rendering engine (111) based on a specified generation sequence.
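As an illustration of what a rendering engine produces from application logic, the following is a minimal sketch in Python, not the rendering engine of FIG. 1A. The function name `render_vxml_menu`, the prompt text, and the page names are hypothetical, and the generated page is simplified (for example, it omits the XML namespace a full VXML document would carry).

```python
import xml.etree.ElementTree as ET


def render_vxml_menu(prompt, choices):
    """Render a simplified VXML menu page.

    `choices` maps a spoken option to the page that handles it.
    All names here are illustrative assumptions.
    """
    vxml = ET.Element("vxml", version="2.0")
    menu = ET.SubElement(vxml, "menu")
    ET.SubElement(menu, "prompt").text = prompt
    for spoken, next_page in choices.items():
        # Each <choice> routes the caller to the next dialog page.
        choice = ET.SubElement(menu, "choice", next=next_page)
        choice.text = spoken
    return ET.tostring(vxml, encoding="unicode")


page = render_vxml_menu(
    "Say balance or transfer.",
    {"balance": "balance.vxml", "transfer": "transfer.vxml"},
)
print(page)
```

In this sketch the workflow decides which options exist, and the renderer only turns that decision into markup for the interpreter to fetch.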
Secondly, an object facade to server 130 is created comprising the corresponding VXML pages and is sent to server 130 over a network (120), which can be the Internet, an Intranet, or an Ethernet network. The VXML pages are integrated into rendering engine 111 such that they can be displayed according to set workflow at server 110.
Thirdly, the VXML-telephony server 130 is configured to enable proper retrieval of specific VXML pages from rendering engine 111 within server 110. A triggering mechanism is provided to server 110 so that when a triggering event occurs, an appropriate outbound call is placed from server 110.
A VXML interpreter (131), a voice recognition text-to-speech engine (132), and the telephony hardware/software (133) are provided within server 130 and comprise server function. In prior art, the telephony hardware/software 133 along with the VXML interpreter 131 are packaged as an off-the-shelf IVR-enabling technology. Arguably the most important feature of the entire system, however, is the application server 110. The application logic (112) is typically written in a programming language such as Java and packaged as an Enterprise JavaBean archive. The presentation logic required is handled by rendering engine 111 and is written in JSP or Perl.
An enhanced voice application system is known to the inventor and disclosed in the U.S. patent application entitled “Method and Apparatus for Development and Deployment of a Voice Software Application for Distribution to one or more Application Consumers” to which this application claims priority. That system uses a voice application server that is connected to a data network for storing and serving voice applications. The voice application server has a data connection to a network communications server connected to a communications network such as the well-known PSTN network. The communication server routes the created voice applications to their intended recipients.
A computer station is provided as part of the system and is connected to the data network and has access to the voice application server. A client software application is hosted on the computer station for the purpose of enabling users to create applications and manage their states. In this system, the user operates the client software hosted on the computer station in order to create voice applications through object modeling and linking. The applications, once created, are then stored in the application server for deployment. The user can control and manage deployment and state of deployed applications including scheduled deployment and repeat deployments in terms of intended recipients.
In one embodiment, the system is adapted for developing and deploying a voice application using Web-based data as source data over a communications network to one or more recipients. The enhanced system has a voice application server capable, through software and network connection, of accessing a network server and Web site hosted therein and of pulling data from the site. The computer station running voice application software has control access to at least the voice application server and is also capable of accessing the network server and Web site. An operator of the computer station creates and provides templates for the voice application server to use in data-to-voice rendering. In this aspect, Web data can be harvested from a Web-based data source and converted to voice for delivery as dialogue in a voice application.
In another embodiment, a method is available in the system described above for organizing, editing, and prioritizing the Web-based data before dialog creation is performed. The method includes harvesting the Web-based data source in the form of its original structure; generating an object tree representing the logical structure and content type of the harvested, Web-based data source; manipulating the object tree generated to a desired hierarchical structure and content; creating a voice application template in VXML and populating the template with the manipulated object tree; and creating a voice application capable of accessing the Web-based data source according to the constraints of the template. The method streamlines voice application deployment and executed state and simplifies the development process of the voice application.
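The harvest, object-tree, manipulation, and template-population steps above can be sketched roughly as follows. This is only an illustrative Python sketch under stated assumptions: the class and function names (`Node`, `TreeBuilder`, `harvest`, `prioritize`, `populate_template`) and the sample page are hypothetical, and the tree manipulation is reduced to filtering and reordering by content type.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class Node:
    """One element of the object tree: tag (content type), text, children."""
    def __init__(self, tag):
        self.tag, self.text, self.children = tag, "", []


class TreeBuilder(HTMLParser):
    """Build an object tree that mirrors the structure of the harvested page."""
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def harvest(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def prioritize(root, order):
    """Keep only nodes whose tag is in `order`, sorted by that priority."""
    found = []

    def walk(node):
        if node.tag in order and node.text:
            found.append(node)
        for child in node.children:
            walk(child)

    walk(root)
    return sorted(found, key=lambda n: order.index(n.tag))


def populate_template(nodes):
    """Fill a simplified VXML form with one prompt per retained node."""
    vxml = ET.Element("vxml", version="2.0")
    form = ET.SubElement(vxml, "form")
    for node in nodes:
        block = ET.SubElement(form, "block")
        ET.SubElement(block, "prompt").text = node.text
    return ET.tostring(vxml, encoding="unicode")


# Hypothetical harvested page; the <div> content is dropped, the heading is spoken first.
HTML = ("<html><body><div>ad banner</div>"
        "<p>Your balance is 42 dollars.</p><h1>Account summary</h1></body></html>")
nodes = prioritize(harvest(HTML), order=["h1", "p"])
dialog = populate_template(nodes)
print(dialog)
```

Separating the tree manipulation from the template population is what would let an operator reorder or prune content without touching the VXML template itself.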
A security regimen is provided for the above-described system. The protocol provides transaction security between a Web server and data and a voice portal system accessible through a telephony network on the user end and through an XML gateway on the data source end. The regimen includes one of a private connection, a virtual private network, or a secure socket layer, set up between the Web server and the Voice Portal system through the XML gateway. Transactions carried on between the portal and the server or servers enjoy the same security that is available between secure nodes on the data network. In one embodiment, the regimen further includes a voice translation system distributed at the outlet of the portal and at the telephone of the end user, wherein the voice dialog is translated to an obscure language other than the user's language and then retranslated to the user's language at the telephone of the user.
In such a system, where templates are used to enable voice application dialog transactions, voice application rules and voice recognition data are consulted for the appropriate content interpretation and response protocol, so that the synthesized voice presented as response dialog through the voice portal to the user is both appropriate in content and, ideally, error free in expression. The database is therefore optimized with vocabulary words that enable a very wide range of speech covering many different vocabulary words akin to many differing business scenarios.
According to yet another aspect of the invention, vocabulary recognition is tailored for active voice applications according to client parameters. This is accomplished through a vocabulary management system adapted to constrain voice recognition processing associated with text-to-speech and speech-to-text rendering associated with use of an active voice application in progress between a user accessing a data source through a voice portal. The enhancement includes a vocabulary management server connected to a voice application server and to a telephony server, and an instance of vocabulary management software running on the management server for enabling vocabulary establishment and management for voice recognition software. In practice of the enhanced vocabulary management capability, an administrator accessing the vocabulary management server uses the vocabulary management software to create unique vocabulary sets or lists that are specific to selected portions of vocabulary associated with target data sources, the vocabulary sets differing in content according to administrator direction.
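The effect of constraining recognition to an administrator-defined vocabulary set can be illustrated with a short Python sketch. It is not the vocabulary management software described above: the class `VocabularyManager`, the function `constrain`, the data-source names, and the recognition scores are all hypothetical.

```python
class VocabularyManager:
    """Holds vocabulary sets keyed by (data source, portion of that source)."""

    def __init__(self):
        self._sets = {}

    def define(self, source, portion, words):
        # An administrator creates a set specific to one portion of a source.
        self._sets[(source, portion)] = {w.lower() for w in words}

    def active_vocabulary(self, source, portion):
        return self._sets.get((source, portion), set())


def constrain(hypotheses, vocabulary):
    """Return the best-scoring hypothesis that is in the active vocabulary.

    `hypotheses` is a list of (word, score) pairs from a recognizer;
    returns None when no hypothesis is allowed.
    """
    allowed = [(w, s) for w, s in hypotheses if w.lower() in vocabulary]
    return max(allowed, key=lambda pair: pair[1])[0] if allowed else None


manager = VocabularyManager()
manager.define("bank-site", "accounts", ["loan", "balance", "transfer"])

# "lone" scores highest acoustically but is outside the active set.
hypotheses = [("lone", 0.52), ("loan", 0.48), ("balance", 0.10)]
best = constrain(hypotheses, manager.active_vocabulary("bank-site", "accounts"))
print(best)
```

With the vocabulary limited to the words relevant to the active dialog, a near-homophone outside the set cannot win, which is the kind of constraint the paragraph above describes.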
It will be appreciated by one with skill in the art of voice application deployment architecture that many users vying to connect and interact with a voice portal may in some cases create a bottleneck wherein data lines connecting voice application components to Web-sources and other data sources become taxed to their capacities. This problem may occur especially at peak use periods as is common for many normal telephony environments. It has occurred to the inventor that still more streamlining in terms of traffic optimization is required to alleviate potential line-use issues described above.
A particular enhancement to the voice application distribution system known to the inventor addresses the traffic challenges described in the above paragraph. Application logic is provided for determining which portions (dialogs) of a voice application for deployment are cached at an application-receiving end system based on static and dynamic rules; in some cases (dynamic caching), statistical analysis results are used in the determination. The application logic utilizes a processor for processing the voice application according to sequential dialog files and rules of the application. Logic components include a static content optimizer connected to the processor for identifying files containing static content, and a dynamic content optimizer connected to the processor for identifying files containing dynamic content. The optimizers determine which files should be cached at which end-system facilities, tag the files accordingly, and prepare those files for distribution to selected end-system cache facilities for local retrieval during consumer interaction with the deployed application.
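A rough Python sketch of the two optimizers follows. It assumes, purely for illustration, that each dialog file carries a static/dynamic flag and two usage statistics, and that dynamic content is worth caching when it rarely changes between requests; the names `Dialog`, `static_optimizer`, `dynamic_optimizer`, the `max_change_rate` threshold, and the cache target label are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Dialog:
    name: str
    is_static: bool
    requests: int = 0       # hypothetical statistic: times the dialog was requested
    changes: int = 0        # hypothetical statistic: times its content changed
    cache_at: str = "none"  # tag naming the end-system cache, if any


def static_optimizer(dialogs, target="telephony-server"):
    """Tag every static dialog for caching at the end system."""
    for d in dialogs:
        if d.is_static:
            d.cache_at = target


def dynamic_optimizer(dialogs, target="telephony-server", max_change_rate=0.1):
    """Tag dynamic dialogs whose observed change rate is low enough to cache."""
    for d in dialogs:
        if not d.is_static and d.requests > 0:
            if d.changes / d.requests <= max_change_rate:
                d.cache_at = target


dialogs = [
    Dialog("greeting", is_static=True),
    Dialog("branch-hours", is_static=False, requests=200, changes=4),
    Dialog("account-balance", is_static=False, requests=200, changes=180),
]
static_optimizer(dialogs)
dynamic_optimizer(dialogs)
for d in dialogs:
    print(d.name, d.cache_at)
```

Here the greeting and the rarely changing branch hours are tagged for local caching, while the frequently changing balance dialog is left to be fetched from the application server.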
Being able to retrieve dialog portions of a voice application from a local cache facility improves response time at the voice portal by decreasing the load on the network connection to the voice application server. However, in addition to reduced traffic requirements, it is also important that text-to-speech and speech-to-text renderings are clear and accurate. Accuracy of synthesized speech delivered to a caller is key to creating a successful voice application that can be interacted with in a dynamic fashion at both ends.
As voice application distribution architectures expand to cross regional boundaries and even cultural boundaries, the prospect of standardizing speech recognition rules dealing with terms and phrases that are commonly spoken becomes increasingly difficult. For example, pronunciations of certain terms in the same language will vary significantly according to region. Common labels, such as the way major roads and highways are written and spoken, can also vary significantly. There are many examples of phrase and term variations that need to be addressed if voice application interaction is practiced on larger architectures spanning large geographic regions.
What is clearly needed is a method and apparatus for preprocessing text-to-speech renderings according to prevailing locally dependent parameters and rules that address the types of variances in specific terms, pronunciation parameters, and phonic variations in spoken phrases. Such a method and apparatus would increase accuracy in voice synthesizing of text renderings, thereby increasing caller and enterprise satisfaction and further reducing network traffic associated with error message propagation and interaction re-starts.
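To make the kind of preprocessing contemplated here concrete, the following Python sketch rewrites a highway label differently per region before the text would be handed to a synthesizer. It is an illustration only: the region identifiers, the rule table `RULES`, and the function `preprocess` are hypothetical, and real locale rules would be far richer than one regular expression per region.

```python
import re

# Hypothetical per-region rewrite rules applied before text-to-speech rendering.
RULES = {
    "region-a": [(re.compile(r"\bHwy\.? (\d+)\b"), r"Highway \1")],
    "region-b": [(re.compile(r"\bHwy\.? (\d+)\b"), r"the \1")],
}


def preprocess(text, region):
    """Apply the region's rules in order; unknown regions leave the text as-is."""
    for pattern, replacement in RULES.get(region, []):
        text = pattern.sub(replacement, text)
    return text


print(preprocess("Take Hwy 101 north.", "region-a"))
print(preprocess("Take Hwy 101 north.", "region-b"))
```

The same written label yields two different spoken forms, which is the variance a single standardized rule set would not capture.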