Within the last several years, the growth of e-commerce and e-business has greatly increased the number of channels through which customers can contact businesses. Businesses have invested in a variety of e-commerce programs, from informational Web sites to transactional sites to Web-based applications. Businesses have also begun to develop dedicated portals for business-to-business (B2B) partners as well as employees. Using these various channels, businesses have increased the exposure of their products and have also increased the modes of interaction with customers, partners, and employees. To keep up with the growing e-commerce model, companies now typically consider and plan for interactions through multiple contact points, such as personal computers (PCs), personal digital assistants (PDAs), email pagers, and Web- and data-enabled mobile phones, in addition to traditional call centers.
In traditional call centers, businesses enjoy a certain level of automation through interactive voice response (IVR) systems, which typically use voice recognition and dual-tone multi-frequency (DTMF) recognition to obtain data and which interact with callers through speech response, either pre-recorded or synthetically generated. IVRs allow for a better user experience while decreasing staffing costs. With the addition of Web-based access, some of the convenience of speech interactivity has been replaced with the convenience of keyboard entry at a computer. However, PDAs, Web-enabled mobile phones, two-way text pagers, and other devices capable of multimodal communication now add new access points for which a well-established solution has yet to be developed.
One emerging solution centers on the convergence of telephony with the Web. PDAs, which began as purely mobile data devices, are now being developed with voice capabilities. Similarly, mobile phones, which originally were limited to pure telephony functionality, are now being developed to interact with the Web and other data networks. Furthermore, as wireless networks evolve to 2.5G and 3G systems, the increased data throughput generally supports more robust mobile Web-based applications and services.
MICROSOFT™ CORPORATION has developed Speech Application Language Tags (SALT) to add a powerful speech interface to Web pages, while maintaining and leveraging all the advantages of the Web application model. SALT is one of the many languages derived from standard generalized markup language (SGML), such as hypertext markup language (HTML), extensible HTML (XHTML), wireless markup language (WML), and the like. Another speech interface language is VoiceXML™ (VXML). VXML is an XML application which, when combined with voice recognition technology, enables interactive access to the Web through a telephone or voice-driven browser. The main difference between VXML and SALT is that VXML utilizes client-side execution while SALT utilizes server-side execution.
SALT tags are designed to be used for both voice-only browsers (i.e., browsers accessible over a telephony server) and multimodal browsers, such as a typical Web browser like MICROSOFT INTERNET EXPLORER™. SALT is actually a small set of extensible markup language (XML) elements, with associated attributes and document object model (DOM) object properties, events, and methods, that may be used in conjunction with a source markup document to apply a speech interface to the source page. Because SALT maintains a relatively strict syntax, independent from the nature of the source documents, it can generally be used effectively within most flavors of HTML, WML, or other such SGML-derived markup languages.
MICROSOFT™ CORPORATION'S .NET Speech SDK, Beta 2 comprises MICROSOFT™ CORPORATION'S ASP.NET controls, a Speech Add-in for MICROSOFT™ CORPORATION'S INTERNET EXPLORER™, and numerous libraries and sample applications. The development tools for implementing the speech integration with SALT are provided in MICROSOFT™ CORPORATION'S Visual Studio .NET™. In its implementation, the MICROSOFT™ architecture generally relies on the integral use of, and intercommunication between, SALT-enabled browsers.
FIG. 1 is a block diagram illustrating converged voice-Web system 10. Converged voice-Web system 10 illustrates the architecture implemented in the MICROSOFT™-offered solution. The core of the system comprises telephony server 100, speech server 101, and Web server 102. Web server 102 facilitates Web access by client 103 through Internet 11, while telephony card 104, which may be a T1 or E1 interface card, facilitates telephone access by clients 110, 111, and 109 through the public switched telephone network (PSTN) 105, private branch exchange (PBX) 111, and mobile switching center (MSC) 107. Client 108 may be a PDA or other dual-mode wireless device that may access converged voice-Web system 10 through telephony card 104 or Internet 11. Similarly, client 109 may be a multimode mobile phone capable of either telephony operation, accessing system 10 through telephony card 104, or Web operation, accessing system 10 through Internet 11.
In an example where a client accesses system 10 via PSTN 106, client 111 accesses telephony server 100 through telephony card 104. Telephony server 100 operates closely with speech server 101 to facilitate audible caller interaction. Speech server 101 typically includes a voice component driver, such as a SALT driver. Once client 111 has connected, telephony server 100 accesses Web server 102 to obtain application server page 112. Application server page 112 includes SALT tags that identify voice functionality in the application. Telephony server 100 processes application server page 112. When the embedded SALT tags are encountered, telephony server 100 accesses the SALT driver in speech server 101 to execute the SALT code. Any SALT tags received from application server page 112 are thus executed on speech server 101, with call control maintained on telephony server 100. Once the code has finished processing, the results are returned to telephony server 100, which either completes the call or disconnects, depending on the application being run.
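The division of labor in the telephony example, in which the telephony server walks the page and hands each SALT element to the speech server, can be sketched roughly as follows. This is only an illustrative sketch, not the MICROSOFT™ implementation: the class names, the sample page, and the namespace URI are assumptions introduced here, and the speech server is a stub that merely records what it was asked to run.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for SALT elements; treat it as illustrative only.
SALT_NS = "http://www.saltforum.org/2002/SALT"

# Hypothetical stand-in for application server page 112.
PAGE = f"""<html xmlns:salt="{SALT_NS}">
  <body>
    <p>Welcome</p>
    <salt:prompt id="askCity">Which city?</salt:prompt>
    <salt:listen id="recoCity"/>
  </body>
</html>"""


class SpeechServer:
    """Stub for speech server 101: records each SALT element it is given."""

    def __init__(self):
        self.executed = []

    def execute(self, tag, attrs, text):
        self.executed.append((tag, attrs.get("id"), text))
        return f"{tag}:{attrs.get('id')}"


class TelephonyServer:
    """Stub for telephony server 100: keeps control of the page walk."""

    def __init__(self, speech_server):
        self.speech_server = speech_server

    def process_page(self, page_source):
        results = []
        prefix = "{" + SALT_NS + "}"
        # Visit elements in document order; only SALT elements are delegated.
        for element in ET.fromstring(page_source).iter():
            if element.tag.startswith(prefix):
                local_name = element.tag[len(prefix):]
                text = (element.text or "").strip()
                results.append(
                    self.speech_server.execute(local_name, element.attrib, text)
                )
        return results
```

In this sketch the ordinary markup (the `p` element) is skipped by the delegation step, which mirrors the statement that only the embedded SALT tags are executed on the speech server.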
In a multimodal example, client 103 accesses Web server 102 through Internet 11. Application server page 112 is processed on Web server 102 to return the Web pages for display on client 103. In order to take advantage of the SALT functionality written into application server page 112, client 103 typically needs a browser that includes a SALT driver along with text-to-speech (TTS) resources. In existing implementations of system 10, individual pieces of call flow have been written as separate components, such that a developer may create an entire call flow for performing a specific task by assembling several different components. However, as indicated, client 103 must typically include a browser with SALT capabilities along with TTS resources. Furthermore, because application server page 112 includes speech components, the application developer generally must be familiar with the specific speech interface language being utilized, such as SALT, VXML, or the like.
FIG. 2 is a block diagram illustrating existing proprietary converged voice-Web system 20. In proprietary system 20, application server page 202 may be created with specific voice modules 203-205 embedded within the HTML or similar language representation of a Web page. As client 201 accesses Web server 200, application server page 202 is called and executed. Proprietary system 20 monitors the execution of application server page 202 and, when each of voice modules 203-205 is encountered, system 20 generates code, such as SUN MICROSYSTEMS' JAVASCRIPT™ or the like, that is executable on Web server 200 to provide the specific speech components. Such proprietary systems are typically programmed specifically to provide particular desired features. However, because of their proprietary nature, only compatible systems and software may be used, thus limiting the accessibility of voice modules 203-205.
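The module-expansion step described for proprietary system 20 can be pictured with a small sketch. Everything concrete here is hypothetical: the source does not say how voice modules 203-205 are marked in the page or what code is generated, so the `voice-module` marker, the generator table, and the emitted script text are assumptions made purely for illustration.

```python
import re

# Hypothetical marker syntax for an embedded voice module.
MODULE_PATTERN = re.compile(r'<voice-module name="(\w+)"\s*/>')

# Hypothetical generators: each returns script text for one known module.
GENERATORS = {
    "greeting": lambda: "playPrompt('Welcome');",
    "menu": lambda: "collectDigits(1);",
}


def expand_voice_modules(page):
    """Replace each recognized voice module with generated script text."""

    def replace(match):
        generator = GENERATORS.get(match.group(1))
        if generator is None:
            # A module the system does not know stays unexpanded, which is
            # one way to picture the compatibility limit noted above.
            return match.group(0)
        return "<script>" + generator() + "</script>"

    return MODULE_PATTERN.sub(replace, page)
```

A page containing one known and one unknown module would come back with only the known module replaced by script, while the unknown marker passes through unchanged.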
A further improvement has been suggested by SANDCHERRY, INC., to make media resources, such as text-to-speech (TTS) resources, automatic speech recognition (ASR) resources, and the like, available to a central application as component servers using one of the signaling protocols for voice over Internet protocol (VoIP), session initiation protocol (SIP). FIG. 3 is a block diagram illustrating SIP-based local component server system 30. Enterprise local area network (ELAN) 300 is typically in communication with caller 301 through media gateway 302. Caller 301 connects to Web server 305 to access a particular application. Web server 305 interfaces with enterprise content 306 that provides the voice site content, such as VXML, SALT tags, prompts, grammars, and the like. As the application accessed by caller 301 requires media resources, such as service controller 305, TTS server 307, and ASR server 310, ELAN 300 communicates with the necessary media resource through each one's SIP interface 306, 308, and 311. Additionally, because TTS 307 and ASR 310 include live speech transfer, those media resources include real-time transport protocol (RTP) interfaces 309 and 312. Web server 305 processes the information from the media resources and from enterprise content 306 and implements the resulting voice pages through VXML browser 303 and its SIP interface 304. Caller 301 is then able to interact with the desired application, with the voice/speech information processed through VXML browser 303. Local component server system 30 is managed and monitored by voice site monitor 313.
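The component-server arrangement, in which every media resource is reached through a SIP interface and only the resources that carry live speech also expose an RTP interface, can be summarized in a short sketch. The registry, the SIP URIs, and the function names below are hypothetical and are not drawn from the SANDCHERRY, INC., design; no actual SIP or RTP traffic is generated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentServer:
    name: str
    sip_uri: str          # hypothetical address of the resource's SIP interface
    carries_speech: bool  # True when live speech is transferred

    def interfaces(self):
        # Every resource is signaled over SIP; speech-bearing ones add RTP.
        return ("SIP", "RTP") if self.carries_speech else ("SIP",)


# Hypothetical registry of the media resources named in the text.
REGISTRY = {
    "service_controller": ComponentServer(
        "service_controller", "sip:controller@elan.example", False
    ),
    "tts": ComponentServer("tts", "sip:tts@elan.example", True),
    "asr": ComponentServer("asr", "sip:asr@elan.example", True),
}


def plan_sessions(required):
    """Return (SIP URI, interfaces) for each resource an application needs."""
    return [(REGISTRY[name].sip_uri, REGISTRY[name].interfaces()) for name in required]
```

An application needing TTS and the service controller would thus be pointed at two SIP addresses, with an RTP interface expected only for the TTS resource.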
The component server system described by SANDCHERRY, INC., has also been suggested in a remotely provisioned configuration. FIG. 4 is a block diagram illustrating SIP-based distributed component server system 40. In distributed component server system 40, the desired application is provisioned through an application service provider (ASP) on ASP local area network (ASP-LAN) 401. ASP-LAN 401 communicates with caller 402 through media gateway 403. When caller 402 wishes to interact with the desired application, the application is accessed through Web server 404 and ASP hosted sites 405. Furthermore, as the application requires media resources, it may access the media resource component servers, such as service controller 408, VXML browser 406, ASR server 410, TTS server 413, or the like, through each resource's SIP interface 407, 409, 411, 414. Additionally, as before, ASR server 410 and TTS server 413 are also serviced by RTP interfaces 412 and 415, respectively.
Although FIGS. 3 and 4 show the component server system implemented using VXML, SANDCHERRY, INC., has also suggested implementing this system using other open architecture features such as SALT, SOAP, and Web services.