1. Technical Field
This invention relates to the field of voice recognition and more particularly to a speech application for use in a Voice over IP protocol network.
2. Description of the Related Art
LAN telephony, which means “the integration of telephony and data services provided by packet-switched data networks,” is a technology that raises the level of person-to-person communication while lowering the associated costs. LAN telephony enables a more flexible and cost-efficient use of many applications, for example automated call distribution, interactive voice response, and voice logging. This is in contrast to the relatively limited integration offered by the current voice/data integration paradigm, computer-telephony integration, in which voice traffic is kept separate from data traffic and carried over circuit-switched links. The old paradigm for integrating data and voice has been to use the circuit-switched telephony fabric for data communications. That approach has obvious drawbacks: the relatively low bandwidth available to data traffic, the inefficiency of circuit-switched data communications due to the “bursty” nature of data traffic, and the limited voice/data integration possibilities. These drawbacks have led to present topologies in which IP data servers are bundled with proprietary PBXs or voice circuit switches in order to provide a loose integration between circuit-switched and packet-switched networks, while voice is still carried by the circuit-switched network.
One of the most common uses of LAN telephony is in the enterprise Internet/Intranet environment, where it is referred to as IP telephony. The Voice over IP (“VoIP”) protocol is the protocol by which voice traffic can be transmitted across IP networks. In a VoIP network, analog speech signals received from an analog speech audio source, for example a PSTN or a microphone, are digitized, compressed and translated into IP packets for transmission over an IP network. Several well-known protocols implement the VoIP protocol specification, including H.323, Session Initiation Protocol (“SIP”) and Media Gateway Control Protocol (“MGCP”).
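The translation of digitized speech into packets can be illustrated with a minimal sketch. The source does not name a media transport or codec, so the sketch assumes an RTP-style 12-byte header (as defined in RFC 3550) and audio that has already been compressed to 8 kHz G.711 mu-law, one byte per sample. The function names, the fixed SSRC value, and the zero-based sequence numbers and timestamps are illustrative assumptions; a real sender would choose random starting values.

```python
import struct

RTP_VERSION = 2
PAYLOAD_TYPE_PCMU = 0        # static payload type for G.711 mu-law (assumed codec)
SAMPLES_PER_PACKET = 160     # 20 ms of audio at 8 kHz, one byte per sample


def build_rtp_packet(payload: bytes, seq: int, timestamp: int, ssrc: int) -> bytes:
    """Prefix one chunk of encoded audio with a fixed 12-byte RTP-style header."""
    first_byte = RTP_VERSION << 6      # version 2, no padding, no extension, no CSRCs
    second_byte = PAYLOAD_TYPE_PCMU    # marker bit clear
    header = struct.pack(
        "!BBHII",
        first_byte,
        second_byte,
        seq & 0xFFFF,                  # 16-bit sequence number
        timestamp & 0xFFFFFFFF,        # 32-bit timestamp in sample units
        ssrc,                          # 32-bit stream identifier
    )
    return header + payload


def packetize(encoded_audio: bytes, ssrc: int = 0x1234ABCD) -> list[bytes]:
    """Split already-compressed audio into 20 ms packets."""
    packets = []
    for seq, offset in enumerate(range(0, len(encoded_audio), SAMPLES_PER_PACKET)):
        chunk = encoded_audio[offset:offset + SAMPLES_PER_PACKET]
        # With one byte per sample, the byte offset is also the sample timestamp.
        packets.append(build_rtp_packet(chunk, seq, offset, ssrc))
    return packets
```

For example, `packetize(bytes(400))` yields three packets: two carrying 160 bytes of audio and a final one carrying the remaining 80, each preceded by its 12-byte header.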
A common application for IP telephony is the integration of voice mail (“v-mail”) and electronic mail (“e-mail”). Another application can include voice logging by financial or emergency-response organizations. Additionally, automated call distribution (“ACD”) can be facilitated whereby an ACD server performs value-based queuing of incoming telephone calls. Finally, interactive voice response systems can incorporate IP telephony in which responses are preprogrammed in a server as a workflow component. Still, speech recognition and speech synthesis applications (“speech applications”) have lagged in the use of IP telephony.
In particular, speech applications operate on real-time audio signals, which cannot tolerate the latencies associated with traditional data communications. As such, where speech applications have been incorporated in an IP telephony topology, the speech applications have been closely integrated with the IP telephony server in order to preclude a negative impact from network-based latencies. Accordingly, the design and development of such IP telephony enabled speech applications have been closely linked to the proprietary nature of the IP telephony server.
The tight linkage between the speech application and the IP telephony server substantially limits both the design and the extensibility of the speech application. Specifically, in the present paradigm the speech application design must incorporate functionality directly related to the chosen protocol for transporting packetized voice data to a speech recognition system and from a speech synthesis system. The development of a superior voice transport protocol, by nature of the tight linkage between the IP telephony server and the speech application, can compel the redesign of the speech application. Accordingly, there exists a need for a VoIP-based speech system in which the design and implementation of the speech application remains separate from the design and implementation of the IP telephony system.
It is an object of the present invention to provide a VoIP-based speech system in which the design and implementation of the speech application remains separate from the design and implementation of the IP telephony system. It is a further object of the present invention to provide a VoIP-enabled speech server which can receive audio input from the IP telephony system over a VoIP network. It is yet another object of the present invention to provide a method for coupling a speech application to a telephony gateway server in a VoIP network. Finally, it is an object of the present invention to provide each of the VoIP-based speech system, the VoIP-enabled speech server and the method for coupling the speech application to the telephony gateway server using standards-based interfaces to the VoIP network, the telephony gateway server and the speech application.
These and other objects of the present invention are accomplished in a VoIP-based speech system including: a VoIP telephony gateway server; at least one speech server, each speech server containing a VoIP-enabled speech application; a VoIP-compliant call control interface between the VoIP telephony gateway server and the speech server; and, a VoIP communications path between the VoIP telephony gateway server and the speech application in the at least one speech server. In the VoIP-based speech system, the VoIP telephony gateway server and the speech application can establish the VoIP communications path through the VoIP-compliant call control interface.
In operation, the VoIP telephony gateway server can receive audio signals from a telephony interface, digitize the audio signals into digitized audio data, compress the digitized audio data into VoIP-compliant packets, and transmit the VoIP-compliant packets to the speech application in the at least one speech server through the VoIP communications path using the VoIP protocol. Correspondingly, the speech application can receive the VoIP-compliant packets, reconstruct the digitized audio data from the VoIP-compliant packets, and convert the digitized audio data from speech to text. In addition, the speech application can synthesize text into digitized audio data, encapsulate the digitized audio data in VoIP-compliant packets and transmit the VoIP-compliant packets through the VoIP communications path to the VoIP telephony gateway server. Subsequently, the VoIP telephony gateway server can receive the VoIP-compliant packets, reconstruct the digitized audio data from the VoIP-compliant packets, and transmit the digitized audio data through the telephony interface.
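The reconstruction step on the receiving side can be sketched as follows. Because packets on an IP network may arrive out of order or more than once, the sketch reorders them by sequence number and keeps only the first copy of each before joining the payloads. The `VoicePacket` type and `reconstruct_audio` function are hypothetical names introduced for illustration, and the sketch assumes that sequence numbers do not wrap around and that missing packets are simply skipped; the source does not prescribe either behavior.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class VoicePacket:
    """Hypothetical stand-in for a received VoIP-compliant packet."""
    seq: int          # sender-assigned sequence number
    payload: bytes    # chunk of digitized audio data


def reconstruct_audio(packets: Iterable[VoicePacket]) -> bytes:
    """Rebuild the digitized audio stream from packets in arrival order."""
    by_seq: dict[int, bytes] = {}
    for packet in packets:
        # setdefault keeps the first copy and ignores later duplicates.
        by_seq.setdefault(packet.seq, packet.payload)
    # Join payloads in sequence order, regardless of arrival order.
    return b"".join(by_seq[seq] for seq in sorted(by_seq))
```

The resulting byte string is what would then be handed to the speech recognition engine for speech-to-text conversion.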
In one aspect of the present invention, the VoIP telephony gateway server can include a telephony interface and a VoIP Gatekeeper. The VoIP Gatekeeper can receive a voice call through the telephony interface and, in response, can choose a speech server from among the speech servers. Once a speech server has been chosen, the VoIP Gatekeeper can alert the VoIP-enabled speech application in the chosen speech server that the voice call has been received.
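The Gatekeeper's choose-then-alert behavior can be sketched as below. The source does not say how a speech server is chosen, so the sketch assumes a least-loaded policy (fewest active calls) purely for illustration. The class and method names are hypothetical, and the alert is modeled as a simple in-memory call rather than a message over a call control protocol.

```python
from dataclasses import dataclass, field


@dataclass
class SpeechServer:
    """Hypothetical speech server hosting one VoIP-enabled speech application."""
    name: str
    active_calls: int = 0
    alerts: list = field(default_factory=list)

    def alert_incoming_call(self, call_id: str) -> None:
        # Record the alert and count the call as active on this server.
        self.alerts.append(call_id)
        self.active_calls += 1


class Gatekeeper:
    """Chooses a speech server for each received voice call and alerts it."""

    def __init__(self, servers):
        self.servers = list(servers)
        if not self.servers:
            raise ValueError("at least one speech server is required")

    def route_call(self, call_id: str) -> SpeechServer:
        # Assumed policy: pick the server with the fewest active calls.
        # On a tie, min() returns the earliest server in the list.
        chosen = min(self.servers, key=lambda server: server.active_calls)
        chosen.alert_incoming_call(call_id)
        return chosen
```

Any other policy, such as round-robin or selection by language or application type, could replace the `min` call without changing the rest of the flow.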
In another aspect of the present invention, the speech server can include a speech recognition engine; a text-to-speech engine; a call control interface for establishing a voice call connection through the VoIP telephony gateway server; and, an audio data path. Notably, the audio data path can stream audio data through the established voice call connection to the speech recognition engine. Similarly, the audio data path can stream audio data through the established voice call connection from the text-to-speech engine.
In yet another aspect of the present invention, the speech application can be a speech browser. The speech browser can retrieve Web content responsive to voice commands received through the VoIP communications path. Also, the speech browser can speech synthesize the retrieved Web content into audio data. Finally, the speech browser can transmit the audio data through the VoIP communications path to the VoIP telephony gateway server. Significantly, the Web content can be a VoiceXML document.
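As a rough illustration of how a speech browser might turn retrieved VoiceXML content into text for synthesis, the sketch below collects the text of every `prompt` element in a document using Python's standard XML parser. The sample document, the function name, and the decision to read only `prompt` elements are assumptions made for the example. The sample omits an XML namespace for brevity; documents that declare one would need namespace-qualified tag names.

```python
import xml.etree.ElementTree as ET

SAMPLE_VXML = """\
<vxml version="1.0">
  <form>
    <block>
      <prompt>Welcome to the weather service.</prompt>
      <prompt>Say a
        city name.</prompt>
    </block>
  </form>
</vxml>
"""


def extract_prompts(vxml_text: str) -> list[str]:
    """Return the text of each prompt element, in document order."""
    root = ET.fromstring(vxml_text)
    prompts = []
    for element in root.iter("prompt"):
        # Gather nested text and collapse runs of whitespace to single spaces.
        text = " ".join("".join(element.itertext()).split())
        if text:
            prompts.append(text)
    return prompts
```

Each returned string would then be passed to the text-to-speech engine, and the resulting audio data transmitted over the VoIP communications path.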
Preferably, the speech server can be implemented using standards-based interfaces to the VoIP telephony gateway server, the VoIP communications path, and the speech application. Specifically, the speech server can include a speech recognition engine; a text-to-speech engine; a JSAPI speech interface; a JTAPI telephony interface; and a JMF media interface. The JTAPI telephony interface can establish a voice call connection for transporting digital audio data between the VoIP telephony gateway server and the speech application. The JMF media interface can establish a data path for transporting the digital audio data between the speech application and the voice call connection. The JSAPI speech interface can communicate the digitized audio data from the speech application to the speech recognition engine. Similarly, the JSAPI speech interface can communicate speech synthesized audio data from the text-to-speech engine to the speech application.
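The separation that these three interfaces provide can be sketched in a language-neutral way. The sketch below defines three small abstract roles corresponding to call control, media transport, and speech processing, and a speech application that depends only on those roles. None of the names or signatures are taken from JTAPI, JMF, or JSAPI; they are hypothetical stand-ins, and the in-memory implementations exist only so that the example runs on its own.

```python
from typing import Protocol


class CallControl(Protocol):
    """Role the source assigns to the telephony interface."""
    def answer(self, call_id: str) -> None: ...


class MediaPath(Protocol):
    """Role the source assigns to the media interface."""
    def read_audio(self) -> bytes: ...
    def write_audio(self, audio: bytes) -> None: ...


class SpeechEngines(Protocol):
    """Role the source assigns to the speech interface."""
    def recognize(self, audio: bytes) -> str: ...
    def synthesize(self, text: str) -> bytes: ...


class SpeechApplication:
    """Depends only on the three roles, not on any transport protocol."""

    def __init__(self, calls: CallControl, media: MediaPath, speech: SpeechEngines) -> None:
        self.calls = calls
        self.media = media
        self.speech = speech

    def handle_call(self, call_id: str) -> str:
        self.calls.answer(call_id)                       # establish the connection
        heard = self.speech.recognize(self.media.read_audio())
        reply = self.speech.synthesize(f"You said: {heard}")
        self.media.write_audio(reply)                    # stream the reply back
        return heard


# In-memory stand-ins so the sketch can be exercised without a network.
class FakeCalls:
    def __init__(self) -> None:
        self.answered: list[str] = []

    def answer(self, call_id: str) -> None:
        self.answered.append(call_id)


class LoopbackMedia:
    def __init__(self, incoming: bytes) -> None:
        self.incoming = incoming
        self.sent: list[bytes] = []

    def read_audio(self) -> bytes:
        return self.incoming

    def write_audio(self, audio: bytes) -> None:
        self.sent.append(audio)


class TextAsAudioEngines:
    """Treats UTF-8 text as if it were audio, purely for demonstration."""
    def recognize(self, audio: bytes) -> str:
        return audio.decode("utf-8")

    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")
```

Because `SpeechApplication` sees only the three roles, a different voice transport could be substituted behind `CallControl` and `MediaPath` without redesigning the application, which is the decoupling the present invention aims for.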
The present invention can also be embodied in a VoIP-enabled speech server which can include a speech application which can be configured to communicate with a VoIP telephony gateway server over a VoIP communications path. The VoIP-enabled speech server can also include a VoIP-compliant call control interface to the VoIP telephony gateway server, the VoIP-compliant call control interface establishing the VoIP communications path. In operation, the speech application can receive VoIP-compliant packets from the VoIP telephony gateway server over the VoIP communications path. Subsequently, digitized audio data can be reconstructed from the VoIP-compliant packets, and the digitized audio data can be speech-to-text converted. Additionally, text can be synthesized into digitized audio data and the digitized audio data can be encapsulated in VoIP-compliant packets which can be transmitted over the VoIP communications path to the telephony gateway server.
In another aspect of the VoIP-enabled speech server, the VoIP-enabled speech server can include a speech recognition engine, a text-to-speech engine and an audio data path. The audio data path can stream audio data through the established voice call connection to the speech recognition engine. Also, the audio data path can stream audio data through the established voice call connection from the text-to-speech engine.
Preferably, the speech application is a speech browser. The speech browser can retrieve Web content responsive to voice commands received through the VoIP communications path. The speech browser can also speech synthesize the retrieved Web content into audio data. Subsequently, the speech browser can transmit the audio data through the VoIP communications path to the VoIP telephony gateway server. Significantly, the Web content can be a VoiceXML document.
Preferably, the VoIP-enabled speech server can be implemented using standards-based interfaces to the VoIP telephony gateway server, the VoIP communications path, and the speech application. Specifically, the VoIP-enabled speech server can include a JTAPI telephony interface for establishing a voice call connection for transporting digital audio data between the telephony gateway server and the speech application. Additionally, the VoIP-enabled speech server can have a JMF media interface for establishing a data path for transporting the digital audio data between the speech application and the voice call connection. Finally, the VoIP-enabled speech server can have a JSAPI speech interface both for communicating the digitized audio data from the speech application to the speech recognition engine, and for communicating speech synthesized audio data from the text-to-speech engine to the speech application.
Finally, the present invention can include a method for coupling a speech application to a telephony gateway server in a VoIP network. The method can include the steps of establishing a VoIP communications path with the VoIP telephony gateway server and configuring the speech application to communicate with the telephony gateway server over the established VoIP communications path. Additionally, VoIP-compliant packets can be received from the telephony gateway server over the established VoIP communications path. Digitized audio data can be reconstructed from the VoIP-compliant packets and, subsequently, the digitized audio data can be speech-to-text converted. Additionally, the method can include the steps of synthesizing text into digitized audio data; encapsulating the digitized audio data in VoIP-compliant packets; and, transmitting the VoIP-compliant packets over the VoIP communications path to the telephony gateway server.
In the preferred embodiment, the method can further include the steps of retrieving Web content responsive to speech recognized voice commands received through the VoIP communications path; synthesizing the retrieved Web content into audio data; and, transmitting the audio data through the VoIP communications path to the telephony gateway server. Significantly, the Web content can be a VoiceXML document.