1. Field of the Invention
The present invention relates to the field of telecommunications and, more particularly, to a telecommunications voice server that establishes call-based audio sockets.
2. Description of the Related Art
The WebSphere Application Server (WAS) by International Business Machines Corporation (IBM) of Armonk, N.Y. can be utilized by a telecommunications voice server. When so utilized, the WAS can handle a multitude of telephony-related tasks, some of which require the services of external speech engines. The speech engines can perform speech-to-text conversions, text-to-speech conversions, and other automated speech-related functions for the WAS.
Many speech engines, such as the IBM automatic speech recognition (ASR) engine, can use customizable dynamic link libraries (DLLs) to define different audio sources. The use of customizable DLLs permits the speech engines to modularly handle a breadth of different audio sources, different audio formats, and different audio codecs. Using the DLLs, the speech engines can act as audio socket servers, dynamically establishing ports for exchanging information with external components. Further, the speech engines can include application program interfaces (APIs) for facilitating information exchanges. For example, the IBM ASR engine includes an API called the Speech Manager API (SMAPI), which can be used by the WAS to communicate with the IBM ASR. More specifically, a telephony and media (T&M) subsystem of the WAS can interface with the IBM ASR via SMAPI, where the T&M subsystem is generally responsible for performing media conversions between the WAS and a telephony gateway, between the WAS and speech engines, and/or between the speech engines and the telephony gateway.
In operation, a telephony call can be received that requires WAS operations. In response to call establishment, the WAS can be initialized. Initialization includes activating the T&M subsystem to detect audio utterances occurring within the established call. When an utterance is detected, the T&M subsystem can briefly cache the utterance while the WAS determines the appropriate actions to perform. One possible action involves converting the utterance from speech to text. To perform this conversion, the WAS assigns a speech engine to handle the utterance. The speech engine dynamically establishes an audio socket. An identifier for the audio socket is conveyed through the WAS to the T&M subsystem. Upon receiving the identifier, the T&M subsystem conveys the utterance to the selected speech engine via the established audio socket. Once the utterance has been processed by the speech engine, the connection between the T&M subsystem and the speech engine is terminated, and the audio socket is closed and/or reallocated for other processing tasks.
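The per-utterance sequence described above can be illustrated with a minimal sketch. It is a simulation under stated assumptions, not the WAS, SMAPI, or any IBM interface: every class, method, engine name, and port number below is hypothetical, and in-memory buffers stand in for real network sockets.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class SocketId:
    """Hypothetical host/port/protocol identifier for an audio socket."""
    host: str
    port: int
    protocol: str = "tcp"


class SpeechEngine:
    """Stand-in for a speech engine acting as an audio socket server."""
    _ports = itertools.count(5000)  # assumed port range, illustration only

    def __init__(self, name):
        self.name = name
        self.open_sockets = {}

    def open_audio_socket(self):
        # The engine, not the T&M subsystem, picks the socket for each turn.
        sid = SocketId(host=self.name, port=next(SpeechEngine._ports))
        self.open_sockets[sid] = bytearray()
        return sid

    def receive(self, sid, audio):
        self.open_sockets[sid].extend(audio)

    def finish(self, sid):
        # Processing ends the turn; the socket is closed.
        audio = self.open_sockets.pop(sid)
        return f"<transcript of {len(audio)} bytes>"


class TMSubsystem:
    """Stand-in for the telephony and media subsystem."""

    def __init__(self):
        self.cached = b""

    def cache(self, utterance):
        self.cached = utterance

    def send_cached(self, engine, sid):
        engine.receive(sid, self.cached)
        self.cached = b""


class VoiceServer:
    """Stand-in for the voice server that relays the socket identifier."""

    def __init__(self, engines, tm):
        self.engines = engines
        self.tm = tm
        self.turn_count = 0

    def handle_turn(self, utterance):
        self.tm.cache(utterance)            # utterance is cached briefly
        engine = self.engines[self.turn_count % len(self.engines)]
        self.turn_count += 1
        sid = engine.open_audio_socket()    # engine establishes the socket
        self.tm.send_cached(engine, sid)    # identifier relayed to T&M first
        return sid, engine.finish(sid)      # socket closed after processing


engines = [SpeechEngine("asr-1"), SpeechEngine("asr-2")]
server = VoiceServer(engines, TMSubsystem())
first_id, first_text = server.handle_turn(b"\x00" * 160)
second_id, second_text = server.handle_turn(b"\x00" * 320)
```

In this sketch, two utterances from the same call are sent to two different socket identifiers, and no socket remains open between turns.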
It should be appreciated that the WAS, like most high-volume servers, performs turn-based speech engine allocations as opposed to call-based allocations. Turn-based allocation techniques dynamically assign discrete work units, or turns, to speech engines as needed. Call-based allocation techniques provide a one-to-one mapping between speech engines and telephone calls. As speech engines are typically costly and consume extensive computing resources, cost-effective telephony solutions do not generally perform call-based allocation, but rather perform turn-based allocation of speech engines, thereby maximizing the usage of expensive speech engine components.
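The resource argument for turn-based allocation can be made concrete with a short sketch. The call and turn timings below are invented purely for illustration, and the function and variable names are hypothetical. The sketch counts how many engines are needed when one engine is held per call versus one engine per active turn.

```python
def peak_concurrency(intervals):
    """Return the largest number of (start, end) intervals active at once."""
    events = []
    for start, end in intervals:
        events.append((start, 1))    # an engine is acquired
        events.append((end, -1))     # an engine is released
    # Tuples sort so that, at equal times, a release (-1) precedes an acquire (+1).
    events.sort()
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# Hypothetical data: three overlapping 60-second calls (times in seconds).
calls = [(0, 60), (0, 60), (0, 60)]

# Hypothetical turns: short utterance-processing windows within those calls.
turns = [(5, 8), (30, 33),    # call A
         (10, 13), (40, 44),  # call B
         (7, 9), (50, 52)]    # call C

engines_call_based = peak_concurrency(calls)   # one engine held per call
engines_turn_based = peak_concurrency(turns)   # one engine per active turn
```

With these assumed timings, call-based allocation requires three engines for the full duration of the calls, while turn-based allocation never requires more than two at once, since engines sit idle between utterances under the call-based scheme.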
The aforementioned approach for utilizing speech engines, however, can be problematic. One such problem is that numerous turns for processing different utterances are commonly performed during each telephone call. For each turn, the T&M subsystem conveys audio signals to a particular speech engine via a specified audio socket. Accordingly, throughout the call, the T&M subsystem handles continuously changing audio ports that are dynamically allocated by the various speech engines. Moreover, each time a speech engine allocates an audio socket, the host/port/protocol for the audio socket established by the speech engine must be conveyed to the T&M subsystem before audio signals can be conveyed between the T&M subsystem and the speech engine.
Conveying the audio socket information from the speech engine to the T&M subsystem can result in processing delays. These delays can be pronounced when the voice server through which the socket information is conveyed has a componentized and functionally isolated architecture, as does the WAS. Notably, such an architecture does not constantly maintain a call-based control path between the T&M subsystem and the speech engine. A skilled artisan can recognize that this approach is subject to numerous bottlenecks, which can be problematic when the voice server, the T&M subsystem, and/or the speech engines are placed under significant loads. Consequently, it would be highly advantageous to utilize a different approach that reduces the latencies resulting from these bottlenecks.