1. Technical Field
The invention relates to the field of telephony and, more particularly, to improving audio quality in telephony devices.
2. Description of the Related Art
Many legacy telephony devices, for example telephony audio interface cards, do not support audio streaming, that is, the real-time transmission of multimedia data, for example using the Real-time Transport Protocol (RTP). Such non-streaming devices are designed to support interactive voice response applications in a mode referred to as half duplex. An audio interface operating in half duplex mode cannot send and receive audio simultaneously. Rather, the audio interface plays audio and, only when it has finished playing a complete audio prompt, enters a record mode to receive audio from a caller for processing.
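By way of illustration only, the half duplex behavior described above can be sketched as a small state machine. The class and method names below (`HalfDuplexInterface`, `play`, `finish_playing`, `receive`) are hypothetical and do not correspond to any particular telephony card or driver; the sketch assumes only that the interface is either playing or recording, never both.

```python
from enum import Enum


class Mode(Enum):
    IDLE = "idle"
    PLAY = "play"
    RECORD = "record"


class HalfDuplexInterface:
    """Hypothetical model of a non-streaming, half duplex audio interface."""

    def __init__(self) -> None:
        self.mode = Mode.IDLE

    def play(self, prompt: bytes) -> None:
        # A new prompt cannot be accepted while one is still playing.
        if self.mode is Mode.PLAY:
            raise RuntimeError("interface is busy playing a prompt")
        self.mode = Mode.PLAY

    def finish_playing(self) -> None:
        # Record mode is entered only after the complete prompt has played.
        if self.mode is not Mode.PLAY:
            raise RuntimeError("no prompt is playing")
        self.mode = Mode.RECORD

    def receive(self, caller_audio: bytes) -> bool:
        # Caller audio is accepted only in record mode; audio arriving
        # during playback is not processed.
        return self.mode is Mode.RECORD


iface = HalfDuplexInterface()
iface.play(b"welcome prompt")
heard_during_play = iface.receive(b"caller speech")   # False: still playing
iface.finish_playing()
heard_after_play = iface.receive(b"caller speech")    # True: now recording
```

In this model, caller speech that arrives while a prompt is playing is simply not processed, which is why a half duplex interface cannot support interruption of a prompt by the caller.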
Modern telephony platform products such as WebSphere Voice Server, available from International Business Machines Corporation of Armonk, N.Y., and other products which can include the speech recognition and text-to-speech (TTS) functions commonly provided as part of interactive voice response (IVR) systems, can simultaneously send and receive audio. That is, such products, referred to as full duplex products, can receive audio for processing within a speech recognition engine while sending audio generated by a text-to-speech engine, a voice browser, or a voice server over the telephony system. Pre-recorded audio clips can be provided to a caller as well. Because full duplex products send and receive audio at the same time, they allow users to “barge in”, that is, to interrupt the playing of a long audio prompt with a spoken response.
TTS engines, voice browsers, voice servers, and other audio sources that stream audio within full duplex systems tend to send a complete audio prompt as a series of randomly sized portions of audio. When audio portions are provided to a legacy telephony audio interface for playback to a caller in this manner, the interface is not able to determine a start point or an end point for the complete audio prompt. Rather, the telephony audio interface plays each audio portion as it is received, as if that portion were a complete audio prompt.
Typically, when a non-streaming telephony audio interface receives a first portion of an audio prompt, the interface enters a play mode during which no further portions of the audio prompt can be received. In fact, such an audio interface cannot be provided with an additional portion of the audio prompt until it has finished playing the first portion and has issued a message notifying the audio source that it is ready to receive the next portion. Issuance of the message to the audio source can take approximately 10–20 milliseconds. In consequence, spaces or periods of silence frequently occur between successive audio portions of a single, complete audio prompt when played through a conventional non-streaming telephony audio interface. The periods of silence within the audio prompt are due at least in part to three delays: the time for the telephony audio interface to issue the ready message after playing a portion of audio, the network latency for routing the message from the telephony audio interface to the audio source, and the time needed to send the next portion of the audio prompt from the audio source to the interface.
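As a rough illustration of how these delays accumulate, the following sketch models the playback timeline of a prompt delivered in several portions. The 15 millisecond ready-message time is taken from the middle of the approximate 10–20 millisecond range stated above; the network and send delays, the portion durations, and all names (`GapModel`, `playback_timeline`) are assumptions chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class GapModel:
    ready_msg_ms: float = 15.0  # time to issue the ready message (about 10-20 ms per the text)
    network_ms: float = 5.0     # ASSUMED latency routing the message to the audio source
    send_ms: float = 5.0        # ASSUMED time to deliver the next portion to the interface

    @property
    def gap_ms(self) -> float:
        # Silence heard between two successive portions.
        return self.ready_msg_ms + self.network_ms + self.send_ms


def playback_timeline(portion_ms, model):
    """Return ([(start_ms, end_ms), ...], total_silence_ms) for the portions."""
    timeline = []
    clock = 0.0
    for index, duration in enumerate(portion_ms):
        if index > 0:
            # The next portion cannot arrive until the previous one has
            # finished playing and the ready message has round-tripped.
            clock += model.gap_ms
        timeline.append((clock, clock + duration))
        clock += duration
    total_silence = model.gap_ms * max(len(portion_ms) - 1, 0)
    return timeline, total_silence


# One prompt streamed as three portions of 300, 120, and 480 ms (assumed sizes).
timeline, silence = playback_timeline([300, 120, 480], GapModel())
```

Under these assumed figures each boundary between portions adds 25 milliseconds of silence, so the three-portion prompt carries 50 milliseconds of silence in total and finishes at 950 milliseconds rather than 900. The total grows linearly with the number of portions, which is why a prompt streamed in many small portions suffers most.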
Several disadvantages can result when silence is disposed between successive audio portions of an audio prompt. One disadvantage is that callers frequently experience a significant amount of latency when listening to streamed audio as played through a legacy audio interface. “Latency”, as defined herein, refers to the time required for a caller to receive a response from an audio source such as an IVR application after the caller speaks a command.
Another disadvantage of using a legacy telephony audio interface to process streamed audio is that the silence interspersed between successive audio portions of a single, complete audio prompt can make the prompt sound unnatural. In some cases, the resulting audio not only sounds unnatural, but may be perceived by callers as being distorted. Such is the case, for example, where an individual word is broken between two successive audio portions of a single audio prompt. In that case, the audio interface plays one audio portion of the complete prompt, which ends in the middle of a word, pauses, and then plays the subsequent audio portion, which begins with the second half of the broken word. When played in this manner, the resulting audio prompt sounds distorted.