1. Technical Field
This invention relates to the field of speech enabled computing and more particularly to a device-independent system, method and apparatus for linking a speech driven application to specific audio input and output devices.
2. Description of the Related Art
Speech driven applications differ from traditional GUI based applications in that speech driven applications handle audio data for both input and output. Typically, GUI based applications rely on an input device, such as a mouse or keyboard, for input and on a visual display, such as a monitor, for output. In contrast, speech driven applications rely on an audio input device, such as a microphone, for input and on an audio output device, such as speakers, for output. Typically, audio input data received from the audio input device can be provided via audio circuitry to a speech recognition engine for conversion to computer recognizable text. Similarly, computer recognizable text originating in the speech driven application can be provided to a text-to-speech engine for conversion to audio output data to be provided via audio circuitry to the audio output device.
Presently, speech driven applications require audio input data received from an audio input device to be in a media format suitable for use with a corresponding speech recognition engine. Likewise, speech driven applications require audio output data generated by a text-to-speech engine and provided to an audio output device to be in a media format specific to the audio output device. Yet, audio input and output devices can vary from transducer-type devices such as microphones and speakers to specialized audio circuitry and systems to distributed audio input and output devices remotely positioned across a network. Hence, speech driven application developers have been compelled to handle the receipt and transmission of audio data from and to varying audio input and output sources and corresponding media transport protocols on a case-by-case basis. As a result, substantial complexity necessarily is added to the speech driven application.
There have been several attempts to transport audio data to and from speech driven applications in a manner which frees the speech application developer from varying audio data transmission and receipt methods according to specific audio data input and output sources. Some examples include the multimedia API layer of the Microsoft Windows® operating system and the multimedia presentation manager of the IBM OS/2® operating system. However, both examples require highly complex interactions on the part of the speech application developer and neither permits a simple audio data stream-in/stream-out approach to the transmission and receipt of audio data from varying data sources. In addition, both examples are compiled solutions which are platform specific to a particular hardware configuration and a specific operating system.
The Java™ Media Framework (JMF™) represents one attempt to transport audio data to and from a speech driven application in a hardware and operating system neutral manner. JMF is fully documented in the Java Media Framework API Guide (JMF API Guide) published by Sun Microsystems, Inc. of Mountain View, Calif. on Nov. 19, 1999 (incorporated herein by reference) and the Java Media Framework Specification (JMF Specification) also published by Sun Microsystems, Inc. on Nov. 23, 1999 (incorporated herein by reference). As will be apparent from both the JMF API Guide and the JMF Specification, JMF, unlike previous operating system dependent solutions, is a Java-based platform independent solution; nevertheless, the use of JMF to provide audio data to and from a speech driven application remains a daunting task. In particular, JMF requires the speech driven application developer to specify several device-dependent parameters, for example media transport protocol, and media transport specific parameters, for example frame size and packet delay. Hence, a speech application developer using JMF must maintain an awareness of the device characteristics for the audio input and output sources.
For example, audio data transmitted in a European telephony network typically is A-law encoded. In contrast, audio data transmitted over a U.S. telephony network typically is μ-law encoded. As a result, in order for a JMF-based speech driven application to handle audio data transmitted over a European telephony network, proper settings consonant with the A-law encoding of audio data must be known by the speech driven application developer and specifically applied to the speech driven application in addition to other settings such as transport protocol and packet delay. Thus, what is needed is a device-independent system, method and apparatus for linking a speech driven application to specific audio input and output devices.
The present invention is an audio abstractor that provides a device independent approach to enable a speech driven application to receive and transmit digitized speech audio to and from audio input and output devices. In particular, the audio abstractor can provide a device-independent interface to speech driven applications through which the speech driven applications can access digitized speech audio from specific audio input and output devices without having to specify device-specific parameters necessary to interact with those specific audio input and output devices. Rather, the audio abstractor can be configured to interact with specific audio input and output devices, for example through a media framework, thereby off-loading from the speech driven application the complexity of audio device configuration.
A device-independent speech audio system for transparently linking a speech driven application to specific audio input and output devices can include a media framework for transporting digitized speech audio between speech driven applications and a plurality of audio input and output devices. The media framework can include selectable device-dependent parameters which can enable the transportation of the digitized speech to and from the plurality of audio input and output devices. The device-independent speech audio system also can include an audio abstractor configurable to provide specific ones of the selectable device-dependent parameters according to the specific audio input and output devices. Hence, the audio abstractor can provide a device-independent interface to the speech driven application for linking the speech driven application to the specific audio input and output devices.
In a representative embodiment of the present invention, the device-independent speech audio system can be used in conjunction with a speech recognition system. Accordingly, the device-independent speech audio system can further include a speech recognition engine communicatively linked to the device-independent interface of the audio abstractor. In consequence, the speech recognition engine can receive the digitized speech audio from a specific audio input device via the audio abstractor without specifying the specific ones of the device-dependent parameters. Also, the speech recognition engine can convert the received digitized speech audio to computer readable text. Finally, the speech recognition engine can provide the converted computer readable text to the speech driven application.
In another representative embodiment of the present invention, the device-independent speech audio system can be used in conjunction with a text-to-speech (TTS) engine. Accordingly, the device-independent speech audio system can further include a TTS engine communicatively linked to the device-independent interface of the audio abstractor. The TTS engine can convert computer readable text received from the speech driven application into the digitized speech audio. In consequence, the TTS engine can transmit the digitized speech audio to a specific audio output device via the audio abstractor without specifying the specific ones of the device-dependent parameters.
Notably, the interface of the present invention can include a device-independent method for opening a buffer for receiving the digitized speech audio from a specific audio input source. Similarly, the interface can include a device-independent method for opening a buffer for transmitting the digitized speech audio to a specific audio output source.
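By way of illustration only, the device-independent interface described above might be sketched in Java as follows. The names AudioAbstractor, InMemoryAbstractor and EchoApplication are hypothetical and do not appear in the JMF API Guide or the JMF Specification. The in-memory implementation stands in for a real audio device so that the sketch runs without audio hardware or a media framework, and it assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Hypothetical device-independent interface; the method names are illustrative. */
interface AudioAbstractor {
    /** Opens a buffer from which the caller reads digitized speech audio. */
    InputStream openInput() throws IOException;

    /** Opens a buffer into which the caller writes digitized speech audio. */
    OutputStream openOutput() throws IOException;
}

/**
 * Stand-in implementation backed by memory rather than a real device.
 * A real abstractor would pass its stored device settings to the media
 * framework when opening the buffers; the caller never supplies them.
 */
final class InMemoryAbstractor implements AudioAbstractor {
    private final String deviceSettings;   // held by the abstractor, hidden from callers
    private final byte[] capturedAudio;    // pretend input-device data
    private final ByteArrayOutputStream played = new ByteArrayOutputStream();

    InMemoryAbstractor(String deviceSettings, byte[] capturedAudio) {
        this.deviceSettings = deviceSettings;
        this.capturedAudio = capturedAudio.clone();
    }

    @Override
    public InputStream openInput() {
        return new ByteArrayInputStream(capturedAudio);
    }

    @Override
    public OutputStream openOutput() {
        return played;
    }

    /** Inspection hook: the bytes that reached the pretend output device. */
    byte[] playedAudio() {
        return played.toByteArray();
    }
}

/** A caller that uses only the device-independent methods. */
final class EchoApplication {
    static void run(AudioAbstractor audio) {
        try (InputStream in = audio.openInput();
             OutputStream out = audio.openOutput()) {
            in.transferTo(out);   // copy input audio to output (Java 9+)
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch the caller in EchoApplication names no encoding, sample rate or transport setting; any such settings are supplied once, when the abstractor is constructed.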
The device-dependent parameters can include an encoding type parameter. Furthermore, the device-dependent parameters can include sample rate; sample size; and, channels. Moreover, the device-dependent parameters can include byte order; and, signed/unsigned format. Finally, the device-dependent parameters can include frame size; frame rate; and, data type. Notably, in one embodiment of the invention, the media framework can be the Java Media Framework (JMF).
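For illustration, the device-dependent parameters listed above could be grouped into a single configuration object held by the audio abstractor. The following Java sketch is an assumption made for clarity: the class names DeviceParameters and DeviceProfiles, the device keys, and the field names are hypothetical. The 8 kHz, 8-bit, single-channel values are those commonly used for companded telephony audio, and the byte order and signed flags are placeholders, since they carry little meaning for 8-bit companded samples. The sketch assumes Java 9 or later.

```java
import java.util.Map;

/** Hypothetical value object mirroring the device-dependent parameters. */
final class DeviceParameters {
    final String encoding;        // encoding type, e.g. "ALAW" or "ULAW"
    final double sampleRate;      // samples per second
    final int sampleSizeBits;     // bits per sample
    final int channels;           // 1 = mono
    final boolean bigEndian;      // byte order
    final boolean signed;         // signed/unsigned format
    final int frameSizeBits;      // bits per frame
    final double frameRate;       // frames per second
    final Class<?> dataType;      // in-memory representation of the audio

    DeviceParameters(String encoding, double sampleRate, int sampleSizeBits,
                     int channels, boolean bigEndian, boolean signed,
                     int frameSizeBits, double frameRate, Class<?> dataType) {
        this.encoding = encoding;
        this.sampleRate = sampleRate;
        this.sampleSizeBits = sampleSizeBits;
        this.channels = channels;
        this.bigEndian = bigEndian;
        this.signed = signed;
        this.frameSizeBits = frameSizeBits;
        this.frameRate = frameRate;
        this.dataType = dataType;
    }
}

/** Hypothetical per-device profiles that an abstractor could be configured with. */
final class DeviceProfiles {
    private DeviceProfiles() {}

    /** Telephony-style profile: only the encoding differs between networks. */
    static DeviceParameters telephony(String encoding) {
        return new DeviceParameters(encoding, 8000.0, 8, 1,
                false, false, 8, 8000.0, byte[].class);
    }

    /** Illustrative device keys; a deployment would define its own. */
    static final Map<String, DeviceParameters> BY_DEVICE = Map.of(
            "eu-telephone", telephony("ALAW"),
            "us-telephone", telephony("ULAW"));
}
```

Under this arrangement, moving a speech driven application from a μ-law network to an A-law network would mean selecting a different profile in the abstractor, with no change to the application itself.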
In a representative embodiment of the present invention, the specific audio input and output devices can be remotely positioned from the speech driven application in a computer communications network. As such, the speech driven application can be employed in an IVR system in a node in a computer communications network. Where the specific audio input and output devices are remotely positioned from the speech driven application in a computer communications network, the specific audio input and output devices can be configured to place and receive telephone calls. In that regard, the telephone calls can be converted to digitized speech audio through a telephony interface to the computer communications network.
The present invention also can include a method for linking a speech driven application to specific audio input and output devices. Specifically, the method can include configuring an input buffer to receive digitized speech audio from a specific audio input device; configuring an output buffer to transmit digitized speech audio to a specific audio output device; providing device-independent methods for accessing the buffers; and, transporting digitized speech audio between the speech driven application and the specific audio input and output devices through the buffers via the device-independent methods. Notably, the speech driven application need not specify device-dependent parameters necessary to transport the digitized speech audio between the audio input and output sources. In a representative embodiment of the method of the invention, the step of configuring can include selecting in the device-independent methods at least one method in a media framework for configuring the buffers according to device-dependent parameters necessary to transport the digitized speech audio between the specific audio input and output devices.
In a representative embodiment of the present invention, the device-independent speech audio system can be used in conjunction with a speech recognition system. Accordingly, the method of the invention also can include communicatively linking a speech recognition engine to the input buffer; transporting the digitized speech audio from the specific audio input device to the speech recognition engine through the input buffer without specifying the device-dependent parameters; converting the digitized speech audio to text in the speech recognition engine; and, providing the converted text to the speech driven application.
In another representative embodiment of the present invention, the device-independent speech audio system can be used in conjunction with a text-to-speech (TTS) engine. Accordingly, the method of the invention can further comprise communicatively linking a TTS engine to the output buffer; converting computer readable text in the speech driven application to the digitized speech audio in the TTS engine; and, transporting the digitized speech audio from the TTS engine to the specific audio output device through the output buffer without specifying the device-dependent parameters.
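The method steps summarized above can be sketched end to end in Java, again purely as an illustration. The classes StubRecognizer, StubSynthesizer and SpeechPipelineDemo are hypothetical. The stub engines perform no real recognition or synthesis; they simply treat the audio bytes as UTF-8 text so that the flow through the input and output buffers can be followed. The sketch assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Stub recognition engine: treats the buffered audio bytes as UTF-8 text. */
final class StubRecognizer {
    static String recognize(InputStream inputBuffer) throws IOException {
        return new String(inputBuffer.readAllBytes(), StandardCharsets.UTF_8);
    }
}

/** Stub TTS engine: writes the text bytes to the output buffer as "audio". */
final class StubSynthesizer {
    static void speak(String text, OutputStream outputBuffer) throws IOException {
        outputBuffer.write(text.getBytes(StandardCharsets.UTF_8));
        outputBuffer.flush();
    }
}

final class SpeechPipelineDemo {
    /**
     * One dialog turn. The engines see only the two buffers, and the
     * application logic sees only text; no device parameter appears here.
     */
    static byte[] handleTurn(byte[] deviceAudio) {
        // Input buffer, standing in for one configured for a specific input device.
        InputStream inputBuffer = new ByteArrayInputStream(deviceAudio);
        // Output buffer, standing in for one configured for a specific output device.
        ByteArrayOutputStream outputBuffer = new ByteArrayOutputStream();
        try {
            String heard = StubRecognizer.recognize(inputBuffer);   // audio to text
            String reply = "You said: " + heard;                    // application logic
            StubSynthesizer.speak(reply, outputBuffer);             // text to audio
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return outputBuffer.toByteArray();
    }
}
```

In a fuller arrangement the two buffers would be obtained from the audio abstractor rather than created directly, which is where the device-dependent configuration would be applied.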