1. Field of the Invention
The field of the invention is data processing, or, more specifically, methods, apparatus, and products for providing voice over the Internet Protocol (‘VOIP’) barge-in support for a half-duplex distributed speech recognition (‘DSR’) client on a full-duplex network.
2. Description of Related Art
User interaction with applications running on small devices through a keyboard or stylus has become increasingly limited and cumbersome as those devices have become smaller. In particular, small handheld devices like mobile phones and PDAs serve many functions and contain sufficient processing power to support user interaction through other modes, such as multimodal access. Devices which support multimodal access combine multiple user input modes or channels in the same interaction, allowing a user to interact with the applications on the device simultaneously through multiple input modes or channels. The methods of input include speech recognition, keyboard, touch screen, stylus, mouse, handwriting, and others. Multimodal input often makes using a small device easier.
Multimodal applications often run on servers that serve up multimodal web pages for display on a multimodal browser. A ‘multimodal browser,’ as the term is used in this specification, generally means a web browser capable of receiving multimodal input and interacting with users with multimodal output. Multimodal browsers typically render web pages written in XHTML+Voice (‘X+V’). X+V provides a markup language that enables users to interact with a multimodal application, often running on a server, through spoken dialog in addition to traditional means of input such as keyboard strokes and mouse pointer action. X+V adds spoken interaction to standard web content by integrating XHTML (eXtensible Hypertext Markup Language) and speech recognition vocabularies supported by VoiceXML. For visual markup, X+V includes the XHTML standard. For voice markup, X+V includes a subset of VoiceXML. For synchronizing the VoiceXML elements with corresponding visual interface elements, X+V uses events. X+V includes voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars. Voice handlers can be attached to XHTML elements and respond to specific events. Voice interaction features are integrated with XHTML and can consequently be used directly within XHTML content.
The performance of speech recognition systems receiving speech that has been transmitted over voice channels, particularly mobile channels, can be significantly degraded when compared to using an unmodified signal. The degradation results from both relatively low bit rate speech coding and channel transmission errors. A Distributed Speech Recognition (‘DSR’) system addresses these problems by eliminating the speech channel and instead using an error-protected data channel to send a parameterized representation of the speech, which is suitable for recognition. The processing is distributed between a terminal or client device and a voice server. The client performs the feature parameter extraction, which is the front end of the speech recognition function. The speech features are then transmitted over a data channel to a remote back-end recognizer or speech recognition engine on a voice server. This architecture substantially reduces transmission channel effects on speech recognition performance.
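For further explanation, the following is a minimal sketch in Python of the division of labor just described: the client frames the captured audio and reduces each frame to a feature value, and only those values travel over the data channel to the server. The sketch makes several assumptions that are not part of this specification. The frame sizes, the function names, and the wire format are hypothetical, and a single log-energy value per frame stands in for the richer feature vectors (mel-frequency cepstral coefficients, for example) that an actual DSR front end would compute.

```python
import math
import struct

FRAME_SIZE = 200  # samples per frame (25 ms at 8 kHz); an assumed value
FRAME_STEP = 80   # samples between frame starts (10 ms at 8 kHz); an assumed value


def frame_signal(samples, size=FRAME_SIZE, step=FRAME_STEP):
    """Split a sequence of PCM samples into overlapping fixed-size frames."""
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, step)]


def log_energy(frame):
    """Return the natural log of the frame energy, floored to avoid log(0)."""
    energy = sum(s * s for s in frame)
    return math.log(max(energy, 1e-10))


def extract_features(samples):
    """Client-side front end: one (illustrative) feature value per frame."""
    return [log_energy(frame) for frame in frame_signal(samples)]


def pack_features(features):
    """Serialize features as big-endian 32-bit floats for the data channel."""
    return struct.pack(f">{len(features)}f", *features)


def unpack_features(payload):
    """Server-side counterpart: recover the feature values from a payload."""
    count = len(payload) // 4
    return list(struct.unpack(f">{count}f", payload))
```

In this sketch the recognizer on the voice server would consume the output of `unpack_features`; the raw audio samples never leave the client, so speech coding at a low bit rate does not affect what the recognizer sees.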
In many instances, however, the client device is capable of half-duplex communication only. That is, the client device may accept audio input from a user or deliver audio output to a user but cannot do both at the same time. The data communications network that connects the half-duplex DSR client to a voice server typically is a full-duplex network, capable of sending data to and receiving data from a client device at the same time. Barge-in is a feature of a DSR system that allows a user to interrupt at any time while a prompt is playing, permitting natural exchange between an experienced user and the DSR system. In a DSR system that supports half-duplex DSR clients, barge-in support is a challenge because audio output, from TTS conversion on a voice server, for example, may be played to a user despite the fact that the user attempted a barge-in, because the half-duplex DSR client can send or receive audio, but not both at the same time.
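For further explanation, the barge-in difficulty described above can be illustrated with a toy model in Python of a half-duplex client. The class, its method names, and the string stand-ins for audio are all hypothetical and chosen only for illustration; the sketch models the problem, not any particular solution to it.

```python
from enum import Enum


class Mode(Enum):
    IDLE = "idle"
    PLAYING = "playing"      # delivering audio output to the user
    RECORDING = "recording"  # accepting audio input from the user


class HalfDuplexClient:
    """Toy model of a client whose audio path runs in one direction at a time."""

    def __init__(self):
        self.mode = Mode.IDLE
        self.played = []    # prompt audio delivered to the user
        self.captured = []  # user speech accepted for recognition
        self.dropped = []   # user speech lost because a prompt was playing

    def start_prompt(self):
        self.mode = Mode.PLAYING

    def end_prompt(self):
        self.mode = Mode.IDLE

    def receive_prompt_audio(self, chunk):
        """Play prompt audio from the network only while in PLAYING mode."""
        if self.mode is Mode.PLAYING:
            self.played.append(chunk)
            return True
        return False

    def user_speaks(self, utterance):
        """Capture user speech only when no prompt is being played."""
        if self.mode is Mode.PLAYING:
            self.dropped.append(utterance)
            return False
        self.mode = Mode.RECORDING
        self.captured.append(utterance)
        return True


client = HalfDuplexClient()
client.start_prompt()
client.receive_prompt_audio("Welcome to ")
client.user_speaks("skip")  # attempted barge-in while the prompt plays
client.receive_prompt_audio("the main menu.")
client.end_prompt()
client.user_speaks("skip")  # accepted only after the prompt has ended
```

In this model the first utterance is lost and the remainder of the prompt is still played, even though the network could have carried the user's speech to the voice server at the same moment the prompt audio was arriving.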