1. Field of the Invention
The field of the invention is data processing, or, more specifically, methods, apparatus, and products for synchronizing distributed speech recognition.
2. Description of Related Art
User interaction with applications running on small devices through a keyboard or stylus has become increasingly limited and cumbersome as those devices have become smaller. In particular, small handheld devices like mobile phones and PDAs serve many functions and contain sufficient processing power to support user interaction through other modes, such as multimodal access. Devices which support multimodal access combine multiple user input modes or channels in the same interaction, allowing a user to interact with the applications on the device simultaneously through multiple input modes or channels. The methods of input include speech recognition, keyboard, touch screen, stylus, mouse, handwriting, and others. Multimodal input often makes using a small device easier.
Multimodal applications often run on servers that serve up multimodal web pages for display on a multimodal browser. A ‘multimodal browser,’ as the term is used in this specification, generally means a web browser capable of receiving multimodal input and interacting with users through multimodal output. Multimodal browsers typically render web pages written in XHTML+Voice (‘X+V’). X+V provides a markup language that enables users to interact with a multimodal application, often running on a server, through spoken dialog in addition to traditional means of input such as keyboard strokes and mouse pointer action. X+V adds spoken interaction to standard web content by integrating XHTML (eXtensible Hypertext Markup Language) and speech recognition vocabularies supported by VoiceXML. For visual markup, X+V includes the XHTML standard. For voice markup, X+V includes a subset of VoiceXML. For synchronizing the VoiceXML elements with corresponding visual interface elements, X+V uses events. X+V includes voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars. Voice handlers can be attached to XHTML elements and respond to specific events. Voice interaction features are integrated with XHTML and can consequently be used directly within XHTML content.
The performance of speech recognition systems receiving speech that has been transmitted over voice channels, particularly mobile channels, can be significantly degraded compared with recognition of an unmodified signal. The degradation results from both relatively low bit rate speech coding and channel transmission errors. A Distributed Speech Recognition (‘DSR’) system addresses these problems by eliminating the speech channel and instead using an error protected data channel to send a parameterized representation of the speech, which is suitable for recognition. The processing is distributed between the terminal (‘DSR client’) and a voice server. The DSR client performs the feature parameter extraction, that is, the front-end of the speech recognition function. The speech features are then transmitted over a data channel to a remote “back-end” recognizer on a voice server. This architecture substantially reduces transmission channel effects on speech recognition performance.
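The division of labor described above can be illustrated with a minimal Python sketch. It is not the front-end of any particular DSR standard: the frame size, the use of log energy as the only feature, the wire format, and the function names `extract_features`, `pack_features`, and `unpack_features` are all assumptions made for illustration. The client reduces raw samples to a parameterized representation and serializes it for a data channel; the back-end recovers the features.

```python
import math
import struct

# Assumption: 160 samples per frame (20 ms at 8 kHz), chosen only for illustration.
FRAME_SIZE = 160


def extract_features(samples, frame_size=FRAME_SIZE):
    """Client side: split PCM samples into frames, one log-energy value per frame.

    A real DSR front-end would compute richer features (for example cepstral
    coefficients); log energy keeps this sketch short.
    """
    features = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(s * s for s in frame) / frame_size
        features.append(math.log(energy + 1e-10))  # small offset avoids log(0)
    return features


def pack_features(features):
    """Client side: serialize as a 2-byte count followed by float32 values.

    This wire format is hypothetical, not taken from any DSR specification.
    """
    return struct.pack("!H%df" % len(features), len(features), *features)


def unpack_features(payload):
    """Server side: inverse of pack_features, run before back-end recognition."""
    (count,) = struct.unpack_from("!H", payload, 0)
    return list(struct.unpack_from("!%df" % count, payload, 2))
```

In this sketch, two frames of audio (320 samples) travel as a 10-byte payload, which illustrates why sending features over a data channel can be cheaper than sending coded speech.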
Naively connecting a DSR client to a voice server, however, still presents challenges. One troubling issue is that data communications connections for voice, such as, for example, SIP/RTP connections or H.323 connections, typically experience significant delay that itself varies over time. Such delay produces echo effects and also causes the echo of a voice prompt to be fed back to the voice server, thereby degrading speech recognition accuracy. Perfect echo removal on the DSR client may mitigate this issue, but perfect echo removal is very difficult to achieve. Estimating the delay and starting recognition only after the channel delay has elapsed, for example, is problematic because of the stochastic nature of packet switching; the delay varies from packet to packet of speech data.
In addition, when accessing the service over limited bandwidth networks such as cellular networks, it is useful to send a minimal amount of data so as to conserve bandwidth. Voice Activity Detection (‘VAD’) and discontinuous transmission (‘DTX’) are prior art methods of conserving bandwidth, but neither is effective unless the voice server is waiting for speech data. A common solution to both problems, echo prevention and bandwidth conservation, is push to talk (‘PTT’) switching, which sends speech data only when a switch is manually pressed. This solution is not acceptable in many multimodal applications because of user experience issues. X+V provides a <sync> markup element for use in synchronizing the receipt of spoken information and visual elements. The <sync> element, however, is of no use in synchronizing voice server readiness with the beginning of speech from a user for purposes of echo elimination and bandwidth conservation. For all these reasons, there is an ongoing need for improvement in synchronizing distributed speech recognition.
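The prior art VAD and DTX techniques mentioned above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the algorithm of any particular codec or standard: the energy threshold, the hangover length, and the function names `frame_energy`, `is_speech`, and `dtx_filter` are hypothetical. The sketch transmits a frame only when it is judged to contain speech, or when it falls within a short hangover period after speech.

```python
def frame_energy(frame):
    """Mean squared amplitude of one non-empty frame of PCM samples."""
    return sum(s * s for s in frame) / len(frame)


def is_speech(frame, threshold=1.0e5):
    """Energy-threshold VAD; the threshold value is an illustrative assumption."""
    return frame_energy(frame) > threshold


def dtx_filter(frames, threshold=1.0e5, hangover=2):
    """Yield (index, frame) only for frames worth transmitting.

    Frames are sent while speech is detected and for `hangover` frames
    afterwards, so that trailing low-energy speech is not clipped.
    All other frames are dropped, which is the bandwidth saving of DTX.
    """
    remaining = 0
    for index, frame in enumerate(frames):
        if is_speech(frame, threshold):
            remaining = hangover
            yield index, frame
        elif remaining > 0:
            remaining -= 1
            yield index, frame
```

Note that the sketch decides what to send purely from the client's audio; it has no knowledge of whether the voice server is waiting for speech data, which is the limitation identified above.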