Speech recognition systems are generally known in the art, particularly in relation to telephony systems. U.S. Pat. Nos. 4,914,692; 5,475,791; 5,708,704; and 5,765,130 illustrate exemplary telephone networks that incorporate speech recognition systems. A common feature of such systems is that the speech recognition element (i.e., the device or devices performing speech recognition) is typically centrally located within the fabric of the telephone network, as opposed to at the subscriber's communication device (i.e., the user's telephone). In a typical application, a combination of speech synthesis and speech recognition elements is deployed within a telephone network or infrastructure. Callers may access the system and, via the speech synthesis element, be presented with informational prompts or queries in the form of synthesized or recorded speech. A caller will typically provide a spoken response to the synthesized speech and the speech recognition element will process the caller's spoken response in order to provide further service to the caller.
Given human nature and the design of some speech synthesis/recognition systems, the spoken responses provided by a caller will often occur during the presentation of an output audio signal, for example, a synthesized speech prompt. The processing of such occurrences is often referred to as “barge-in” processing. U.S. Pat. Nos. 4,914,692; 5,155,760; 5,475,791; 5,708,704; and 5,765,130 all describe techniques for barge-in processing. Generally, the techniques described in each of these patents address the need for echo cancellation during barge-in processing. That is, during the presentation of a synthesized speech prompt (i.e., an output audio signal), the speech recognition system must account for residual artifacts from the prompt being present in any spoken response provided by the user (i.e., an input speech signal) in order to effectively perform speech recognition analysis. Thus, these prior art techniques are generally directed to the quality of input speech signals during barge-in processing. Due to the relatively small latencies or delays found in voice telephony systems, these prior art techniques generally are not concerned with context determination aspects of barge-in processing, i.e., correlating an input speech signal to a particular output audio signal or to a particular moment within an output audio signal.
This deficiency of the prior art is even more pronounced with regard to wireless systems. Although a substantial body of prior art exists regarding telephony-based speech recognition systems, the incorporation of speech recognition systems into wireless communication systems is a relatively new development. In an effort to standardize the application of speech recognition in wireless communication environments, work has recently been initiated by the European Telecommunications Standards Institute (ETSI) on the so-called Aurora Project. A goal of the Aurora Project is to define a global standard for distributed speech recognition systems. Generally, the Aurora Project is proposing to establish a client-server arrangement in which front-end speech recognition processing, such as feature extraction or parameterization, is performed within a subscriber unit (e.g., a hand-held wireless communication device such as a cellular telephone). The data provided by the front-end would then be conveyed to a server to perform back-end speech recognition processing.
It is anticipated that the client-server arrangement being proposed by the Aurora Project will adequately address the needs for a distributed speech recognition system. However, it is uncertain at this time how barge-in processing will be addressed, if at all, by the Aurora Project. This is a particular concern given the wider variation in latencies typically encountered in wireless systems and the effect that such latencies could have on barge-in processing. For example, it is not uncommon for the processing of a user's speech-based response to be based in part upon the particular point in time at which it was received by the speech recognition processor. That is, it can make a difference whether a user's response is received during a particular part of a given synthesized prompt or, if a series of discrete prompts are provided, during which prompt the response was received. In short, the context of a user's response can be as equally important as recognizing the informational content of the user's response. However, the uncertain delay characteristics of some wireless systems stands as an impediment to properly determining such contexts. Thus, it would be advantageous to provide techniques for determining a context of an input speech signal during the presentation of an output audio signal, particularly in systems having uncertain and/or widely varying delay characteristics, such as those utilizing packet data communications.