The present invention relates generally to the field of automatic speech recognition, and more particularly to a method and apparatus for providing improved speech recognition system performance in a distributed automatic speech recognition system for use over wireless channels.
The task of automatic speech recognition comprises the automated identification of words or phrases which have been spoken by an individual, typically in order to enable an automated system to take certain (automated) actions in response thereto (e.g., to control a system by voice input). One speech recognition scenario receiving a great deal of recent attention involves performing automatic speech recognition (ASR) in environments which employ a wireless (e.g., cellular) communication channel. Such ASR over wireless/cellular networks has become increasingly important in the design of next generation wireless multimedia systems. In particular, a variety of spoken dialogue system applications which utilize ASR technology already exist today. These include, inter alia, personal assistants, speech portals, travel reservation systems, stock quote systems, etc. And the number of such applications which are being implemented specifically for use with mobile telephones in automobiles, for example, as well as for other wireless devices, is also increasing rapidly.
Conventionally, when automatic speech recognition functions were intended to be applied in a wireless environment, the entire speech recognition process was typically placed at the receiving end of the communications channel. In particular, conventional speech coding techniques were employed for transmitting the speech over the wireless channel, and only then (at the receiving end) was the speech recognition process performed, and typically, only after the encoded speech was decoded at the receiving end. Specifically, an encoding of the speech signal was performed at the wireless device, the encoded signal was transmitted across the wireless channel, the signal was decoded at the receiving end of the wireless channel (e.g., at the base station) to "reconstruct" the original speech, and finally, the automatic speech recognition process was initiated on the reconstructed speech in a totally conventional manner (i.e., as if no wireless channel transmission had been performed at all). Most commonly this approach was employed as a matter of necessity, because the computational complexity of performing the speech recognition process in the wireless device itself was prohibitive.
More recently, however, one particularly intriguing approach to the problem of ASR over a wireless channel which has been investigated involves the use of what has been referred to as a "distributed" ASR system. By "distributed" we mean that the functions which need to be performed in order to effectuate the speech recognition process are divided and separately located at the two "ends" of the wireless channel: some of the functions are located at the transmitting end of the channel (e.g., at the wireless device itself), and some are located at the receiving end of the wireless communication channel (e.g., at the base station). Such an approach allows users to share expensive resources on a centralized server, which usually provides extensive processing power and memory. Moreover, the distributed system design enables centralized installation and maintenance of ASR software and frees the user from difficult installation and maintenance procedures. The alternative approach of performing speech recognition locally on the wireless device significantly increases computation, power and memory requirements for the device, and limits portability across languages and application domains. With today's technology, only speech recognition systems with a very limited vocabulary such as, for example, speaker-trained name dialing, can practically reside on the handset, while the great majority of applications must reside on the network server.
More particularly, in accordance with one such distributed ASR scenario, a small client program running in the wireless device extracts representative parameters of the speech signal (usually referred to in the ASR art as "features") from the mobile terminal and transmits these parameters over the wireless channel to a speech recognition server. The server may, for example, be a multi-user server which performs speech recognition tasks for a plurality of distinct mobile terminals. In any event, at the server, automatic speech recognition is performed based on these features in an otherwise conventional manner, such as, for example, with use of hidden Markov models (HMMs), all of which is fully familiar to those of ordinary skill in the art.
In addition, one of the well-known complexities of wireless communication technology in general results from the problem of transmission errors which are invariably encountered when data is transmitted across a wireless channel. As a result, a great deal of attention has been recently given to the problem of error detection and error correction in a wireless transmission environment. Specifically, a wide variety of channel coding schemes have been developed, each providing various levels of error detection and correction capability at a given cost in additional bits which must be transmitted across the wireless channel. Although this issue has been studied extensively, it is invariably the case that the goal of such error mitigation strategies is to initially detect, and then, where possible, to eliminate the effects of such transmission errors. However, in many cases, these errors cannot be totally eliminated, but rather, the wireless receiver (e.g., the base station) may be presented with transmitted data of questionable reliability. In such cases, prior art wireless systems (whether used for ASR or not) would most typically either assume the data to be correct (despite having recognized that there is a significant probability that it is not), or else would consider the data to be totally unreliable and therefore "lost" (or "erased"), and would therefore simply discard it.
In accordance with the principles of the present invention, it has been recognized that certain channel coding schemes can advantageously provide not only error detection and correction capabilities, but also probabilistic information concerning the likelihood that a given portion of the data has been accurately decoded to a particular value. More specifically, such schemes can be used to provide probabilistic accuracy information for the decoded bits. Based on this recognition, the present invention provides a method and apparatus for performing automatic speech recognition in a distributed ASR system for use over a wireless channel which takes advantage of such probabilistic information. That is, in accordance with an illustrative embodiment of the present invention, accuracy probabilities for the decoded features are advantageously computed and employed to improve speech recognition performance under adverse channel conditions (i.e., in the presence of transmission errors or losses).
Specifically, and in accordance with one illustrative embodiment of the present invention, the bit error probabilities for each of the bits which are used to encode a given ASR feature are used to compute the confidence level that the system may have in the decoded value of that feature. Features that have been corrupted with high probability are advantageously either not used or, more generally, weighted less in the acoustic distance computation performed by the speech recognizer. This novel approach to acoustic decoding is referred to herein as "soft feature decoding," and produces dramatic improvements in ASR performance under certain adverse channel conditions.
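By way of illustration only, the soft feature decoding idea described above may be sketched as follows. The sketch is not the claimed implementation and rests on assumptions not stated in the text: bit errors are treated as independent, so that the confidence in a feature is taken as the product of the per-bit probabilities of correct decoding; and the recognizer is assumed to score each frame with a diagonal-covariance Gaussian whose per-feature log-likelihood terms are scaled by that confidence. The function names `feature_confidence` and `soft_log_likelihood` are hypothetical.

```python
import math
from typing import Sequence


def feature_confidence(bit_error_probs: Sequence[float]) -> float:
    """Confidence that one decoded feature is correct.

    Assumption (not from the source text): bit errors are independent,
    so the feature is correct only if every one of its bits is correct.
    """
    confidence = 1.0
    for p_err in bit_error_probs:
        confidence *= 1.0 - p_err
    return confidence


def soft_log_likelihood(
    features: Sequence[float],
    confidences: Sequence[float],
    means: Sequence[float],
    variances: Sequence[float],
) -> float:
    """Confidence-weighted log-likelihood of one frame.

    Assumption: a diagonal-covariance Gaussian acoustic model. Each
    feature's log-likelihood term is multiplied by its confidence, so a
    confidence of 0 drops the feature entirely and a confidence of 1
    leaves the conventional score unchanged.
    """
    total = 0.0
    for x, conf, mu, var in zip(features, confidences, means, variances):
        log_p = -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
        total += conf * log_p
    return total


# Hypothetical frame with two features: the first was received over
# reliable bits, the second over bits that are very likely corrupted.
confidences = [
    feature_confidence([0.01, 0.02]),
    feature_confidence([0.5, 0.5, 0.4]),
]
score = soft_log_likelihood(
    features=[0.3, 7.5],
    confidences=confidences,
    means=[0.0, 0.0],
    variances=[1.0, 1.0],
)
```

In this sketch the second feature, which lies far from the model mean but carries a low confidence, contributes far less to the frame score than it would under a conventional, unweighted computation. Other weighting rules are equally consistent with the description above.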
More specifically, the present invention provides a method and apparatus for performing automatic speech recognition, the method comprising the steps of receiving a set of encoded speech features, the encoded speech features having been transmitted across a communications channel; decoding the set of encoded speech features to generate one or more decoded speech features and one or more probability measures associated therewith, each probability measure comprising an estimate of a likelihood that the decoded speech feature corresponding thereto has been accurately transmitted and decoded; and performing speech recognition based upon said one or more decoded speech features and on said one or more probability measures associated therewith.