The present invention relates to Automatic Speech/Speaker Recognition (ASR) and, more particularly, ASR over wireless communications channels.
Automatic Speech/Speaker Recognition (ASR) has become ever more prevalent with improvements in hardware, modeling and recognition algorithms. Among many important applications of ASR technology are those in the telephone and other communications arts. For example, the use of ASR has proven valuable in providing directory assistance, automatic calling and other voice telephony applications over wire circuits. In a parallel area of development, the use of cellular systems, personal communications systems (PCS) and other wireless systems (collectively referred to as “wireless” in the sequel) has continued to proliferate. It is natural, therefore, to seek to apply improvements in ASR achieved in wired systems to wireless systems as well.
ASR over wireless channels is problematic because of the additional noise and distortion introduced into voice signals during the coding, transmission (e.g., due to fading or packet loss), and decoding stages. Noise-degraded voice signals present in wireless environments are often substantially different from the original voice signal, leading to degradation in ASR performance when standard ASR techniques are applied. This problem has become acute as attempts are made to create advanced ASR-based services, such as intelligent agent services or large vocabulary speech recognition services, over digital wireless channels. Previous approaches have mainly focused on noise reduction techniques, but the results are far from ideal and of limited applicability because of the many variations in wireless environments (e.g., TDMA, CDMA, GSM, etc.).
Recent studies found that if the feature vectors for ASR purposes are extracted at the handset and transmitted digitally through a secondary digital channel, there is almost no degradation in ASR performance in the wireless environment as compared to the wired telephone network. A typical prior art dual channel system is illustrated in FIG. 1. There, a cellular handset 101 is employed by a mobile user to encode normal speech and transmit the coded signal, including relevant coder parameters, through primary (voice) channel 105 to cellular base station 120. Base station 120 then decodes the received coded signal to produce a voice output suitable for communication over the public switched telephone network (PSTN), or other voice communications network as represented by public switch 130 and its output to a network. FIG. 1 also shows the generation at the cellular handset 101 of a second set of signals corresponding to the ASR parameters to be used by an ASR application. This second set of signals is transmitted over a second digital channel 110 to cellular base station 120, where they are forwarded to ASR system 140.
The experimental use of systems of the type shown in FIG. 1 has generated interest in creating a standard ASR feature set which can be extracted at the handset and sent through a wireless network as a digital signal using a secondary digital link. Since the bit rate for ASR feature vector transmission can be quite low (less than 4 Kb/s), it is possible to use a secondary digital link such as that proposed for inclusion in new wireless standards such as IS-134. Although this secondary channel solution seems promising, it has a number of serious drawbacks. In particular, this approach requires:
1. A new standard and major changes in communication protocols. Even so, incompatibilities with many current wireless communication standards would require modifications or abandonment of existing standards-compliant network equipment.
2. Extra bandwidth to transmit ASR feature vectors from the handset to the base-station. Synchronizing the primary digital channel for the transmission of voice and the secondary digital channel for the transmission of the extracted ASR feature vectors can also be a serious problem.
3. Major changes to current handsets.
4. A variety of dual-channel solutions. That is, dependence on particular present wireless standards or formats (CDMA, TDMA, GSM, IS-94, IS-134, etc.) and associated signaling and modulation schemes makes a universal solution impractical for all available standards.
5. High initial investment to introduce services based on this technique.
The limitations of the prior art are overcome and a technical advance is achieved in systems and methods for efficiently and economically enabling ASR capabilities in wireless contexts as described below in connection with illustrative embodiments.
Thus, in accordance with one aspect of the present invention, reliable ASR feature vector sequences are derived at a base station (or other network or system unit) directly from the digitally transmitted speech coder parameters. In many applications the ASR functions are performed at a public switch or elsewhere in a network. With this approach, a novel ASR feature extractor operates on the received speech coder parameters from the handset with no additional processing or signal modification required at the handset. Thus, speech coder parameters received at a base station are used not only for reproducing the voice signal, as at present, but also for generating the feature vector sequence for ASR applications.
An illustrative ASR feature vector extractor at the base station operates on the digitally transmitted speech coder parameters before these parameters are converted back to a voice signal, and so avoids the lossy conversion process and associated voice distortion. In using embodiments of the present invention, there is no need to modify wireless handsets, since the ASR feature vectors are derived from the same set of speech coder parameters ordinarily extracted at the handset. Therefore, existing handsets provide a front end for the ASR feature vector extractor at the base station.
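As a hedged illustration of deriving features from coder parameters without reconstructing the waveform, the sketch below assumes that the received coder parameters include linear prediction (LPC) coefficients and applies the standard LPC-to-cepstrum recursion. The text above does not state that this is the conversion used; the function name lpc_to_cepstrum, the sign convention H(z) = 1 / (1 - sum_k a_k z^-k), and the choice of cepstral features are all assumptions made for the example.

```python
from typing import List, Sequence


def lpc_to_cepstrum(a: Sequence[float], n_ceps: int) -> List[float]:
    """Convert LPC predictor coefficients to cepstral coefficients.

    Assumed convention (not taken from the source text): the all-pole
    model is H(z) = 1 / (1 - sum_{k=1..p} a_k z^-k), with a[0] holding
    a_1 and a[p-1] holding a_p. Returns c_1..c_{n_ceps}; the gain term
    c_0 is omitted.
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] is left unused
    for n in range(1, n_ceps + 1):
        # Direct term a_n exists only while n <= p.
        acc = a[n - 1] if n <= p else 0.0
        # Recursive term: sum over k of (k/n) * c_k * a_{n-k},
        # restricted to k for which a_{n-k} exists (n - k <= p).
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]


if __name__ == "__main__":
    # Hypothetical coefficients for one frame, chosen only for illustration.
    print(lpc_to_cepstrum([0.5, 0.25], 4))
```

For a single-pole model with a_1 = 0.5, the recursion reproduces the closed form c_n = 0.5^n / n, which is a convenient sanity check. No waveform is synthesized at any point, which is the property the paragraph above relies on.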
Moreover, the connection from the handset to the base station in digital wireless environments is all-digital and includes error protection for data signals communicated to a base station. Therefore, the transmission from the handset to the present inventive feature extractor at a base-station or other location has the same digital transmission quality as in secondary channel schemes.
Although speech coder parameters are very different from the feature vectors needed for ASR purposes, the present invention provides illustrative techniques for realizing a speech feature extractor based on normal speech coder parameters. Further, in accordance with another aspect of the present invention, perfect synchronization of the (decoded) voice signal and the ASR feature vector signal is provided without additional signal synchronization bits. This is possible, as disclosed in illustrative embodiments of the present invention, because both the voice signal and ASR feature vector signal are generated from the same speech coder parameters.
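The synchronization argument can be sketched in a few lines, under stated assumptions: each received coder frame is handed to both a speech decoder and a feature extractor, so the two outputs share a frame index by construction and no extra synchronization bits are involved. The names CoderFrame, fan_out, decode and extract are hypothetical and stand in for whatever decoder and extractor a given system uses.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class CoderFrame:
    """One received frame of speech coder parameters (hypothetical layout)."""
    index: int
    params: Tuple[float, ...]


def fan_out(
    frames: Iterable[CoderFrame],
    decode: Callable[[Tuple[float, ...]], Any],
    extract: Callable[[Tuple[float, ...]], Any],
) -> Iterator[Tuple[int, Any, Any]]:
    """Yield (frame index, decoded voice, ASR features) for each frame.

    Both outputs come from the same params tuple, so they are aligned
    frame by frame without any added synchronization information.
    """
    for frame in frames:
        yield frame.index, decode(frame.params), extract(frame.params)


if __name__ == "__main__":
    # Stand-in decoder and extractor, used only to show the data flow.
    frames = [CoderFrame(i, (0.1 * i, 0.2 * i)) for i in range(3)]
    for idx, voice, feats in fan_out(frames, decode=sum, extract=list):
        print(idx, voice, feats)
```

The design choice illustrated here is that alignment is a property of the data flow, not of a separate timing mechanism.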
Overall, the present invention provides systems and methods for enhanced ASR with no need for a secondary channel and no major changes to current wireless standards. Changes, extensions and operational differences at base stations are also minimal. Advantageously, the digital channel for ASR applications is created (through modifications to software) as a second destination for a voice call.
Alternative embodiments perform the ASR feature extraction and ASR functions at a switch connected (directly or through network connections) to the receiving base station. In yet other embodiments the coded speech signals received at a base station from the transmitting handset are forwarded (with or without decoded speech signals) to a network location, including a terminal or storage system.