Voice recognition (VR), nowadays more precisely called speech recognition, refers to a technique enabling a device to recover linguistic information from user-voiced speech. Once the device recognizes the linguistic information, the device may act on the information or cause another device to act on the information, thus facilitating a human interface with a device. Systems employing techniques to recover a linguistic message from an acoustic speech signal are called voice recognizers, or VR systems.
Recently, communication systems facilitating multiple access, i.e., simultaneous transmission and/or reception of several signals over a common communication channel, have been developed and have achieved widespread usage. Multiple-access communication systems often include a plurality of remote subscriber units requiring intermittent service of relatively short duration rather than continuous access to the common communication channel. Several multiple-access techniques are known in the art, such as time division multiple-access (TDMA) and frequency division multiple-access (FDMA). Another type of multiple-access technique is a code division multiple-access (CDMA) spread spectrum system that conforms to the “TIA/EIA/IS-95 Mobile Station-Base Station Compatibility Standard for Dual-Mode Wide-Band Spread Spectrum Cellular System,” hereinafter referred to as the IS-95 standard. The use of CDMA techniques in a multiple-access communication system is disclosed in U.S. Pat. No. 4,901,307, entitled “SPREAD SPECTRUM MULTIPLE-ACCESS COMMUNICATION SYSTEM USING SATELLITE OR TERRESTRIAL REPEATERS,” and U.S. Pat. No. 5,103,459, entitled “SYSTEM AND METHOD FOR GENERATING WAVEFORMS IN A CDMA CELLULAR TELEPHONE SYSTEM,” both assigned to the assignee of the present invention.
A multiple-access communication system may be wireless or wire-line and may carry voice and/or data. An example of a communication system carrying both voice and data is a system in accordance with the IS-95 standard, which specifies transmitting voice and data over the communication channel. A method for transmitting data in code channel frames of fixed size is described in detail in U.S. Pat. No. 5,504,773, entitled “METHOD AND APPARATUS FOR THE FORMATTING OF DATA FOR TRANSMISSION”, assigned to the assignee of the present invention. In accordance with the IS-95 standard, the data or voice is partitioned into code channel frames that are 20 milliseconds wide with data rates as high as 14.4 Kbps. Additional examples of communication systems carrying both voice and data comprise communication systems conforming to the “3rd Generation Partnership Project” (3GPP), embodied in a set of documents including Document Nos. 3G TS 25.211, 3G TS 25.212, 3G TS 25.213, and 3G TS 25.214 (the W-CDMA standard), or “TR-45.5 Physical Layer Standard for cdma2000 Spread Spectrum Systems” (the IS-2000 standard).
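To make the frame figures above concrete, the following minimal sketch works out how many bits a 20 millisecond frame carries at the 14.4 Kbps peak rate. The function names are hypothetical, and the result is the gross bit count per frame; it does not attempt to separate payload from any overhead bits the standard may define.

```python
def bits_per_frame(rate_bps, frame_ms):
    """Gross number of bits carried by one frame at the given data rate."""
    return round(rate_bps * frame_ms / 1000)


def frames_per_second(frame_ms):
    """Number of frames sent each second for a given frame duration."""
    return 1000 / frame_ms


FRAME_MS = 20           # frame duration stated in the text
PEAK_RATE_BPS = 14_400  # peak data rate stated in the text

print(bits_per_frame(PEAK_RATE_BPS, FRAME_MS))  # 288
print(frames_per_second(FRAME_MS))              # 50.0
```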
In a multiple-access communication system, communications between users are conducted through one or more base stations. A first user on one subscriber station communicates to a second user on a second subscriber station by transmitting data on a reverse link to a base station. The base station receives the data and can route the data to another base station. The data is transmitted on a forward link of the same base station, or the other base station, to the second subscriber station. The forward link refers to transmission from a base station to a subscriber station and the reverse link refers to transmission from a subscriber station to a base station. Likewise, the communication can be conducted between a first user on one mobile subscriber station and a second user on a landline station. A base station receives the data from the user on a reverse link, and routes the data through a public switched telephone network (PSTN) to the second user. In many communication systems, e.g., IS-95, W-CDMA, IS-2000, the forward link and the reverse link are allocated separate frequencies.
A user usually interfaces with a subscriber station via a keypad and a display. Such an interface imposes certain limits on its operation. For example, when the user is engaged in another activity that requires visual and physical attention, e.g., driving an automobile, operating the subscriber station requires the user to remove a hand from the steering wheel and look at the keypad while pushing buttons. Such actions tend to divert attention from driving. Even if full concentration on the interface is assured, certain actions, e.g., entry of short messages on a short message system (SMS) enabled subscriber station, can be cumbersome.
As a result of these user interface problems, there is an interest in implementing a VR system in a subscriber station. In general, a VR system comprises an acoustic processor, also called the front end of the VR system, and a word decoder, also called the back end of the VR system. The acoustic processor performs feature extraction, i.e., extracting a sequence of information-bearing features from a speech signal. Feature extraction is necessary for enabling recognition of the linguistic information in the speech signal. Extracted features are transmitted from the front end to the back end of the VR system. The word decoder decodes the sequence of features received to provide a meaningful and desired output, representing the linguistic information contained in the speech signal.
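The front end and back end split described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the features (log energy and zero-crossing rate), the 8 kHz sample rate, the frame sizes, the template values, and the nearest-template decoder are stand-ins, not the features or decoding method of any actual VR system.

```python
import math


def tone(freq_hz, n=800, rate_hz=8000):
    """Synthetic test signal: a sine wave standing in for speech."""
    return [math.sin(2 * math.pi * freq_hz * k / rate_hz) for k in range(n)]


def frames(signal, frame_len, hop):
    """Split a sample sequence into overlapping fixed-length frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def extract_features(signal, frame_len=160, hop=80):
    """Toy front end: one (log energy, zero-crossing rate) pair per frame."""
    feats = []
    for frame in frames(signal, frame_len, hop):
        energy = sum(s * s for s in frame) / len(frame)
        crossings = sum(1 for a, b in zip(frame, frame[1:])
                        if (a < 0) != (b < 0))
        feats.append((math.log(energy + 1e-10),
                      crossings / (len(frame) - 1)))
    return feats


def decode(features, templates):
    """Toy back end: return the label whose template is nearest to the
    mean feature vector of the utterance."""
    mean = [sum(col) / len(features) for col in zip(*features)]
    return min(templates, key=lambda label: math.dist(mean, templates[label]))


# Hypothetical templates: (log energy, zero-crossing rate) per label.
TEMPLATES = {"low": (-0.69, 0.075), "high": (-0.69, 0.425)}

print(decode(extract_features(tone(300)), TEMPLATES))
print(decode(extract_features(tone(1700)), TEMPLATES))
```

The point of the sketch is the interface: the front end reduces raw samples to a short sequence of feature vectors, and the back end sees only those vectors.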
For complex voice recognition tasks, the computational requirement of the processing associated with VR is significant. In a typical DVR system, the word decoder has relatively high computational and memory requirements as measured against the front end of the voice recognizer. Consequently, it is often desirable to place the feature/word decoding task on a subsystem having the ability to appropriately manage computational and memory requirements, such as a network server, while keeping the acoustic processor physically as close to the speech source as possible to reduce adverse effects associated with vocoders. A vocoder is a device for processing the speech signal prior to transmission. Such a VR system implementation, using distributed system architecture, is known as a Distributed Voice Recognition (DVR) system. Thus, in a DVR system, feature extraction is performed at a device, such as a subscriber station comprising a front end, and the subscriber station sends the features to the network, comprising a back end. The network decodes the features and provides a desired linguistic output. Examples of DVR systems are disclosed in U.S. Pat. No. 5,956,683, entitled “Distributed Voice Recognition System,” assigned to the assignee of the present invention.
Certain DVR systems and designs have been employed with varying results. Certain previous systems have operated at low frequency levels, such as in the range of 4 kHz, and have ignored or omitted certain high frequency components of speech, both on the subscriber station side and the network server side. Performance of such systems tends to favor low frequency components received at the expense of high frequency components, particularly those in excess of about 4 kHz. Failure to properly decode, implement, and pass these high frequency components has a tendency to miss certain aspects of the received analog speech signal and create an improper representation of the speech at the network server. Further, interpretation of received features at the network server has tended to use cepstral features exclusively. Cepstral features provide certain information on the features, but use of cepstral processing alone tends to omit certain aspects of speech, or fail to identify certain properties in the speech that are transferred over as features. Previous systems have also operated at a single or limited frequency, thus potentially again adversely affecting either the quality of the speech transmitted, the quality of features derived, or both.
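One way to see why components above about 4 kHz are lost is the sampling limit. The sketch below assumes, purely for illustration, that the 4 kHz figure corresponds to a speech signal sampled at 8 kHz; the text does not state a sampling rate, and the function names are hypothetical. Under that assumption, a 5 kHz tone produces samples that differ from those of a 3 kHz tone only in sign, so the two cannot be told apart by frequency, whereas a 16 kHz sampling rate preserves the 5 kHz tone.

```python
import math


def sample_tone(freq_hz, rate_hz, n):
    """Samples of a unit-amplitude sine wave at the given sampling rate."""
    return [math.sin(2 * math.pi * freq_hz * k / rate_hz) for k in range(n)]


def nyquist_hz(rate_hz):
    """Highest frequency representable at the given sampling rate."""
    return rate_hz / 2


def alias_hz(freq_hz, rate_hz):
    """Apparent frequency of a tone after sampling at rate_hz."""
    folded = freq_hz % rate_hz
    return min(folded, rate_hz - folded)


print(nyquist_hz(8000))        # 4000.0
print(alias_hz(5000, 8000))    # 3000
print(alias_hz(5000, 16000))   # 5000

# At 8 kHz the 5 kHz samples are the negated 3 kHz samples.
high = sample_tone(5000, 8000, 64)
low = sample_tone(3000, 8000, 64)
print(all(abs(a + b) < 1e-9 for a, b in zip(high, low)))  # True
```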
As follows from the above description, there is a need in the art to extract acoustic features, including the high frequency components thereof, and transmit the features with minimal delay over the network such that the back end may process and employ the high frequency components to provide an enhanced acoustic representation of the received speech signal.