With the advent of pagers and mobile phones, the wireless service industry has grown into a multi-billion dollar industry. The bulk of the revenue for Wireless Service Providers (WSPs) comes from subscriptions. As such, a WSP's ability to run a successful network depends on the quality of service provided to subscribers over a network having limited bandwidth. To this end, WSPs are constantly looking for ways to reduce the amount of information that is transmitted over the network while maintaining a high quality of service to subscribers.
Recently, speech recognition has enjoyed success in the wireless service industry. Speech recognition is used for a variety of applications and services. For example, a wireless service subscriber can be provided with a speed-dial feature whereby the subscriber speaks the name of a recipient of a call into the wireless device. The recipient's name is recognized using speech recognition and a call is initiated between the subscriber and the recipient. In another example, caller information (411) can utilize speech recognition to recognize the name of a recipient to whom a subscriber is attempting to place a call.
As speech recognition gains acceptance in the wireless community, Distributed Speech Recognition (DSR) has arisen as an emerging technology. DSR refers to a framework in which the feature extraction and the pattern recognition portions of a speech recognition system are distributed. That is, the feature extraction and the pattern recognition portions of the speech recognition system are performed by two different processing units at two different locations. Specifically, the feature extraction process is performed on the front-end, i.e., the wireless device, and the pattern recognition process is performed on the back-end, i.e., by the wireless service provider. DSR enhances speech recognition for more complicated tasks such as automated airline booking with spoken flight information or brokerage transactions with similar features.
The European Telecommunications Standards Institute (ETSI) promulgates a set of standards for DSR. The ETSI DSR standards ES 201 108 (April 2000) and ES 202 050 (July 2002) define the feature extraction and compression algorithms at the front-end. These standards, however, do not incorporate speech reconstruction at the back-end, which may be important in some applications. As a result, new Work Items WI-030 and WI-034 have been released by ETSI to extend the above standards (ES 201 108 and ES 202 050, respectively) to include speech reconstruction at the back-end as well as tonal language recognition.
In the current DSR standards, the features that are extracted, compressed, and transmitted to the back-end are 13 Mel Frequency Cepstral Coefficients (MFCCs), C0-C12, and the logarithm of the frame energy, log-E. These features are updated every 10 ms, i.e., 100 times per second. In the proposals for the extended standards (i.e., in response to the Work Items described above), pitch and class (or voicing) information are also derived for each frame and transmitted in addition to the MFCCs and log-E. This increases the amount of information that is transmitted by the wireless device over the network and consumes additional bandwidth. Thus, it is desirable that the representation of class and pitch information be as compact as possible in order to keep the bit rate low.
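The cost of the added per-frame information can be illustrated with simple arithmetic. The following is a minimal sketch only: the frame period and the feature count come from the description above, while the function names and the example figure of 8 extra bits per frame are hypothetical and are not taken from any standard or proposal.

```python
# Values stated in the text above.
FRAME_PERIOD_MS = 10          # features are updated every 10 ms
FEATURES_PER_FRAME = 13 + 1   # C0-C12 plus log-E


def frames_per_second(frame_period_ms: int = FRAME_PERIOD_MS) -> int:
    """Number of feature frames produced per second."""
    return 1000 // frame_period_ms


def feature_values_per_second() -> int:
    """Number of individual feature values produced per second."""
    return FEATURES_PER_FRAME * frames_per_second()


def added_bit_rate(extra_bits_per_frame: int) -> int:
    """Extra bits per second if each frame carries this many more bits.

    extra_bits_per_frame is a hypothetical budget for pitch and class
    information; the text does not specify an actual figure.
    """
    return extra_bits_per_frame * frames_per_second()
```

Under these assumptions, every additional bit carried per frame costs 100 bits per second, so a hypothetical 8-bit pitch-and-class field would add 800 bits per second to the transmitted stream.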
In speech coders, the normal practice has been to quantize the pitch information and the class information separately. In some coders, e.g., the Mixed Excitation Linear Predictive (MELP) coder, which is the U.S. Federal Standard at 2400 bps, the “unvoiced” class is represented by a “zero pitch value”. Unfortunately, the multiple types of classes proposed for the extended standards require an increased amount of information to represent, and increased bandwidth to transmit, the class information.
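The two conventions described above can be sketched as follows. This is an illustrative sketch only, not the MELP bitstream format or any proposed DSR format: the function names, the reserved codeword value, and the class counts used below are assumptions made for the example.

```python
from typing import Optional, Tuple

UNVOICED_CODE = 0  # hypothetical reserved codeword: "zero pitch" means unvoiced


def encode_zero_pitch(voiced: bool, pitch_index: int, pitch_bits: int = 7) -> int:
    """Fold a two-way voiced/unvoiced decision into the pitch codeword."""
    if not voiced:
        return UNVOICED_CODE
    max_index = (1 << pitch_bits) - 1
    if not 1 <= pitch_index <= max_index:
        raise ValueError("pitch index must lie in 1..max_index for voiced frames")
    return pitch_index


def decode_zero_pitch(code: int) -> Tuple[bool, Optional[int]]:
    """Recover (voiced, pitch_index); pitch_index is None for unvoiced frames."""
    if code == UNVOICED_CODE:
        return (False, None)
    return (True, code)


def separate_class_bits(num_classes: int) -> int:
    """Bits needed for a separate class field that distinguishes num_classes."""
    if num_classes < 1:
        raise ValueError("num_classes must be positive")
    return (num_classes - 1).bit_length()
```

With only two classes, the zero-pitch convention needs no separate class field at all. When the class is sent as its own field, two classes cost 1 bit per frame, and a hypothetical four classes would cost 2 bits per frame, which illustrates why additional class types increase the information to be transmitted.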
Therefore, a need exists to overcome the problems with the prior art as discussed above.