With the advent of pagers and mobile phones, the wireless service industry has grown into a multi-billion-dollar industry. The bulk of the revenues for Wireless Service Providers (WSPs) originates from subscriptions. As such, a WSP's ability to run a successful network depends on the quality of service provided to subscribers over a network with limited bandwidth. To this end, WSPs are constantly looking for ways to reduce the amount of information transmitted over the network while maintaining a high quality of service to subscribers.
Recently, speech recognition has enjoyed success in the wireless service industry. Speech recognition is used for a variety of applications and services. For example, a wireless service subscriber can be provided with a speed-dial feature whereby the subscriber speaks the name of a recipient of a call into the wireless device. The recipient's name is recognized using speech recognition and a call is initiated between the subscriber and the recipient. In another example, caller information (411) can utilize speech recognition to recognize the name of a recipient to whom a subscriber is attempting to place a call.
As speech recognition gains acceptance in the wireless community, Distributed Speech Recognition (DSR) has arisen as an emerging technology. DSR refers to a framework in which the feature extraction and the pattern recognition portions of a speech recognition system are distributed. That is, the feature extraction and the pattern recognition portions of the speech recognition system are performed by two different processing units at two different locations. Specifically, the feature extraction process is performed on the front-end, i.e., the wireless device, and the pattern recognition process is performed on the back-end, i.e., by the wireless service provider. DSR enhances speech recognition for more complicated tasks such as automated airline booking with spoken flight information or brokerage transactions with similar features.
The European Telecommunications Standards Institute (ETSI) promulgates a set of standards for DSR. The ETSI DSR standards ES 201 108 (April 2000) and ES 202 050 (July 2002) define the feature extraction and compression algorithms at the front-end. These standards, however, do not incorporate speech reconstruction at the back-end, which may be important in some applications. As a result, new Work Items WI-030 and WI-034 have been released by ETSI to extend the above standards (ES 201 108 and ES 202 050, respectively) to include speech reconstruction at the back-end as well as tonal language recognition.
In the current DSR standards, the features that are extracted, compressed, and transmitted to the back-end are 13 Mel Frequency Cepstral Coefficients (MFCCs), C0-C12, and the logarithm of the frame energy, log-E. These features are updated every 10 ms, i.e., 100 times per second. In the proposals for the extended standards (i.e., the Work Items described above), pitch and class (or voicing) information is also derived for each frame and transmitted in addition to the MFCCs and log-E. This increases the amount of information that the wireless device transmits over the network and consumes additional bandwidth. Thus, it is desirable that the representation of class and pitch information be as compact as possible in order to keep the bit rate low.
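As a rough illustration of the frame arithmetic above, the following Python sketch models one frame of features and converts a per-frame bit cost into a bit rate. The names `Frame`, `frames_per_second`, and `extra_bit_rate_bps` are hypothetical and do not come from the ETSI standards; only the 13 MFCCs, log-E, and the 10 ms update interval are taken from the text.

```python
from dataclasses import dataclass
from typing import List

FRAME_SHIFT_MS = 10   # update interval stated in the text
NUM_MFCC = 13         # C0..C12


@dataclass
class Frame:
    """One frame of front-end features (hypothetical container)."""
    mfcc: List[float]  # C0..C12
    log_e: float       # logarithm of the frame energy

    def __post_init__(self) -> None:
        if len(self.mfcc) != NUM_MFCC:
            raise ValueError(f"expected {NUM_MFCC} MFCCs, got {len(self.mfcc)}")


def frames_per_second(frame_shift_ms: int = FRAME_SHIFT_MS) -> int:
    """Number of feature frames produced per second."""
    return 1000 // frame_shift_ms


def extra_bit_rate_bps(extra_bits_per_frame: int,
                       frame_shift_ms: int = FRAME_SHIFT_MS) -> int:
    """Bit rate added by sending extra_bits_per_frame with every frame."""
    return extra_bits_per_frame * frames_per_second(frame_shift_ms)
```

Under this arithmetic, each additional bit per frame adds 100 bps to the transmitted stream, which is why the per-frame size of the pitch and class fields matters.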
It has been an ongoing problem to represent pitch information compactly without sacrificing accuracy or robustness against communication channel errors. In general, speech vocoders (e.g., the Mixed-Excitation Linear Predictive (MELP) coder, which is the U.S. Federal Standard at 2400 bps) quantize pitch information absolutely, i.e., independently for each frame, using 7 or more bits per frame. In the Extended DSR standards, it is important to keep the additional bit rate due to pitch and class information as low as possible. A combination of absolute and differential techniques has been adopted to quantize the pitch period information using only 6 bits per frame, thus saving at least 1 bit per frame. However, this can potentially cause problems in terms of accuracy and robustness to channel errors.
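The difference between absolute and differential quantization of a pitch period can be sketched as follows. This is a minimal illustration, not the scheme defined by the Extended DSR standards or by MELP: the pitch range (`P_MIN`, `P_MAX`), the log-domain uniform quantizer, the `max_ratio` limit, and all function names are assumptions made for the example.

```python
import math

# Hypothetical pitch-period range in samples (assumption, not from the text).
P_MIN, P_MAX = 20.0, 160.0
LOG_RANGE = math.log(P_MAX) - math.log(P_MIN)


def quantize_absolute(period: float, bits: int) -> int:
    """Uniformly quantize log(period) over [P_MIN, P_MAX] with 2**bits levels."""
    levels = 1 << bits
    x = (math.log(period) - math.log(P_MIN)) / LOG_RANGE
    x = min(max(x, 0.0), 1.0)              # clamp out-of-range periods
    return min(int(x * levels), levels - 1)


def dequantize_absolute(index: int, bits: int) -> float:
    """Reconstruct the period at the centre of the quantization cell."""
    levels = 1 << bits
    return math.exp(math.log(P_MIN) + (index + 0.5) / levels * LOG_RANGE)


def quantize_differential(period: float, reference: float, bits: int,
                          max_ratio: float = 1.2) -> int:
    """Quantize log(period / reference), limited to +/- log(max_ratio)."""
    levels = 1 << bits
    span = math.log(max_ratio)
    d = min(max(math.log(period / reference), -span), span)
    return min(int((d + span) / (2 * span) * levels), levels - 1)


def dequantize_differential(index: int, reference: float, bits: int,
                            max_ratio: float = 1.2) -> float:
    """Reconstruct the period relative to an already decoded reference."""
    levels = 1 << bits
    span = math.log(max_ratio)
    d = (index + 0.5) / levels * 2 * span - span
    return reference * math.exp(d)
```

In this sketch the differential code needs fewer bits because it covers only a narrow range around the reference, but its decoded value depends on that reference. If a channel error corrupts the reference, the error carries over to the differentially coded frame, which is the kind of robustness concern the paragraph above raises.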
Therefore, a need exists to overcome the problems with the prior art as discussed above.