With the advent of mobile phones and wireless communication devices, the wireless service industry has grown into a multi-billion dollar industry. The bulk of the revenues for Wireless Service Providers (WSPs) originates from subscriptions. As such, a WSP's ability to run a successful network depends on the quality of service provided to subscribers over a network having a limited bandwidth. To this end, WSPs are constantly looking for ways to reduce the amount of information that is transmitted over the network while maintaining a high quality of service to subscribers.
Recently, speech recognition has enjoyed success in the wireless service industry. Speech recognition is used for a variety of applications and services. For example, a wireless service subscriber can be provided with a speed-dial feature whereby the subscriber speaks the name of a recipient of a call into the wireless device. The recipient's name is recognized using speech recognition and a call is initiated between the subscriber and the recipient. In another example, caller information (411) can utilize speech recognition to recognize the name of a recipient to whom a subscriber is attempting to place a call.
As speech recognition gains acceptance in the wireless community, Distributed Speech Recognition (DSR) has arisen as an emerging technology. DSR refers to a framework in which the feature extraction and the pattern recognition portions of a speech recognition system are distributed. That is, the feature extraction and the pattern recognition portions of the speech recognition system are performed by two different processing units at two different locations. Specifically, the feature extraction process is performed on the front-end, i.e., the wireless device, and the pattern recognition process is performed on the back-end, i.e., by the wireless service provider system. DSR enables the wireless device to handle more complicated speech recognition tasks, such as automated airline booking with spoken flight information or brokerage transactions with similar features.
The European Telecommunications Standards Institute (ETSI) has issued a set of standards for DSR. The ETSI DSR standards ES 201 108 (April 2000) and ES 202 050 (July 2002) define the feature extraction and compression algorithms at the front-end. These standards, however, do not incorporate speech reconstruction at the back-end, which may be important in some applications. As a result, new Work Items WI-030 and WI-034 have been released by ETSI to extend the above standards (ES 201 108 and ES 202 050, respectively) to include speech reconstruction at the back-end as well as tonal language recognition.
In the current DSR standards, the features that are extracted, compressed, and transmitted to the back-end are 13 Mel Frequency Cepstral Coefficients (MFCCs), C0–C12, and the logarithm of the frame energy, log-E. These features are updated every 10 ms, i.e., 100 times per second. In the proposals for the extended standards (i.e., the Work Items described above), pitch and class (or voicing) information are also intended to be derived for each frame and transmitted in addition to the MFCCs and log-E. However, the pitch information extraction method remains to be defined in the extensions to the current DSR standards.
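The per-frame feature set described above can be pictured with a small Python sketch. It only restates the arithmetic in the text (13 MFCCs plus log-E, one frame every 10 ms). The names `DsrFrame`, `frames_per_second`, and `values_per_second` are hypothetical and are not taken from the ETSI standards, and the count is of uncompressed feature values, not of transmitted bits.

```python
from dataclasses import dataclass
from typing import List

FRAME_SHIFT_MS = 10   # features are updated every 10 ms (from the text)
NUM_MFCC = 13         # C0..C12 (from the text)


@dataclass
class DsrFrame:
    """Hypothetical container for one frame of front-end features."""
    mfcc: List[float]   # C0..C12
    log_energy: float   # log-E

    def as_vector(self) -> List[float]:
        # Concatenate the 13 MFCCs and log-E into one 14-element vector.
        return list(self.mfcc) + [self.log_energy]


def frames_per_second(frame_shift_ms: int = FRAME_SHIFT_MS) -> int:
    # One frame per shift interval.
    return 1000 // frame_shift_ms


def values_per_second() -> int:
    # Uncompressed feature values produced per second (before any compression).
    return frames_per_second() * (NUM_MFCC + 1)
```

Under these assumptions, the front-end produces 100 frames and 1400 feature values per second before compression. Pitch and class information, as proposed in the extensions, would add further fields to each frame.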
A variety of techniques have been used for pitch estimation using either time-domain methods or frequency-domain methods. It is well known that a speech signal representing a voiced sound within a relatively short frame can be approximated by a periodic signal. This periodicity is characterized by the duration of one cycle (the pitch period) T or by its inverse, the fundamental frequency F0. Unvoiced sound is represented by an aperiodic speech signal. In standard vocoders, e.g., the LPC-10 vocoder and the MELP (Mixed Excitation Linear Predictive) vocoder, time-domain methods have been commonly used for pitch extraction. A common approach to time-domain pitch estimation uses correlation-type schemes, which search for a pitch period T that maximizes the cross-correlation between a signal segment centered at time t and one centered at time t-T. Pitch estimation using time-domain methods has had varying success depending on the complexity involved and background noise conditions. Such time-domain methods in general tend to be better for high pitch sounds because of the many pitch periods contained in a given time window.
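As a rough illustration of the correlation-type scheme described above, the following Python sketch searches for the lag T that maximizes a normalized cross-correlation between a segment and its copy shifted by T. It is a minimal sketch, not the LPC-10 or MELP procedure. The function name `estimate_pitch_period`, the 60 to 400 Hz search range, and the tie tolerance that favors the shortest lag are all assumptions made for the example.

```python
import math


def estimate_pitch_period(signal, sample_rate, f0_min=60.0, f0_max=400.0):
    """Return (lag_in_samples, score) for the best candidate pitch period.

    The score is the normalized cross-correlation between the signal and
    itself delayed by `lag` samples; it lies in [-1, 1].
    """
    lag_min = int(sample_rate / f0_max)   # shortest period considered
    lag_max = int(sample_rate / f0_min)   # longest period considered
    best_lag, best_score = 0, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        n = len(signal) - lag
        if n <= 0:
            break
        a = signal[lag:]   # samples at time t
        b = signal[:n]     # samples at time t - T
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        score = num / den if den > 0 else 0.0
        # Assumed tie-break: a longer lag must beat the current best by a
        # small margin, so multiples of the true period do not win on
        # rounding noise alone.
        if score > best_score + 1e-9:
            best_lag, best_score = lag, score
    return best_lag, best_score


def synthetic_voiced_frame(f0=200.0, sample_rate=8000, num_samples=400):
    # Hypothetical test signal: a pure sinusoid standing in for voiced speech.
    return [math.sin(2.0 * math.pi * f0 * i / sample_rate)
            for i in range(num_samples)]
```

For the synthetic 200 Hz frame sampled at 8000 Hz, the search settles on a lag of 40 samples. Real speech is only approximately periodic and is corrupted by noise, so a practical estimator would need further safeguards against pitch halving and doubling.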
As is well known, the Fourier spectrum of an infinite periodic signal is a train of impulses (harmonics, lines) located at multiples of the fundamental frequency. Consequently, frequency-domain pitch estimation is typically based on analyzing the locations and amplitudes of spectral peaks. A criterion for fundamental frequency search (i.e., for estimation of pitch) is a high level of compatibility between the fundamental frequency value and the spectral peaks. Frequency-domain methods in general tend to be better for estimating the pitch of low pitch frequency sounds because a large number of harmonics typically falls within the analysis bandwidth. Since frequency-domain methods analyze the spectral peaks and not the entire spectrum, the information residing in a speech signal is only partially used to estimate the fundamental frequency of a speech sample. This fact is the source of both the advantages and the disadvantages of frequency-domain methods. The advantages are potential tolerance with respect to the deviation of real speech data from the exact periodic model, noise robustness, and relatively low computational complexity. However, the search criteria cannot be viewed as a sufficient condition because only a part of the spectral information is tested. Since known frequency-domain methods for pitch extraction typically use only the information about the harmonic peaks in the spectrum, these methods, when used alone, result in pitch estimates whose inaccuracy and error rates are unacceptable for DSR applications.
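The compatibility criterion described above can be sketched in Python as follows. This is one assumed scoring rule among many, not a method taken from any standard: each spectral peak contributes its amplitude, weighted by how close it lies to an integer multiple of the candidate F0. The names `harmonic_compatibility` and `estimate_f0`, the tolerance `rel_tol`, the 1 Hz candidate grid, and the rule of preferring the highest tied candidate are all hypothetical.

```python
def harmonic_compatibility(f0, peaks, rel_tol=0.1):
    """Score how well spectral peaks line up with multiples of f0.

    `peaks` is a list of (frequency_hz, amplitude) pairs. A peak counts
    only if it lies within rel_tol * f0 of the nearest harmonic.
    """
    score = 0.0
    for freq, amp in peaks:
        k = max(1, round(freq / f0))          # nearest harmonic number
        deviation = abs(freq - k * f0) / f0   # distance in units of f0
        if deviation <= rel_tol:
            score += amp * (1.0 - deviation / rel_tol)
    return score


def estimate_f0(peaks, f0_min=60.0, f0_max=400.0, step=1.0):
    """Grid search over candidate F0 values using the score above."""
    num_steps = int((f0_max - f0_min) / step) + 1
    candidates = [f0_min + i * step for i in range(num_steps)]
    scores = [harmonic_compatibility(f0, peaks) for f0 in candidates]
    best = max(scores)
    # Assumed tie-break: sub-multiples of the true F0 match the same peaks,
    # so prefer the highest candidate among those tied for the best score.
    tied = [f0 for f0, s in zip(candidates, scores) if s >= best - 1e-9]
    return max(tied)


# Hypothetical peak list: harmonics of a 200 Hz voiced sound.
example_peaks = [(200.0, 1.0), (400.0, 0.8), (600.0, 0.6), (800.0, 0.4)]
```

With the example peaks, candidates of 100 Hz and 200 Hz receive the same score, because every peak is also a multiple of 100 Hz. The sketch resolves the tie only through its assumed tie-break, which shows in miniature why peak compatibility alone is not a sufficient condition.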