The present invention relates to automatic speech recognition and, more particularly, to a bitstream-based feature extraction process for wireless communication applications.
In the provisioning of many new and existing communication services, voice prompts are used to aid the speaker in navigating through the service. In particular, a speech recognizing element is used to guide the dialogue with the user through voice prompts, usually questions aimed at defining which information the user requires. An automatic speech recognizer is used to recognize what is being said and the information is used to control the behavior of the service rendered to the user.
Modern speech recognizers make use of phoneme-based recognition, which relies on phone-based sub-word models to perform speaker-independent recognition over the telephone. In the recognition process, speech “features” are computed for each incoming frame. Modern speech recognizers also have a feature called “rejection”. When rejection exists, the recognizer has the ability to indicate that what was uttered does not correspond to any of the words in the lexicon.
The users of wireless communication services expect to have access to all of the services available to the users of land-based wireline systems, and to receive a similar quality of service. Voice-activated services are particularly important to wireless subscribers, since the dial pad is generally out of sight while the subscriber listens to a voice prompt, and cannot safely be looked at while the subscriber is driving a car. With speech recognition, there are virtually no restrictions on mobility, because callers do not have to take their eyes off the road to press keys on the terminal.
Currently, one area of research is focusing on the front-end design for a wireless speech recognition system. In general, many prior art front-end designs fall into one of two categories, as illustrated in FIG. 1. FIG. 1(a) illustrates an arrangement 10 including a speech encoder 12 at the transmitting end, a communication channel 14 (such as a wireless channel) and a speech decoder 16 at the receiving end. The decoded speech is thereafter sent to EAR and also applied as an input to a speech recognition feature extractor 18, where the output from extractor 18 is thereafter applied as an input to an automatic speech recognizer (not shown). In a second arrangement 20 illustrated in FIG. 1(b), a speech recognition feature encoder 22 is used at the transmitting end to allow for the features themselves to be encoded and transmitted over the (wireless) channel 24. The encoded features are then applied as parallel inputs to both a speech decoder 26 and a speech recognition feature extractor 28 at the receiving end, the output from feature extractor 28 thereafter applied as an input to an automatic speech recognizer (not shown). This scheme is particularly useful in Internet access applications. For example, when the mel-frequency cepstral coefficients are compressed at a rate of approximately 4 kbit/s, the automatic speech recognizer (ASR) at the decoder side of the coder exhibits a performance comparable to a conventional wireline ASR system. However, this scheme is not able to generate synthesized speech of the quality produced by the system as shown in FIG. 1(a).
The need remaining in the prior art, therefore, is to provide an ASR front-end whose feature recognition performance is comparable to a wireline ASR and is also able to provide decoded speech of high quality.
The need remaining in the prior art is addressed by the present invention, which relates to a feature extraction system and method and, more particularly, to a bitstream-based extraction process that converts the quantized spectral information from a speech coder directly into a cepstrum.
In accordance with the present invention, the bitstream of the encoded speech is applied in parallel as inputs to both a front-end speech decoder and feature extractor. The feature parameters consist of both spectral envelope and voicing information. The spectral envelope is derived from the quantized line spectrum pairs (LSPs) followed by conversion to LPC cepstral coefficients. The voiced/unvoiced information is directly obtained from the bits corresponding to adaptive and fixed codebook gains of a speech coder. Thus, the cepstrum is directly converted in the speech decoder from the spectral information bits of the speech coder. The use of both the spectral envelope information and the voiced/unvoiced information yields a front-end feature extractor that is greatly improved over the prior art models.
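The spectral-envelope path described above (quantized LSPs, then LPC coefficients, then LPC cepstral coefficients) can be illustrated with a minimal sketch. The sketch is not taken from the specification. It assumes that the LSPs have already been dequantized from the bitstream into line spectral frequencies in radians, that the predictor order is even, and that the inverse filter is written A(z) = 1 + a_1 z^-1 + ... + a_p z^-p. The function names `poly_mul`, `lsp_to_lpc` and `lpc_to_cepstrum` are hypothetical, and the gain term c_0 is omitted.

```python
import math


def poly_mul(a, b):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def lsp_to_lpc(lsf):
    """Convert ascending line spectral frequencies (radians) to LPC coefficients.

    Assumes an even predictor order p = len(lsf). Returns [a_1, ..., a_p]
    for A(z) = 1 + sum_k a_k z^-k.
    """
    p_poly = [1.0, 1.0]   # symmetric polynomial P(z), fixed root at z = -1
    q_poly = [1.0, -1.0]  # antisymmetric polynomial Q(z), fixed root at z = +1
    for i, w in enumerate(lsf):
        section = [1.0, -2.0 * math.cos(w), 1.0]
        if i % 2 == 0:
            p_poly = poly_mul(p_poly, section)  # 1st, 3rd, ... frequencies
        else:
            q_poly = poly_mul(q_poly, section)  # 2nd, 4th, ... frequencies
    p = len(lsf)
    # A(z) = (P(z) + Q(z)) / 2; the z^-(p+1) terms cancel.
    return [0.5 * (p_poly[k] + q_poly[k]) for k in range(1, p + 1)]


def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c_1..c_n_ceps of the all-pole model 1/A(z), by recursion."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] (gain term) is left unused
    for n in range(1, n_ceps + 1):
        acc = 0.0
        for m in range(max(1, n - p), n):
            acc += m * c[m] * a[n - m - 1]  # m * c_m * a_(n-m)
        a_n = a[n - 1] if n <= p else 0.0
        c[n] = -a_n - acc / n
    return c[1:]


if __name__ == "__main__":
    # Illustrative second-order frame; the frequencies are made-up values.
    lpc = lsp_to_lpc([math.pi / 4, math.pi / 2])
    print(lpc_to_cepstrum(lpc, 6))
```

In such a sketch the cepstral vector is obtained from the decoded spectral parameters alone, without first synthesizing the speech waveform and re-analyzing it, which is the distinction drawn above between the bitstream-based extractor and the arrangement of FIG. 1(a).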