In the provisioning of many new and existing communication services, voice prompts are used to aid the user in navigating through the service. In particular, a speech recognition element is used to guide the dialogue with the user through voice prompts, usually questions aimed at determining which information the user requires. An automatic speech recognizer recognizes what is being said, and the recognized information is used to control the behavior of the service rendered to the user.
Modern speech recognizers make use of phoneme-based recognition, which relies on phone-based sub-word models to perform speaker-independent recognition over the telephone. In the recognition process, speech “features” are computed for each incoming frame. Modern speech recognizers also have a feature called “rejection”. With rejection, the recognizer has the ability to indicate that what was uttered does not correspond to any of the words in the lexicon.
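By way of illustration only, the rejection behavior described above may be sketched as follows. The text does not specify how a recognizer decides to reject; the score threshold, the per-word score dictionary, and the function name `recognize_with_rejection` are assumptions introduced purely for this example.

```python
from typing import Dict, Optional


def recognize_with_rejection(scores: Dict[str, float],
                             threshold: float) -> Optional[str]:
    """Return the best-scoring lexicon word, or None to signal rejection.

    `scores` maps each lexicon word to a recognizer score (higher is
    better, e.g. a log-likelihood). The thresholding rule is an
    assumption for illustration, not a method taken from the text.
    """
    if not scores:
        # An empty lexicon can never match the utterance.
        return None
    best_word = max(scores, key=scores.get)
    # Reject when even the best candidate scores below the threshold.
    if scores[best_word] < threshold:
        return None
    return best_word


if __name__ == "__main__":
    print(recognize_with_rejection({"yes": -12.0, "no": -30.0}, -20.0))
    print(recognize_with_rejection({"yes": -42.0, "no": -45.0}, -20.0))
```

A return value of `None` corresponds to the recognizer indicating that the utterance matches no word in the lexicon.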
The users of wireless communication services expect to have access to all of the services available to the users of land-based wireline systems, and to receive a similar quality of service. Voice-activated services are particularly important to wireless subscribers, since the dial pad is generally out of sight when the subscriber is listening to a voice prompt or driving a car. With speech recognition, there are virtually no restrictions on mobility, because callers do not have to take their eyes off the road to press keys on the terminal.
Currently, one area of research is focusing on the front-end design for a wireless speech recognition system. In general, many prior art front-end designs fall into one of two categories, as illustrated in FIG. 1. FIG. 1(a) illustrates an arrangement 10 including a speech encoder 12 at the transmitting end, a communication channel 14 (such as a wireless channel) and a speech decoder 16 at the receiving end. The decoded speech is thereafter sent to EAR and also applied as an input to a speech recognition feature extractor 18, where the output from extractor 18 is thereafter applied as an input to an automatic speech recognizer (not shown). In a second arrangement 20 illustrated in FIG. 1(b), a speech recognition feature encoder 22 is used at the transmitting end to allow for the features themselves to be encoded and transmitted over the (wireless) channel 24. The encoded features are then applied as parallel inputs to both a speech decoder 26 and a speech recognition feature extractor 28 at the receiving end, the output from feature extractor 28 thereafter applied as an input to an automatic speech recognizer (not shown). This scheme is particularly useful in Internet access applications. For example, when the mel-frequency cepstral coefficients are compressed at a rate of approximately 4 kbit/s, the automatic speech recognizer (ASR) at the decoder side of the coder exhibits a performance comparable to a conventional wireline ASR system. However, this scheme is not able to generate synthesized speech of the quality produced by the system as shown in FIG. 1(a).
In speech coding, channel impairments are modeled by bit error insertion and frame erasure insertion devices, where the number of bit errors and frame erasures depends primarily on noise, co-channel and adjacent-channel interference, and frequency-selective fading. Fortunately, most speech coders are combined with a channel coder, where a “frame erasure” is declared if any of the bits most sensitive to channel errors is in error. The speech coding parameters of an erased frame must then be extrapolated in order to generate the speech signal for the erased frame. A family of error concealment techniques is known in the prior art, and these can generally be classified as either “substitution” or “extrapolation” techniques. In general, the parameters of the erased frames are reconstructed by repeating the parameters of the previous frame with scaled-down gain values. In conventional speech recognition systems, a decoded speech-based front-end uses the synthesized speech to extract features. However, in a bitstream-based front-end, the parameters themselves are present.
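By way of illustration only, the general concealment approach described above (repeating the parameters of the previous frame with scaled-down gain) may be sketched as follows. The `FrameParams` layout, the attenuation factor of 0.5, the use of `None` to mark an erased frame, and the `initial` state argument are all assumptions made for this example; actual coders define their own parameter sets and attenuation schedules.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class FrameParams:
    """Hypothetical per-frame coding parameters (layout is an assumption)."""
    spectral: Tuple[float, ...]  # e.g. spectral envelope parameters
    gain: float


def conceal_erasures(frames: List[Optional[FrameParams]],
                     initial: FrameParams,
                     attenuation: float = 0.5) -> List[FrameParams]:
    """Replace each erased frame (None) with the previous frame's
    parameters, scaling the gain down by `attenuation` each time.

    `initial` stands in for the "previous frame" if the very first
    frame is erased. Consecutive erasures attenuate progressively,
    because each concealed frame becomes the next frame's reference.
    """
    concealed: List[FrameParams] = []
    previous = initial
    for frame in frames:
        if frame is None:
            # Erased: repeat the previous parameters with reduced gain.
            frame = replace(previous, gain=previous.gain * attenuation)
        concealed.append(frame)
        previous = frame
    return concealed


if __name__ == "__main__":
    silence = FrameParams(spectral=(0.0, 0.0), gain=0.0)
    received = [FrameParams((1.0, 2.0), 0.8), None, None,
                FrameParams((3.0, 4.0), 0.4)]
    for f in conceal_erasures(received, initial=silence):
        print(f)
```

In this sketch, a run of consecutive erasures decays toward silence, while correctly received frames pass through unchanged.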
The need remaining in the prior art, therefore, is to provide a technique for handling frame erasures in a bitstream-based front-end speech recognition system.