1. Field of the Invention
The invention pertains to a method for providing enhanced performance for speech recognition services on digital wireless networks, and more particularly to a digital connection for voice activated services on wireless networks.
2. Background Art
A voice service node (VSN) is a platform which interacts, through a switch, with the telecommunication network to which it is attached, and provides one or more services such as banking information, user profiles, voice messages, call delivery, direct dialling, etc. under voice control, through speech recognition. The VSN guides the dialogue with the user through voice prompts, usually questions aimed at determining which information the user requires. An automatic speech recognizer recognizes what is being said, and this information is used to control the behaviour of the service rendered to the user.
Modern speech recognizers make use of phoneme-based recognition, which relies on phone-based sub-word models to perform speaker-independent recognition over the telephone. In the recognition process, speech `features` are computed for each incoming frame. Modern speech recognizers also provide a capability called rejection: when rejection is available, the recognizer can indicate that what was uttered does not correspond to any of the words in the lexicon.
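By way of illustration, the frame-based feature computation described above may be sketched as follows. This is a minimal sketch only: the 8 kHz sampling rate, the 20 ms frame length, and the use of log energy as the sole feature are illustrative assumptions; actual recognizers compute richer feature vectors (e.g. cepstral coefficients) for each frame.

```python
import math

SAMPLE_RATE = 8000              # assumed telephone-band sampling rate (Hz)
FRAME_MS = 20                   # assumed frame duration (ms)
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000   # 160 samples per frame

def frame_features(samples):
    """Split a sequence of PCM samples into non-overlapping frames and
    compute one illustrative `feature` (log energy) per frame."""
    feats = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN):
        frame = samples[start:start + FRAME_LEN]
        energy = sum(s * s for s in frame) / FRAME_LEN
        feats.append(math.log(energy + 1e-10))  # floor avoids log(0)
    return feats
```

In practice a frame advance shorter than the frame length (overlapping frames) and a window function would normally be used; they are omitted here for brevity.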
End-pointing is the process whereby the speech recognizer attempts to determine exactly when a person begins and ends speaking. End points are also used to determine whether the person actually said nothing, or said something longer than expected, which is likely to be out of the vocabulary.
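A simple energy-threshold end-pointer captures the idea. This sketch is a hypothetical illustration, not the end-pointing algorithm of any particular recognizer; the threshold and the minimum-duration parameter are assumptions introduced for the example.

```python
def end_point(frame_energies, threshold, min_speech_frames=2):
    """Return (start, end) frame indices of the detected utterance,
    or None if no utterance above the energy threshold was found.
    Too-short detections are rejected as non-speech."""
    start = None
    end = None
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = i      # first frame above threshold: begin point
            end = i            # last frame above threshold: end point
    if start is None or end - start + 1 < min_speech_frames:
        return None            # nothing said, or detection too short
    return (start, end)
```

A symmetrical check against a maximum duration would flag utterances longer than expected, which, as noted above, are likely to be out of the vocabulary.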
The users of wireless communication services expect to have access to all services available to the users of land communication systems, and to receive a similar quality of service. Voice activated services are particularly important to wireless subscribers because the dial pad is generally out of sight when the subscriber is listening to a vocal prompt, or when driving a car. With speech recognition, there are virtually no restrictions on mobility, because callers do not have to take their eyes off the road to press keys on the terminal.
Unlike land connections, the wireless connections used for mobile and fixed access communications are subject to a number of impairments, such as time-varying `multipath fading`, shadowing, interference, etc., that result in channel errors. These errors degrade the quality of voice and services provided to the mobile users. For example, multipath fading is a physical phenomenon due to the lack of a direct line-of-sight path between the antennae at the ends of the communication channel, such as the antenna at the cell site and the antenna of a mobile. Instead, the signal is reflected and diffracted by building surfaces and edges, or by natural objects such as hills, mountains, or trees, so that the signal received on an antenna is the sum of multiple signals, each having followed its own path.
Most digital wireless systems encode and transmit speech in packets built from speech samples corresponding to a time slice called a frame. For example, many systems collect and transmit speech information in 20 ms frames. Because of the wireless impairments mentioned above, the compressed information is sent with forward error correction (FEC) protection and a detection mechanism, typically a cyclic redundancy check (CRC), that allows the receiver to detect when a frame has been damaged to the point of being unusable, or `bad`. The current approach to correcting the air link errors is a standard `replication and muting` sequence effected by the speech decoder. When such a `bad` frame is received, the speech decoder uses information from previous `good` frames to regenerate speech; eventually the signal is muted.
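The `replication and muting` sequence described above can be sketched as follows. This is a simplified illustration under stated assumptions: real decoders replicate decoder parameters rather than raw samples, and typically attenuate gradually before muting; the `max_replications` limit and the zero-fill muting used here are assumptions of the sketch.

```python
def conceal(frames, bad_flags, max_replications=3):
    """Replication-and-muting concealment: each `bad` frame is replaced
    by a copy of the last `good` frame; after several consecutive bad
    frames, the output is muted (filled with silence)."""
    out = []
    last_good = None
    consecutive_bad = 0
    for frame, bad in zip(frames, bad_flags):
        if not bad:
            last_good = frame
            consecutive_bad = 0
            out.append(frame)
        else:
            consecutive_bad += 1
            if last_good is not None and consecutive_bad <= max_replications:
                out.append(last_good)            # replicate previous good frame
            else:
                out.append([0.0] * len(frame))   # mute
    return out
```

The muted output is what the downstream recognizer ultimately receives, which is the root of the end-pointing problem discussed next.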
As such, in traditional wireless communication systems, the pulse code modulated (PCM) samples coming out of the mobile telephone exchange (MTX) are sent to the VSN, feeding the speech recognizer with a signal that is attenuated and sometimes muted due to the RF impairments. Speech recognition errors occur as a result. In particular, the end-pointer, which finds the beginning and end of each word, is adversely affected by muting intervals, which can have the appearance of silence following speech while actually occurring in the middle of utterances.
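This failure mode can be demonstrated with a toy end-pointer. The sketch below is a hypothetical illustration, not the recognizer's actual algorithm: a muted interval inside an utterance has near-zero energy, which the end-pointer cannot distinguish from the silence that follows speech, so it declares the end point prematurely, mid-utterance.

```python
def premature_endpoint(frame_energies, silence_threshold, hangover=1):
    """Declare the utterance finished after `hangover` consecutive
    sub-threshold frames once speech has begun; return the frame index
    at which the end point is declared, or None if speech never ends."""
    in_speech = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= silence_threshold:
            in_speech = True
            silent_run = 0
        elif in_speech:
            silent_run += 1
            if silent_run >= hangover:
                return i       # end point declared here
    return None
```

With energies `[0, 5, 5, 0, 5, 5, 0]`, where the zero at index 3 represents a muted frame inside the utterance, the end point is declared at index 3 even though speech resumes afterwards.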
Furthermore, the recognizer has no indication of the frame boundaries, or of which frames were muted or replicated and which are `good`.
There is a need for enhancing the performance of the speech recognizer by providing the VSN with a means for minimizing the effects of air link errors.