I. Field
The present invention pertains generally to the field of communications and more specifically to a system and method for transmitting speech activity in a distributed voice recognition system.
II. Background
Voice recognition (VR) represents one of the most important techniques to endow a machine with simulated intelligence to recognize user-voiced commands and to facilitate a human interface with the machine. VR also represents a key technique for human speech understanding. Systems that employ techniques to recover a linguistic message from an acoustic speech signal are called voice recognizers.
The use of VR (also commonly referred to as speech recognition) is becoming increasingly important for safety reasons. For example, VR may be used to replace the manual task of pushing buttons on a wireless telephone keypad. This is especially important when a user is initiating a telephone call while driving a car. When using a car telephone without VR, the driver must remove one hand from the steering wheel and look at the phone keypad while pushing the buttons to dial the call. These acts increase the likelihood of a car accident. A speech-enabled car telephone (i.e., a telephone designed for speech recognition) allows the driver to place telephone calls while continuously watching the road. In addition, a hands-free car-kit system would permit the driver to maintain both hands on the steering wheel during initiation of a telephone call. An exemplary vocabulary for a hands-free car kit might include the ten digits; the keywords “call,” “send,” “dial,” “cancel,” “clear,” “add,” “delete,” “history,” “program,” “yes,” and “no”; and the names of a predefined number of commonly called coworkers, friends, or family members.
A voice recognizer, i.e., a VR system, comprises an acoustic processor, also called the front-end of a voice recognizer, and a word decoder, also called the backend of a voice recognizer. The acoustic processor performs feature extraction: it extracts from the incoming raw speech a sequence of information-bearing features (vectors) necessary for VR. The word decoder decodes this sequence of features (vectors) to yield output in a meaningful and desired format, such as a sequence of linguistic words corresponding to the input utterance.
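The front-end/backend split described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names acoustic_processor, word_decoder, and FRAME_LEN are hypothetical, per-frame log energy stands in for real VR features, and nearest-template matching stands in for a real word decoder.

```python
import math
from typing import Dict, List, Sequence

FRAME_LEN = 4  # samples per frame; hypothetical value, kept small for illustration


def acoustic_processor(samples: Sequence[float]) -> List[float]:
    """Front-end: reduce raw samples to one log-energy feature per full frame."""
    features = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN):
        frame = samples[start:start + FRAME_LEN]
        energy = sum(s * s for s in frame) / FRAME_LEN
        features.append(math.log(energy + 1e-9))  # small offset avoids log(0)
    return features


def word_decoder(features: Sequence[float],
                 templates: Dict[str, Sequence[float]]) -> str:
    """Backend: return the vocabulary word whose stored template is closest."""
    def distance(a: Sequence[float], b: Sequence[float]) -> float:
        if len(a) != len(b):
            return math.inf  # templates of a different length never match
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(templates, key=lambda word: distance(features, templates[word]))


# Usage: build two toy templates, then decode a new utterance.
templates = {
    "call": acoustic_processor([1.0] * 8),
    "cancel": acoustic_processor([0.1] * 8),
}
decoded = word_decoder(acoustic_processor([0.9] * 8), templates)
```

The point of the sketch is only the division of labor: the word decoder never sees raw samples, only the feature sequence produced by the acoustic processor.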
In a voice recognizer implementation using a distributed system architecture, it is often desirable to place the word-decoding task at the subsystem that can absorb the computational and memory load appropriately, namely at a network server. The acoustic processor, by contrast, should reside as close to the speech source as possible, namely at a user device, to reduce the effects, introduced by signal processing and/or channel-induced errors, of vocoders (used for compressing speech prior to transmission). Thus, in a Distributed Voice Recognition (DVR) system, the acoustic processor resides within the user device and the word decoder resides on a network.
DVR systems enable devices such as cell phones, personal communications devices, and personal digital assistants (PDAs) to access information and services from a wireless network, such as the Internet, using spoken commands, by accessing voice recognition servers on the network.
Air interface methods degrade the accuracy of voice recognition systems in wireless applications. This degradation can be mitigated by extracting VR features from a user's spoken commands on a device, such as a subscriber unit (also called a subscriber station, mobile station, mobile, remote station, remote terminal, access terminal, and user equipment), and transmitting the VR features in data traffic, instead of transmitting the spoken commands in voice traffic.
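As a rough illustration of carrying VR features in data traffic, the sketch below serializes a sequence of feature vectors into a byte payload and recovers it on the receiving side. The payload layout (a 16-bit frame count, a 16-bit vector size, then 32-bit floats in network byte order) and the names pack_features and unpack_features are assumptions made for this example; they are not a format described in the source.

```python
import struct
from typing import List, Sequence


def pack_features(frames: Sequence[Sequence[float]]) -> bytes:
    """Serialize feature vectors: uint16 count, uint16 size, then float32 values."""
    dim = len(frames[0]) if frames else 0
    payload = struct.pack("!HH", len(frames), dim)
    for frame in frames:
        if len(frame) != dim:
            raise ValueError("all feature vectors must have the same size")
        payload += struct.pack(f"!{dim}f", *frame)
    return payload


def unpack_features(payload: bytes) -> List[List[float]]:
    """Inverse of pack_features, as a network-side receiver might apply it."""
    count, dim = struct.unpack_from("!HH", payload, 0)
    frames = []
    offset = 4  # size of the two uint16 header fields
    for _ in range(count):
        frames.append(list(struct.unpack_from(f"!{dim}f", payload, offset)))
        offset += 4 * dim  # each float32 occupies 4 bytes
    return frames


# Usage: two 2-dimensional feature vectors survive the round trip.
frames = [[0.5, -1.25], [2.0, 0.0]]
payload = pack_features(frames)
restored = unpack_features(payload)
```

Because the payload is ordinary data, it is not passed through a vocoder, which is the degradation the paragraph above seeks to avoid.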
Thus, in a DVR system, front-end features are extracted in the device and sent to the network. A device may be mobile or stationary, and may communicate with one or more base stations (BSs) (also called cellular base stations, cell base stations, base transceiver systems (BTSs), base station transceivers, central communication centers, access points, access nodes, Node Bs, and modem pool transceivers (MPTs)).
Complex voice recognition tasks require significant computational resources. It is not practical to implement such systems on a subscriber unit with limited CPU, memory, and battery resources. DVR systems leverage the computational resources available on the network. In a typical DVR system, the word decoder has greater computational and memory requirements than the front-end of the voice recognizer. Thus, a server-based VR system within the network serves as the backend of the voice recognition system and performs word decoding. This has the benefit of performing complex VR tasks using the resources on the network. Examples of DVR systems are disclosed in U.S. Pat. No. 5,956,683, entitled “Distributed Voice Recognition System,” assigned to the assignee of the present invention and incorporated by reference herein.
In addition to feature extraction being performed on the subscriber unit, simple VR tasks can be performed on the subscriber unit, in which case the VR system on the network is not used for simple VR tasks. Consequently, network traffic is reduced with the result that the cost of providing speech-enabled services is reduced.
Notwithstanding the subscriber unit performing simple VR tasks, traffic congestion on the network can result in subscriber units obtaining poor service from the server-based VR system. A distributed VR system enables rich user interface features using complex VR tasks, but at the price of increasing network traffic and sometimes delay. If a local VR engine on the subscriber unit does not recognize a user's spoken commands, then the user's spoken commands have to be transmitted to the server-based VR engine after front-end processing, thereby increasing network traffic and network congestion. Network congestion occurs when a large quantity of network traffic is being transmitted at the same time from the subscriber unit to the server-based VR system. After the spoken commands are interpreted by a network-based VR engine, the results have to be transmitted back to the subscriber unit, which can introduce a significant delay if there is network congestion.
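The local-first behavior described above, in which features go to the server-based VR engine only when the local VR engine fails to recognize a command, might be sketched as follows. The names recognize, local_engine, and send_to_server are hypothetical, and the stub engines exist only to make the example runnable.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Features = Sequence[float]


def recognize(features: Features,
              local_engine: Callable[[Features], Optional[str]],
              send_to_server: Callable[[Features], str]) -> Tuple[str, bool]:
    """Try the local VR engine first; use the network only if it fails.

    Returns the recognized text and a flag telling whether network
    traffic was generated for this utterance.
    """
    local_result = local_engine(features)
    if local_result is not None:
        return local_result, False  # simple task handled on the subscriber unit
    return send_to_server(features), True  # features cross the network


# Usage with stub engines (assumptions for illustration only).
server_calls: List[Features] = []


def stub_local_engine(features: Features) -> Optional[str]:
    # Pretend the local engine knows only one short command.
    return "yes" if list(features) == [1.0] else None


def stub_server(features: Features) -> str:
    server_calls.append(features)  # record each network round trip
    return "call home"


first = recognize([1.0], stub_local_engine, stub_server)
second = recognize([0.2, 0.7], stub_local_engine, stub_server)
```

Every utterance that falls through to the server adds both an uplink transmission and a downlink result, which is where the congestion and delay discussed above arise.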
Thus, in a DVR system, there is a need for a system and method to reduce network congestion and to reduce delay. A system and method that reduces network congestion and reduces delay would improve VR performance.