1. Field
The present invention relates generally to the field of communications and more specifically to transmitting speech activity in a distributed voice recognition system.
2. Background
Voice recognition (VR) represents an important technique enabling a machine with simulated intelligence to recognize user-voiced commands and to facilitate a human interface with the machine. VR also represents a key technique for human speech understanding. Systems employing techniques to recover a linguistic message from an acoustic speech signal are called voice recognizers.
VR, also known as speech recognition, provides certain safety benefits to the public. For example, VR may be employed to replace the manual task of pushing buttons on a wireless keypad, a particularly useful replacement when the operator is using a wireless handset while driving an automobile. When a driver uses a wireless telephone without VR capability, he or she must remove a hand from the steering wheel and look at the telephone keypad while pushing buttons to dial the call. Such actions tend to increase the probability of an automobile accident. A speech-enabled automobile telephone, or telephone designed for speech recognition, enables the driver to place telephone calls while continuously monitoring the road. In addition, a hands-free automobile wireless telephone system allows the driver to keep both hands on the steering wheel while initiating a phone call. A sample vocabulary for a simple hands-free automobile wireless telephone kit might include the 10 digits, the keywords “call,” “send,” “dial,” “cancel,” “clear,” “add,” “delete,” “history,” “program,” “yes,” and “no,” and the names of a predefined number of commonly called co-workers, friends, or family members.
A voice recognizer, or VR system, comprises an acoustic processor, also called the front end of a voice recognizer, and a word decoder, also called the back end of the voice recognizer. The acoustic processor performs feature extraction for the system by extracting a sequence of information bearing features, or vectors, necessary for performing voice recognition on the incoming raw speech. The word decoder subsequently decodes the sequence of features, or vectors, to provide a meaningful and desired output, such as the sequence of linguistic words corresponding to the received input utterance.
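The division of labor between the acoustic processor and the word decoder can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from any actual VR system: the 8 kHz sample rate, the 10 ms frame length, the two toy features (log energy and zero-crossing rate), and the nearest-template decoder, which stands in for a real word decoder. The function names are hypothetical.

```python
import math

SAMPLE_RATE = 8000  # Hz (assumed for illustration)
FRAME_LEN = 80      # samples per frame, i.e. 10 ms at 8 kHz (assumed)

def tone(freq, count):
    """Hypothetical test signal: a sine wave standing in for raw speech."""
    return [math.sin(2 * math.pi * freq * n / SAMPLE_RATE + 0.3) for n in range(count)]

def extract_features(samples):
    """Front end (acoustic processor): one feature vector per complete frame."""
    features = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN):
        frame = samples[start:start + FRAME_LEN]
        energy = sum(s * s for s in frame) / FRAME_LEN
        log_energy = math.log10(energy + 1e-10)  # small offset avoids log(0)
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        features.append((log_energy, crossings / (FRAME_LEN - 1)))
    return features

def mean_vector(features):
    """Average the per-frame vectors into a single vector."""
    n = len(features)
    return tuple(sum(f[i] for f in features) / n for i in range(2))

def decode(features, templates):
    """Back end (word decoder): return the label of the closest stored template."""
    observed = mean_vector(features)
    return min(templates, key=lambda word: math.dist(observed, templates[word]))

# Toy usage: two "words" represented by a low and a high tone.
templates = {
    "low": mean_vector(extract_features(tone(200, 800))),
    "high": mean_vector(extract_features(tone(2000, 800))),
}
print(decode(extract_features(tone(220, 800)), templates))
```

The sketch keeps the two stages behind separate functions so that only the output of `extract_features` crosses the boundary between them, which is the property a distributed architecture relies on.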
In a voice recognizer implementation using a distributed system architecture, it is often desirable to place the word decoding task on a subsystem having the ability to appropriately manage computational and memory load, such as a network server. The acoustic processor should physically reside as close to the speech source as possible to reduce adverse effects associated with vocoders. Vocoders compress speech prior to transmission, and can in certain circumstances introduce adverse characteristics due to signal processing and/or channel induced errors. These effects typically result from vocoding at the user device. The advantage to a Distributed Voice Recognition (DVR) system is that the acoustic processor resides in the user device and the word decoder resides remotely, such as on a network, thereby decreasing the risk of user device signal processing errors or channel errors.
DVR systems enable devices such as cell phones, personal communications devices, personal digital assistants (PDAs), and other devices to access information and services from a wireless network, such as the Internet, using spoken commands. These devices access voice recognition servers on the network and are much more versatile, robust and useful than systems recognizing only limited vocabulary sets.
In wireless applications, air interface methods degrade the overall accuracy of the voice recognition systems. This degradation can in certain circumstances be mitigated by extracting VR features from a user's spoken commands. Extraction occurs on a device, such as a subscriber unit, also called a subscriber station, mobile station, mobile, remote station, remote terminal, access terminal, or user equipment. The subscriber unit can transmit the VR features in data traffic, rather than transmitting spoken words in voice traffic.
Thus, in a DVR system, front end features are extracted at the device and are sent to the network. A device may be mobile or stationary, and may communicate with one or more base stations (BSes), also called cellular base stations, cell base stations, base transceiver systems (BTSes), base station transceivers, central communication centers, access points, access nodes, Node Bs, and modem pool transceivers (MPTs).
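As a rough illustration of sending front end features as data rather than sending speech itself, the following Python sketch serializes a list of feature vectors into a byte payload and recovers them on the receiving side. The payload layout (a 16-bit frame count, an 8-bit vector size, then 32-bit floats in network byte order) and the function names are hypothetical; they do not describe any real air interface or DVR protocol.

```python
import struct

HEADER = "!HB"  # assumed layout: uint16 frame count, uint8 values per vector

def pack_features(frames):
    """Device side: serialize feature vectors into a compact byte payload."""
    dim = len(frames[0]) if frames else 0
    payload = struct.pack(HEADER, len(frames), dim)
    for vec in frames:
        payload += struct.pack(f"!{dim}f", *vec)  # float32, network byte order
    return payload

def unpack_features(payload):
    """Network side: recover the feature vectors from the payload."""
    count, dim = struct.unpack_from(HEADER, payload, 0)
    offset = struct.calcsize(HEADER)
    frames = []
    for _ in range(count):
        frames.append(struct.unpack_from(f"!{dim}f", payload, offset))
        offset += 4 * dim  # each float32 occupies 4 bytes
    return frames

# Toy usage: three two-element feature vectors.
features = [(0.5, -1.25), (2.0, 0.75), (-3.5, 8.0)]
payload = pack_features(features)
print(len(payload), unpack_features(payload) == features)
```

Under these assumptions each two-element vector costs 8 bytes, whereas an 80-sample frame of uncompressed 16-bit audio would cost 160 bytes. The figures only illustrate why features can be smaller than the speech they describe; actual sizes depend on the feature set and coding used.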
Complex voice recognition tasks require significant computational resources. Such systems cannot practically reside on a subscriber unit having limited CPU, battery, and memory resources. Distributed systems leverage the computational resources available on the network. In a typical DVR system, the word decoder has significantly higher computational and memory requirements than the front end of the voice recognizer. Thus, a server-based voice recognition system within the network serves as the back end of the voice recognition system and performs word decoding. Using the server-based VR system as the back end provides the benefit of performing complex VR tasks using network resources rather than user device resources. Examples of DVR systems are disclosed in U.S. Pat. No. 5,956,683, entitled “Distributed Voice Recognition System,” assigned to the assignee of the present invention and incorporated by reference herein.
The subscriber unit may perform simple VR tasks in addition to the feature extraction function. Performing these functions at the user terminal frees the network of the need to engage in simple VR tasks, thereby reducing network traffic and the associated cost of providing speech-enabled services. In certain circumstances, traffic congestion on the network can result in poor service for subscriber units from the server-based VR system. A distributed VR system enables rich user interface features using complex VR tasks, with the downside of increased network traffic and occasional delay.
As part of the VR system, it can be beneficial to reduce network traffic by transmitting data smaller than actual speech over the air interface, such as speech features or other voice parameters. It has been found that the use of a Voice Activity Detection (VAD) module in the mobile device can reduce network traffic by converting speech into frames and transmitting those frames over the air interface. However, in particular circumstances, the nature and quality of the content of these frames can drastically affect overall system performance. Speech subsets that operate under one set of circumstances may in other circumstances require excessive processing at the server, thereby diminishing the quality of the conversation.
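One simple way to picture a VAD module on the device is a per-frame energy threshold, sketched below in Python. This is only an assumed stand-in: the text does not specify how voice activity is detected, and the frame length, the threshold value, and the function names are hypothetical choices made for illustration.

```python
FRAME_LEN = 80           # samples per frame (assumed: 10 ms at 8 kHz)
ENERGY_THRESHOLD = 0.01  # mean-square energy above which a frame counts as speech (assumed)

def split_frames(samples, frame_len=FRAME_LEN):
    """Cut the signal into consecutive complete frames, dropping any remainder."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def detect_voice_activity(samples, threshold=ENERGY_THRESHOLD):
    """Return one boolean per frame: True where the frame energy exceeds the threshold."""
    flags = []
    for frame in split_frames(samples):
        energy = sum(s * s for s in frame) / len(frame)
        flags.append(energy > threshold)
    return flags

def select_active_frames(samples, threshold=ENERGY_THRESHOLD):
    """Keep only (index, frame) pairs flagged as speech, i.e. the frames worth transmitting."""
    flags = detect_voice_activity(samples, threshold)
    return [(i, frame)
            for i, (frame, active) in enumerate(zip(split_frames(samples), flags))
            if active]

# Toy usage: silence, then a burst of signal, then silence again.
signal = [0.0] * 160 + [0.5] * 160 + [0.0] * 80
print(detect_voice_activity(signal))
print([index for index, _ in select_active_frames(signal)])
```

A fixed threshold like this is easily fooled by background noise, which is one concrete way the content of the transmitted frames can end up depending on the circumstances in which the device operates.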
In a DVR system, a need exists for a reduction in overall network congestion and the amount of delay in the system as well as the ability to provide efficient voice activity detection functionality for the system based on circumstances presented.