The invention relates generally to speech recognition, and more specifically, to distributed speech recognition between a wireless device and a communication server.
With the growth of speech recognition capabilities, there is a corresponding increase in the number of applications and uses for speech recognition. Different types of speech recognition applications and systems have been developed, based upon the location of the speech recognition engine with respect to the user. One such example is an embedded speech recognition engine, otherwise known as a local speech recognition engine, such as a Speech2Go speech recognition engine sold by Speech Works International, Inc., 695 Atlantic Avenue, Boston, Mass. 02111. Another type of speech recognition engine is a network-based speech recognition engine, such as Speech Works 6, as sold by Speech Works International, Inc., 695 Atlantic Avenue, Boston, Mass. 02111.
Embedded or local speech recognition engines provide the added benefit of reduced latency in recognizing a speech input, wherein a speech input includes any type of audible or audio-based input. One of the drawbacks of embedded or local speech recognition engines is that these engines support only a limited vocabulary. Due to memory limitations and system processing requirements, in conjunction with power consumption limitations, embedded or local speech recognition engines can recognize only a fraction of the speech inputs that would be recognizable by a network-based speech recognition engine.
Network-based speech recognition engines provide the added benefit of an increased vocabulary, based on the elimination of memory and processing restrictions. A downside, however, is the added latency between when a user provides a speech input and when the speech input is recognized and provided back to the end user for confirmation of recognition. Other disadvantages include the requirement for continuous availability of the communication path, the resulting increased server load, and the cost to the user of connection and service. In a typical speech recognition system, the user provides the speech input and the speech input is thereupon provided to a server across a communication path, whereupon it may then be recognized. Extra latency is incurred not only in transmitting the speech input to the network-based speech recognition engine, but also in transmitting the recognized speech input, or an N-best list, back to the end user.
One proposed solution to overcoming the inherent limitations of embedded speech recognition engines and the latency problems associated with network-based speech recognition engines is to preliminarily attempt to recognize all speech inputs with the embedded speech recognition engine. Thereupon, a determination is made as to whether the local speech recognition engine has properly recognized the speech input, based upon, among other things, a recognition confidence level. If it is determined that the speech input has not been recognized by the local speech recognition engine, that is, that the confidence level is below a threshold value, the speech input is thereupon provided to a network-based speech recognition engine. This solution, while eliminating latency issues with respect to speech inputs that are recognized by the embedded speech recognition engine, adds an extra latency step for all other inputs by first attempting to recognize the speech input locally. Therefore, when a speech input must be recognized using the network-based speech recognition engine, the user is required to incur a further delay.
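By way of illustration only, the local-first approach described above may be sketched as follows. This is a minimal sketch and not an implementation of any particular product: the names Result, recognize_with_fallback, stub_local and stub_network, the 0.7 threshold, and the use of a byte string to stand in for audio are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    """Hypothetical recognition result; the fields are assumptions for this sketch."""
    text: str
    confidence: float  # assumed to lie between 0.0 and 1.0
    source: str        # "local" or "network"


Recognizer = Callable[[bytes], Result]


def recognize_with_fallback(audio: bytes,
                            local: Recognizer,
                            network: Recognizer,
                            threshold: float = 0.7) -> Result:
    """Try the embedded engine first; use the network engine only on low confidence."""
    local_result = local(audio)
    if local_result.confidence >= threshold:
        # Recognized locally: no network round trip is incurred.
        return local_result
    # Local attempt failed: the time spent on it is added to the network latency.
    return network(audio)


def stub_local(audio: bytes) -> Result:
    """Stand-in for an embedded engine with a small vocabulary (assumption)."""
    vocabulary = {b"call home": "call home"}
    if audio in vocabulary:
        return Result(vocabulary[audio], 0.95, "local")
    return Result("", 0.10, "local")


def stub_network(audio: bytes) -> Result:
    """Stand-in for a network-based engine with a large vocabulary (assumption)."""
    return Result(audio.decode("utf-8"), 0.90, "network")


if __name__ == "__main__":
    for utterance in (b"call home", b"navigate to the airport"):
        result = recognize_with_fallback(utterance, stub_local, stub_network)
        print(utterance, "->", result.source, result.text)
```

In this sketch, an utterance outside the embedded vocabulary is passed through both recognizers in sequence, which corresponds to the further delay noted above.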
Another proposed solution to overcoming the limitations of embedded speech recognition engines and network-based speech recognition engines is to attempt to recognize the speech input both at the local level, using the embedded speech recognition engine, and at the server level, using the network-based speech recognition engine. Thereupon, the two recognized speech inputs are compared and the user is provided with a best guess at the recognized input. Once again, this solution requires the usage of the network-based speech recognition engine, which may add extra latency even if the speech input is recognizable by the embedded speech recognition engine.
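By way of illustration only, the parallel approach described above may be sketched as follows. The source does not specify how the two results are compared; selecting the result with the higher confidence is an assumption made for this sketch, as are the names Result, recognize_in_parallel, stub_local and stub_network.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    """Hypothetical recognition result; the fields are assumptions for this sketch."""
    text: str
    confidence: float  # assumed to lie between 0.0 and 1.0
    source: str        # "local" or "network"


Recognizer = Callable[[bytes], Result]


def recognize_in_parallel(audio: bytes,
                          local: Recognizer,
                          network: Recognizer) -> Result:
    """Run both engines concurrently and return the higher-confidence result."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        local_future = pool.submit(local, audio)
        network_future = pool.submit(network, audio)
        # Both results are awaited, so the slower (network) engine sets the
        # overall latency even when the local engine alone would have sufficed.
        candidates = [local_future.result(), network_future.result()]
    # Assumed comparison rule: highest confidence wins (local wins a tie).
    return max(candidates, key=lambda r: r.confidence)


def stub_local(audio: bytes) -> Result:
    """Stand-in for an embedded engine with a small vocabulary (assumption)."""
    vocabulary = {b"call home": "call home"}
    if audio in vocabulary:
        return Result(vocabulary[audio], 0.95, "local")
    return Result("", 0.10, "local")


def stub_network(audio: bytes) -> Result:
    """Stand-in for a network-based engine with a large vocabulary (assumption)."""
    return Result(audio.decode("utf-8"), 0.90, "network")


if __name__ == "__main__":
    for utterance in (b"call home", b"navigate to the airport"):
        result = recognize_in_parallel(utterance, stub_local, stub_network)
        print(utterance, "->", result.source, result.text)
```

Because the comparison cannot be made until the network-based result arrives, the sketch waits on the network engine for every input, including inputs within the embedded vocabulary.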