A variety of automatic speech recognizers (ASRs) exist for performing functions such as converting speech into text and controlling the operations of a computer in response to speech. Some applications of automatic speech recognizers require shorter turnaround times (the amount of time between when the speech is spoken and when the speech recognizer produces output) than others in order to appear responsive to the end user. For example, a speech recognizer that is used for a “live” speech recognition application, such as controlling the movement of an on-screen cursor, may require a shorter turnaround time (also referred to as a “response time”) than a speech recognizer that is used to produce a transcript of a medical report.
The desired turnaround time may depend, for example, on the content of the speech utterance that is processed by the speech recognizer. For example, for a short command-and-control utterance, such as “close window,” a turnaround time above 500 ms may appear sluggish to the end user. In contrast, for a long dictated sentence which the user desires to transcribe into text, response times of 1000 ms may be acceptable to the end user. In fact, in the latter case users may prefer longer response times because they may otherwise feel that their speech is being interrupted by the immediate display of text in response to their speech. For longer dictated passages, such as entire paragraphs, even longer response times of multiple seconds may be acceptable to the end user.
In typical prior art speech recognition systems, improving response time while maintaining recognition accuracy requires increasing the computing resources (processing cycles and/or memory) that are dedicated to performing speech recognition. Similarly, in such systems, recognition accuracy can typically be increased without sacrificing response time only by increasing the computing resources that are dedicated to performing speech recognition. One consequence of these tradeoffs is that when a given speech recognizer is ported from a desktop computer platform to an embedded system with fewer computing resources, such as a cellular telephone, recognition accuracy must typically be sacrificed if the same response time is to be maintained.
One known technique for overcoming these resource constraints in the context of embedded devices is to delegate some or all of the speech recognition processing responsibility to a speech recognition server that is located remotely from the embedded device and that has significantly greater computing resources than the embedded device. When a user speaks into the embedded device in this situation, the embedded device does not attempt to recognize the speech using its own computing resources. Instead, the embedded device transmits the speech (or a processed form of it) over a network connection to the speech recognition server, which recognizes the speech using its greater computing resources and therefore produces recognition results more quickly than the embedded device could have produced results of the same accuracy. The speech recognition server then transmits the results back over the network connection to the embedded device. Ideally, this technique produces highly accurate speech recognition results more quickly than would otherwise be possible using the embedded device alone.
In practice, however, this server-side speech recognition technique has a variety of shortcomings. In particular, because server-side speech recognition relies on the availability of high-speed and reliable network connections, the technique breaks down if such connections are not available when needed. For example, the potential increases in speed made possible by server-side speech recognition may be negated by use of a network connection with insufficient bandwidth or excessive latency. As one example, the typical network latency of an HTTP call to a remote server can range from 100 ms to 500 ms. If spoken data arrives at a speech recognition server 500 ms after it is spoken, it will be impossible for that server to produce results quickly enough to satisfy the maximum turnaround time (500 ms) that command-and-control applications can tolerate, because the entire time budget has been consumed before recognition begins. As a result, even the fastest speech recognition server will produce results that appear sluggish if used in combination with a slow network connection.
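The latency arithmetic in the preceding paragraph can be illustrated with a minimal sketch. The function names, the separate return-leg delay, and the 150 ms server recognition time are hypothetical values chosen for illustration; only the 500 ms limit and the 100 ms to 500 ms network latency range come from the discussion above.

```python
# Illustrative latency budget for server-side speech recognition.
# Function names and the 150 ms recognition time are assumptions,
# not values taken from any particular system.

def total_turnaround_ms(uplink_ms, recognition_ms, downlink_ms=0):
    """Time from the end of speech until results reach the device."""
    return uplink_ms + recognition_ms + downlink_ms


def meets_turnaround(limit_ms, uplink_ms, recognition_ms, downlink_ms=0):
    """True if the total turnaround does not exceed the limit."""
    return total_turnaround_ms(uplink_ms, recognition_ms, downlink_ms) <= limit_ms


COMMAND_LIMIT_MS = 500  # command-and-control limit discussed above

# Slow connection: speech reaches the server 500 ms after it is spoken,
# so any nonzero recognition time exceeds the limit.
print(meets_turnaround(COMMAND_LIMIT_MS, uplink_ms=500, recognition_ms=150))  # False

# Fast connection: 100 ms each way leaves 300 ms for recognition.
print(meets_turnaround(COMMAND_LIMIT_MS, uplink_ms=100, recognition_ms=150,
                       downlink_ms=100))  # True
```

Under these assumptions, the server's recognition speed matters only when the network legs leave part of the time budget unspent.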
What is needed, therefore, are improved techniques for producing high-quality speech recognition results for embedded devices within the turnaround times required by those devices, but without requiring low-latency high-availability network connections.