A variety of automatic speech recognizers (ASRs) exist for performing functions such as converting speech into text and controlling the operations of a computer in response to speech. Some applications of automatic speech recognizers require shorter turnaround times (the amount of time between when the speech is spoken and when the speech recognizer produces output) than others in order to appear responsive to the end user. For example, a speech recognizer that is used for a “live” speech recognition application, such as controlling the movement of an on-screen cursor, may require a shorter turnaround time (also referred to as a “response time”) than a speech recognizer that is used to produce a transcript of a medical report.
The desired turnaround time may depend, for example, on the content of the speech utterance that is processed by the speech recognizer. For example, for a short command-and-control utterance, such as “close window,” a turnaround time above 500 ms may appear sluggish to the end user. In contrast, for a long dictated sentence which the user desires to transcribe into text, response times of 1000 ms may be acceptable to the end user. In fact, in the latter case users may prefer longer response times because they may otherwise feel that their speech is being interrupted by the immediate display of text in response to their speech. For longer dictated passages, such as entire paragraphs, even longer response times of multiple seconds may be acceptable to the end user.
In typical prior art speech recognition systems, reducing response time while maintaining recognition accuracy requires increasing the computing resources (processing cycles and/or memory) that are dedicated to performing speech recognition. As a result, many applications which require fast response times require the speech recognition system to execute on the same computer as that on which the applications themselves execute. Although such collocation may eliminate the delay that would otherwise be introduced by requiring the speech recognition results to be transmitted to the requesting application over a network, such collocation also has a variety of disadvantages.
For example, collocation requires a speech recognition system to be installed on every end user device that requires speech recognition functionality, such as every desktop computer, laptop computer, cellular telephone, and personal digital assistant (PDA). Installing and maintaining such speech recognition systems on such a large number and wide variety of devices can be tedious and time-consuming for end users and system administrators. For example, such maintenance requires system binaries to be updated when a new release of the speech recognition system becomes available. User data, such as speech models, are created and accumulated over time on individual devices, taking up precious storage space, and need to be synchronized across the multiple devices used by the same user. Such maintenance can grow particularly burdensome as users continue to use speech recognition systems on a greater number and wider variety of devices.
Furthermore, locating a speech recognition system on the end user device causes the speech recognition system to consume precious computing resources, such as CPU processing cycles, main memory, and disk space. Such resources are particularly scarce on handheld mobile devices such as cellular telephones. Producing speech recognition results with fast turnaround times using such devices typically requires sacrificing recognition accuracy and reducing the resources available to other applications executing on the same device.
One known technique for overcoming these resource constraints in the context of embedded devices is to delegate some or all of the speech recognition processing responsibility to a speech recognition server that is located remotely from the embedded device and which has significantly greater computing resources than the embedded device. When a user speaks into the embedded device in this situation, the embedded device does not attempt to recognize the speech using its own computing resources. Instead, the embedded device transmits the speech (or a processed form of it) over a network connection to the speech recognition server, which recognizes the speech using its greater computing resources and therefore produces recognition results more quickly than the embedded device could have produced with the same accuracy. The speech recognition server then transmits the results back over the network connection to the embedded device. Ideally this technique produces highly accurate speech recognition results more quickly than would otherwise be possible using the embedded device alone.
In practice, however, this “server-side speech recognition” technique has a variety of shortcomings. In particular, because server-side speech recognition relies on the availability of high-speed and reliable network connections, the technique breaks down if such connections are not available when needed. For example, the potential increases in speed made possible by server-side speech recognition may be negated by use of a network connection with insufficient bandwidth or excessive latency. As one example, the typical network latency of an HTTP call to a remote server can range from 100 ms to 500 ms. If spoken data arrives at a speech recognition server 500 ms after it is spoken, it will be impossible for that server to produce results quickly enough to satisfy the maximum turnaround time (500 ms) that command-and-control applications can tolerate. As a result, even the fastest speech recognition server will produce results that appear sluggish if used in combination with a slow network connection.
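The latency arithmetic described above may be illustrated by the following minimal sketch. The function name `remaining_budget_ms` and the constant names are hypothetical and are introduced only for illustration; the 500 ms and 1000 ms limits and the 100 ms to 500 ms latency range are the figures given above. The sketch assumes that only the delay in delivering the speech to the server is counted, so the time needed to return the results to the client would reduce the remaining budget further.

```python
def remaining_budget_ms(turnaround_limit_ms: float, network_latency_ms: float) -> float:
    """Return the time left for recognition once the speech has reached the server.

    Assumption: only the client-to-server delay is subtracted; the return
    trip of the results is ignored and would shrink the budget further.
    """
    return turnaround_limit_ms - network_latency_ms


# Illustrative limits taken from the figures cited in the text.
COMMAND_LIMIT_MS = 500             # short command-and-control utterance
DICTATED_SENTENCE_LIMIT_MS = 1000  # long dictated sentence

for latency_ms in (100, 300, 500):
    command_budget = remaining_budget_ms(COMMAND_LIMIT_MS, latency_ms)
    dictation_budget = remaining_budget_ms(DICTATED_SENTENCE_LIMIT_MS, latency_ms)
    print(
        f"latency {latency_ms} ms: "
        f"command budget {command_budget} ms, "
        f"dictation budget {dictation_budget} ms"
    )
```

At the upper end of the latency range, the budget for a command-and-control utterance falls to zero before the server has begun processing, whereas a dictated sentence still leaves the server time in which to work.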
Furthermore, conventional server-side speech recognition techniques assume that the network connection established between the client (e.g., embedded device) and speech recognition server is kept alive continuously during the entire recognition process. Although it may be possible to satisfy this condition in a Local Area Network (LAN) or when both client and server are managed by the same entity, this condition may be impossible or at least unreasonable to satisfy when the client and server are connected over a Wide Area Network (WAN) or the Internet, in which case interruptions to the network connection may be common and unavoidable.
Furthermore, organizations often restrict the kinds of communications that their users can engage in over public networks such as the Internet. For example, organizations may only allow clients within their networks to engage in outbound communications. This means that a client can contact an external server on a certain port, but that the server cannot initiate contact with the client. This is an example of one-way communication.
Another common restriction imposed on clients is that they may only use a limited range of outbound ports to communicate with external servers. Furthermore, outgoing communication on those ports may be required to be encrypted. For example, clients often are allowed to use only the standard HTTP port (port 80) or the standard secure, encrypted HTTPS port (port 443).
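The restrictions described above may be summarized by the following minimal sketch of a connection policy check. The names `Connection`, `ALLOWED_OUTBOUND_PORTS`, and `is_permitted` are hypothetical and do not correspond to any particular firewall product or API; the sketch assumes a policy that permits only client-initiated connections on ports 80 and 443 and that may optionally require encryption.

```python
from dataclasses import dataclass

# Assumption: the organization permits only the standard HTTP and HTTPS ports.
ALLOWED_OUTBOUND_PORTS = frozenset({80, 443})


@dataclass(frozen=True)
class Connection:
    """Hypothetical description of a connection attempt crossing the network boundary."""
    initiated_by_client: bool  # True if the client inside the network opened the connection
    port: int                  # destination port on the external server
    encrypted: bool            # True if the traffic is encrypted (e.g., HTTPS)


def is_permitted(conn: Connection, require_encryption: bool = False) -> bool:
    """Return True if the connection satisfies the illustrative policy."""
    if not conn.initiated_by_client:
        return False  # the server may not initiate contact with the client
    if conn.port not in ALLOWED_OUTBOUND_PORTS:
        return False  # only a limited range of outbound ports is allowed
    if require_encryption and not conn.encrypted:
        return False  # outgoing traffic may be required to be encrypted
    return True


print(is_permitted(Connection(initiated_by_client=True, port=443, encrypted=True)))
print(is_permitted(Connection(initiated_by_client=False, port=443, encrypted=True)))
```

Under such a policy, a speech recognition server cannot push results to a client on its own initiative; any results must travel over a connection that the client itself has opened.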
What is needed, therefore, are improved techniques for producing speech recognition results with fast response times without overburdening the limited computing resources of client devices.