An automatic speech recognition (ASR) system determines what a speech input says. Typically, the input speech is processed into a sequence of digital speech feature frames. Each speech feature frame can be thought of as a multi-dimensional vector that represents various characteristics of the speech signal present during a short time window of the speech. For example, the multi-dimensional vector of each speech frame can be derived from cepstral features of the short-time Fourier transform spectrum of the speech signal, such as mel-frequency cepstral coefficients (MFCCs), which reflect the short-time power or component of a given frequency band, as well as the corresponding first- and second-order derivatives (“deltas” and “delta-deltas”). In a continuous recognition system, variable numbers of speech frames are organized as “utterances” representing a period of speech followed by a pause, which in real life loosely corresponds to a spoken sentence or phrase.
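The following is a minimal sketch of how first- and second-order derivatives might be appended to static speech features. It assumes a simple two-point central difference with the edge frames repeated; practical front ends often use a regression over a wider window instead. The function names `delta` and `add_deltas` and the input values are hypothetical and chosen only for illustration.

```python
from typing import List

Frame = List[float]


def delta(frames: List[Frame]) -> List[Frame]:
    """Approximate the time derivative of each feature with a central difference.

    Edge frames are handled by repeating the first and last frame (an assumption).
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev_f = frames[max(t - 1, 0)]
        next_f = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_f, next_f)])
    return out


def add_deltas(static: List[Frame]) -> List[Frame]:
    """Concatenate static features with their deltas and delta-deltas."""
    d1 = delta(static)   # first-order derivatives ("deltas")
    d2 = delta(d1)       # second-order derivatives ("delta-deltas")
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Toy input: 4 frames of 2 static features each (made-up numbers, not real MFCCs).
static_frames = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
full_frames = add_deltas(static_frames)
print(len(full_frames), len(full_frames[0]))
```

Each output frame is three times as long as the static frame, since it carries the static values followed by their deltas and delta-deltas.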
The ASR system compares the input utterances against statistical acoustic models to find the models that best match the vector sequence characteristics, and determines the representative text associated with those models. More formally, given some input observations A, the probability that some string of words W was spoken is represented as P(W|A), where the ASR system attempts to determine the most likely word string:
$$\hat{W} = \arg\max_{W} P(W \mid A)$$
Given a system of statistical acoustic models, this formula can be re-expressed as:
$$\hat{W} = \arg\max_{W} P(W)\,P(A \mid W)$$
where P(A|W) corresponds to the acoustic models and P(W) reflects the prior probability of the word sequence as provided by a statistical language model.
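The re-expression follows from Bayes' rule, P(W|A) = P(W) P(A|W) / P(A), because P(A) does not depend on W and so does not affect which word string maximizes the expression. As a minimal sketch, the decision rule can be applied to a small set of candidate word strings as shown below. The candidate strings and all probability values are hypothetical, and the sum of log probabilities is used only as a common way to avoid numerical underflow.

```python
import math

# Hypothetical scores for three candidate word strings W given one utterance A.
# lm_prob stands in for P(W); acoustic_likelihood stands in for P(A|W).
candidates = {
    "recognize speech": {"lm_prob": 0.020, "acoustic_likelihood": 1.0e-9},
    "wreck a nice beach": {"lm_prob": 0.001, "acoustic_likelihood": 3.0e-9},
    "recognize peach": {"lm_prob": 0.002, "acoustic_likelihood": 2.0e-9},
}


def log_score(entry):
    """log P(W) + log P(A|W); logs avoid underflow when probabilities are tiny."""
    return math.log(entry["lm_prob"]) + math.log(entry["acoustic_likelihood"])


# arg max over W of P(W) * P(A|W), computed in the log domain.
best = max(candidates, key=lambda w: log_score(candidates[w]))
print(best)
```

In this toy example the second candidate has the highest acoustic likelihood, but the language model prior makes the first candidate the overall best match.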
The acoustic models are typically probabilistic state sequence models such as hidden Markov models (HMMs) that model speech sounds using, for example, mixtures of probability distribution functions (Gaussians) or neural networks. Acoustic models often represent phonemes in specific contexts, referred to as PELs (Phonetic Elements), e.g. triphones or phonemes with known left and/or right contexts. State sequence models can be scaled up to represent words as connected sequences of acoustically modeled phonemes, and phrases or sentences as connected sequences of words. When the models are organized together as words, phrases, and sentences, additional language-related information is also typically incorporated into the models in the form of a statistical language model.
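To illustrate how a state sequence model can be matched against a sequence of speech frames, the following sketch finds the most likely state path through a small left-to-right HMM. The text above does not name a decoding algorithm; this example uses the standard Viterbi algorithm as one common choice. The three-state topology, the transition probabilities, and the per-frame likelihoods are all hypothetical, and in a real system the likelihoods would come from Gaussian mixtures or a neural network.

```python
import math


def ln(p):
    """Natural log that maps zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state sequence and its log probability.

    log_init[s]     : log probability of starting in state s
    log_trans[r][s] : log probability of moving from state r to state s
    log_emit[t][s]  : log likelihood of frame t under state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptrs = []
    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for s in range(n_states):
            # Best predecessor state for landing in state s at frame t.
            best_prev = max(range(n_states), key=lambda r: score[r] + log_trans[r][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s] + log_emit[t][s])
            ptr.append(best_prev)
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]


# Hypothetical 3-state left-to-right model (e.g., one phonetic element).
init = [1.0, 0.0, 0.0]
trans = [[0.6, 0.4, 0.0],
         [0.0, 0.6, 0.4],
         [0.0, 0.0, 1.0]]
# Made-up likelihood of each of 4 frames under each state.
emit = [[0.9, 0.1, 0.1],
        [0.8, 0.2, 0.1],
        [0.1, 0.9, 0.1],
        [0.1, 0.1, 0.9]]

path, logp = viterbi([ln(p) for p in init],
                     [[ln(p) for p in row] for row in trans],
                     [[ln(p) for p in row] for row in emit])
print(path)
```

The same search extends in principle to larger networks in which phoneme models are connected into words and words into phrases, with language model probabilities applied at word boundaries.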
The words or phrases associated with the best matching model structures are referred to as recognition candidates or hypotheses. A system may produce a single best recognition candidate—the recognition result—or multiple recognition hypotheses in various forms such as an N-best list, a recognition lattice, or a confusion network. Further details regarding continuous speech recognition are provided in U.S. Pat. No. 5,794,189, entitled “Continuous Speech Recognition,” and U.S. Pat. No. 6,167,377, entitled “Speech Recognition Language Models,” the contents of which are incorporated herein by reference.
In recent years, the functionality provided on mobile devices by speech recognition technology has expanded significantly beyond mere text entry and searching to include intelligent personal assistant (IPA) systems. These systems use ASR and combine user inputs from speech and/or text with context, location information, and other information sources to carry out useful services for the user. IPA applications such as Apple's Siri and Nuance's Nina products have become widely available in contemporary smartphone devices. FIGS. 1A-C show various example screenshots of the application interface 100 from one such IPA application, Nuance Nina, used in a conversational dialog with the user to arrange for payment of a credit card bill from the user's checking account.
FIG. 2 shows various elements in a typical client-server IPA arrangement for use with mobile devices; for example, a cloud-based computing arrangement using cloud-based services. A user interface 201 on mobile device 200 receives an initially unknown speech input signal 208 from a user. A local/remote controller 204 generates a representation of the speech input 208 and local ASR processor 202 uses local recognition data sources 203 to perform local ASR processing of the speech input signal to determine local ASR results corresponding to the speech input. Local/remote controller 204 sends the speech input representations and/or the local recognition results over a wireless communication network 205 to the remote server 206 for remote ASR/IPA processing. The server ASR 212 uses server ASR data sources 207 to perform remote ASR processing and passes the recognition results over to the server IPA 209 which also accesses other applications 210 and other data sources 211 to perform actions based on the user input 208 and pass the results back through the remote server 206 to the mobile device 200 for display on the user interface 201.
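The data flow described above can be sketched as follows. This is only an illustrative outline under stated assumptions: every class name, function name, and string below is hypothetical, the network transfer is replaced by a direct function call, and the stub recognizers return fixed placeholder text rather than performing any recognition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    """What the local/remote controller (204) sends over the network (205)."""
    audio_repr: bytes
    local_result: Optional[str]


def local_asr(audio_repr: bytes) -> Optional[str]:
    """Stand-in for local ASR processor 202; returns a placeholder transcript."""
    return "pay my bill" if audio_repr else None


def server_asr(request: Request) -> str:
    """Stand-in for server ASR 212; here it simply extends the local result."""
    return (request.local_result or "") + " from checking"


def server_ipa(text: str) -> str:
    """Stand-in for server IPA 209; maps recognized text to a response."""
    return f"OK, scheduling: {text.strip()}"


def handle_speech(audio_repr: bytes) -> str:
    """Mobile-side flow: local ASR, then remote ASR/IPA, then display text."""
    request = Request(audio_repr=audio_repr, local_result=local_asr(audio_repr))
    recognized = server_asr(request)   # would run on remote server 206
    return server_ipa(recognized)      # result returned for user interface 201


response = handle_speech(b"\x00\x01")
print(response)
```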
While the specific arrangement shown in FIG. 2 might suggest that all the various server-side components are in a single common location, that is just one specific cloud-based client-server IPA arrangement. The present discussion and the invention described and claimed herein are not limited to that specific topology; in other topologies, for example, individual components may be in different locations and communicate with each other in a cloud-based arrangement (i.e., via the Internet).
One of the challenges of client-server IPA arrangements is the inherent response latency in the various system components. Specifically, there are three main sources of system latency: (1) ASR latency, (2) IPA latency, and (3) network latency. The speech recognition process requires a significant amount of audio (corresponding to several words) before it can produce recognition text that matches the input speech with a high degree of probability, thereby providing one latency component. The IPA process contributes another latency component as it processes the user input and interacts with other applications and data sources. In addition, the remote server arrangement creates a further response latency that reflects data transfer delays occurring over the communications network. The combined effects of all these response latencies can be minimized to some degree, but they cannot be entirely eliminated due to algorithmic limitations in the IPA process and the speech recognition process, and physical limitations on computer network speed. Still, it is very desirable to minimize the effects of response latencies for the user.
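As a simple worked example, the three latency components can be combined into an overall response latency. The sketch below assumes the components occur one after another so that they add, and all millisecond values are hypothetical figures chosen only to make the arithmetic easy to follow.

```python
# Hypothetical per-component latencies in milliseconds (illustrative values only).
latency_ms = {"asr": 400, "ipa": 250, "network": 150}

# Assumption: the components are serial, so the user-visible delay is their sum.
total_ms = sum(latency_ms.values())

# Fraction of the total delay contributed by each component.
shares = {name: ms / total_ms for name, ms in latency_ms.items()}

print(total_ms)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```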
In a real-time IPA application, the effects of response latencies on the user are twofold. First, the user has no clear picture of the current state of the IPA system. If an utterance has been spoken, but the system response has not yet appeared on the user interface, the system presents an undefined state to the user. For all the user knows, the system may have failed to record the audio, the network connection may have been interrupted in a server-based speech recognition system, the speech recognition engine may have failed to produce output text, the IPA process may be hung up, or there may be a delay and results may be produced eventually. Second, the user cannot continue with workflow tasks until the results from the pending input utterance have been completely processed and the user interface has been updated.
U.S. Patent Publication 20120216134 describes one existing approach for dealing with speech recognition response latencies by providing the user with partial recognition results as the recognition process progresses. Partial results are words that the recognizer considers the most probable at a given instant during the recognition process. As such, partial results are subject to change, and the latency reduction is only apparent: the user gets a sense of low latency, but the actual speech recognition latency is not reduced.
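The behavior of partial results can be illustrated with a small simulation. This sketch is not taken from the cited publication; the timestamps and hypothesis strings are hypothetical and stand in for what a recognizer might report as more audio arrives. It shows that the first feedback arrives early, that an earlier partial result may later be revised, and that the time of the final result is unchanged.

```python
# Hypothetical recognizer output: (time in seconds when available, hypothesis text).
simulated_partials = [
    (0.5, "pay"),
    (1.0, "pay mobile"),
    (1.5, "pay my bill"),
    (2.0, "pay my bill from checking"),
]

first_feedback_s = simulated_partials[0][0]   # when the user first sees any text
final_result_s = simulated_partials[-1][0]    # when the final result is available

# A partial result counts as revised if the next hypothesis does not extend it.
revised = [
    earlier
    for (_, earlier), (_, later) in zip(simulated_partials, simulated_partials[1:])
    if not later.startswith(earlier)
]

print(first_feedback_s, final_result_s, revised)
```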