An automatic speech recognition (ASR) system determines a semantic meaning of a speech input. Typically, the input speech is processed into a sequence of digital speech feature frames. Each speech feature frame can be thought of as a multi-dimensional vector that represents various characteristics of the speech signal present during a short time window of the speech. For example, the multi-dimensional vector of each speech frame can be derived from cepstral features of the short-time Fourier transform spectrum of the speech signal (mel-frequency cepstral coefficients, or MFCCs), that is, the short-time power or component of a given frequency band, as well as the corresponding first- and second-order derivatives (“deltas” and “delta-deltas”). In a continuous recognition system, variable numbers of speech frames are organized as “utterances” representing a period of speech followed by a pause, which in real life loosely corresponds to a spoken sentence or phrase.
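As an illustration of how deltas and delta-deltas extend a frame, the following minimal sketch appends first- and second-order differences to each static feature vector. It assumes a simple central difference with clamped edges; practical front ends often use a regression over a wider window instead. The function names and the toy two-dimensional frames are hypothetical and chosen only for this example.

```python
from typing import List

Frame = List[float]


def deltas(frames: List[Frame]) -> List[Frame]:
    """Central difference between neighbouring frames (edges are clamped)."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_dynamic_features(static: List[Frame]) -> List[Frame]:
    """Concatenate static features, deltas, and delta-deltas per frame."""
    d1 = deltas(static)   # first-order derivatives ("deltas")
    d2 = deltas(d1)       # second-order derivatives ("delta-deltas")
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Toy input: four frames with two static coefficients each (assumed values).
static = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
features = add_dynamic_features(static)
```

Each output frame is three times as long as its static input: the original coefficients, followed by their deltas, followed by their delta-deltas.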
The ASR system compares the input utterances against statistical acoustic models to find the models that best match the vector sequence characteristics, and determines the representative text associated with those acoustic models. More formally, given some input observations A, the probability that some string of words W was spoken is represented as P(W|A), where the ASR system attempts to determine the most likely word string:
$$\hat{W} = \arg\max_{W} P(W \mid A)$$
Given a system of statistical acoustic models, this formula can be re-expressed as:
$$\hat{W} = \arg\max_{W} P(W)\,P(A \mid W)$$
where P(A|W) corresponds to the acoustic models and P(W) reflects the prior probability of the word sequence as provided by a statistical language model.
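The re-expression follows from Bayes' rule, P(W|A) = P(W) P(A|W) / P(A), because P(A) does not depend on W and can be dropped from the maximization. The sketch below applies that decision rule to a small, closed set of candidate word strings, working in log-probabilities to avoid numerical underflow. The candidate strings and all probability values are made up for illustration, and `decode` is a hypothetical helper, not part of any particular recognizer.

```python
import math
from typing import Dict


def decode(acoustic_logp: Dict[str, float], lm_logp: Dict[str, float]) -> str:
    """Return the word string W that maximizes log P(W) + log P(A|W)."""
    return max(acoustic_logp, key=lambda w: lm_logp[w] + acoustic_logp[w])


# Assumed scores for two competing hypotheses of the same audio A.
acoustic_logp = {  # log P(A|W), from the acoustic models
    "recognize speech": math.log(0.30),
    "wreck a nice beach": math.log(0.35),
}
lm_logp = {  # log P(W), from the statistical language model
    "recognize speech": math.log(0.010),
    "wreck a nice beach": math.log(0.001),
}

best = decode(acoustic_logp, lm_logp)
```

In this toy case the second hypothesis has the slightly better acoustic score, but the language model prior outweighs it, so the first hypothesis is selected.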
The acoustic models are typically probabilistic state sequence models such as hidden Markov models (HMMs) that model speech sounds using mixtures of probability distribution functions (Gaussians). Acoustic models often represent phonemes in specific contexts, referred to as PELs (Phonetic Elements), e.g. triphones or phonemes with known left and/or right contexts. State sequence models can be scaled up to represent words as connected sequences of acoustically modeled phonemes, and phrases or sentences as connected sequences of words. When the models are organized together as words, phrases, and sentences, additional language-related information is also typically incorporated into the models in the form of a statistical language model.
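To make the state sequence idea concrete, the following minimal sketch finds the most likely state path through a small HMM using the Viterbi algorithm. It is simplified in two labeled ways: each state emits from a single one-dimensional Gaussian rather than a mixture, and the two-state model with its parameters is invented for the example. All names are hypothetical.

```python
import math
from typing import List, Tuple

NEG_INF = float("-inf")


def log_gauss(x: float, mean: float, var: float) -> float:
    """Log density of a 1-D Gaussian (stand-in for a per-state mixture)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi(obs: List[float], means: List[float], variances: List[float],
            log_trans: List[List[float]],
            log_init: List[float]) -> Tuple[float, List[int]]:
    """Return the log score and state sequence of the best path."""
    n_states = len(means)
    score = [log_init[s] + log_gauss(obs[0], means[s], variances[s])
             for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for x in obs[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            prev = max(range(n_states),
                       key=lambda p: score[p] + log_trans[p][s])
            pointers.append(prev)
            new_score.append(score[prev] + log_trans[prev][s]
                             + log_gauss(x, means[s], variances[s]))
        score = new_score
        back.append(pointers)
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


# Assumed left-to-right model: state 0 may stay or advance, state 1 only stays.
means, variances = [0.0, 5.0], [1.0, 1.0]
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)],
             [NEG_INF, 0.0]]
log_score, path = viterbi([0.1, -0.2, 4.8, 5.1],
                          means, variances, log_trans, log_init)
```

The first two observations lie near the mean of state 0 and the last two near the mean of state 1, so the best path stays in state 0 for two frames and then moves to state 1.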
The words or phrases associated with the best matching model structures are referred to as recognition candidates or hypotheses. A system may produce a single best recognition candidate—the recognition result—or multiple recognition hypotheses in various forms such as an N-best list, a recognition lattice, or a confusion network. Further details regarding continuous speech recognition are provided in U.S. Pat. No. 5,794,189, entitled “Continuous Speech Recognition,” and U.S. Pat. No. 6,167,377, entitled “Speech Recognition Language Models,” the contents of which are incorporated herein by reference.
Recently, ASR technology has advanced enough to support applications implemented on the limited footprint of a mobile device. This can involve a somewhat limited stand-alone ASR arrangement on the mobile device. Alternatively, more extensive capability can be provided in a client-server arrangement in which the local mobile device does initial processing of speech inputs, and possibly some local ASR recognition processing, while the main ASR processing is performed at a remote server with greater resources. The recognition results are then returned for use at the mobile device.
U.S. Patent Publication 20110054899 describes a hybrid client-server ASR arrangement for a mobile device in which speech recognition may be performed locally by the device and/or remotely by a remote ASR server depending on one or more criteria such as time, policy, confidence score, network availability, and the like. FIG. 1A shows an example screen shot of the initial prompt interface from one such mobile device ASR application, Dragon Dictation™ for iPhone™, which processes unprompted speech inputs and produces representative text output. FIG. 1B shows a screen shot of the recording interface for Dragon Dictation™ for iPhone™. FIG. 1C shows an example screen shot of the results interface produced for the ASR results by Dragon Dictation™ for iPhone™.
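The criteria-based choice between local and remote recognition described above might be sketched as follows. This is not the method of the cited publication; it is a hypothetical illustration that uses three of the listed criteria (policy, network availability, and confidence score). The `LocalResult` type, the `choose_recognizer` function, and the 0.8 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class LocalResult:
    """Output of an on-device recognition pass (hypothetical structure)."""
    text: str
    confidence: float  # assumed to lie in the range 0.0 to 1.0


def choose_recognizer(local: LocalResult,
                      network_available: bool,
                      policy_allows_remote: bool,
                      min_confidence: float = 0.8) -> str:
    """Decide whether to keep the local result or defer to the server."""
    # Without a network, or when policy forbids it, only local ASR is possible.
    if not network_available or not policy_allows_remote:
        return "local"
    # A sufficiently confident local result avoids the server round trip.
    if local.confidence >= min_confidence:
        return "local"
    return "remote"


result = LocalResult(text="call home", confidence=0.55)
decision = choose_recognizer(result, network_available=True,
                             policy_allows_remote=True)
```

With a low-confidence local result and a usable network, the sketch defers to the remote server; if the network is unavailable, it falls back to the local result regardless of confidence.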
FIGS. 2A-C show some example screen shots of another mobile device application, Dragon Mobile Assistant™, which processes speech query inputs and obtains simultaneous search results from a variety of top websites and content sources. Such applications require adding a natural language understanding (NLU) component to an existing web search algorithm in order to extract semantic meaning from the input queries. This can involve using approximate string matching to discover semantic template structures. One or more semantic meanings can be assigned to each semantic template. Parsing rules and classifier training samples can be generated and used to train NLU models that determine query interpretations (sometimes referred to as query intents). Currently, a dialog application such as Dragon Mobile Assistant™ can only handle one dialog task at a time. Once a dialog task is started, it must be finished or cancelled before another conversation can start. Performing two tasks that use the same objects means that anaphora need to be resolved, which is complicated both for the user and on the server side. Also, it is impractical to advance more than one task at the same time.