Automatic speech recognition (ASR) systems try to determine a representative meaning (e.g., text) corresponding to speech inputs. FIG. 1 shows various hardware components of a typical ASR system such as a dictation system on a user desktop. A computer system 10 includes a speech input microphone 11 which is connected through a suitable preamplifier 13 to an analog-to-digital (A/D) converter 15. A front-end pre-processor 17 typically performs a Fourier transform so as to extract spectral features that characterize the input speech as a sequence of representative multi-dimensional vectors, and may perform further analysis and adaptation in a derived feature space. A speech recognition processor 12, e.g., an Intel Core i7 processor or the like, is programmed to run one or more specialized computer software processes to determine a recognition output corresponding to the speech input. To that end, processor memory 120, e.g., random access memory (RAM) and/or read-only memory (ROM), stores the speech processing software routines, the speech recognition models, and data for use by the speech recognition processor 12. The recognition output may be displayed, for example, as representative text on computer workstation display 14. Such a computer workstation would also typically include a keyboard 16 and a mouse 18 for user interaction with the system 10. Of course, many other typical arrangements are also familiar, such as an ASR system implemented for a mobile device such as a cell phone, ASR for the passenger compartment of an automobile, client-server based ASR, etc.
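The front-end processing described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the sampling rate, frame and hop sizes, window choice, and log-magnitude features are all assumed parameters chosen for clarity (real front ends typically add mel filtering and cepstral transforms).

```python
# Minimal sketch of ASR front-end feature extraction: frame the signal,
# window each frame, take a Fourier transform, and keep log-magnitude
# spectral features. All parameters here are illustrative assumptions.
import numpy as np

def extract_features(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Turn a 1-D speech signal into a sequence of spectral feature vectors."""
    frame_len = sample_rate * frame_ms // 1000   # 400 samples per frame
    hop_len = sample_rate * hop_ms // 1000       # 160 samples between frames
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frame = samples[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))     # magnitude spectrum
        frames.append(np.log(spectrum + 1e-10))   # log compression
    return np.array(frames)  # shape: (num_frames, frame_len // 2 + 1)

# One second of (noise) input becomes a sequence of multi-dimensional vectors.
feats = extract_features(np.random.randn(16000))
print(feats.shape)  # (98, 201)
```

Each row of the result is one of the "representative multi-dimensional vectors" that the recognizer scores against its acoustic models.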
The ASR system compares the input utterances to find statistical acoustic models that best match the vector sequence characteristics and determines corresponding representative text associated with the acoustic models. More formally, given some input observations A, the probability that a particular string of words W was spoken is represented as P(W|A), where the ASR system attempts to determine the most likely word string:
Ŵ = argmax_W P(W|A)

Given a system of statistical acoustic models, this formula can be re-expressed as:
Ŵ = argmax_W P(W) P(A|W)

where P(A|W) corresponds to the acoustic models and P(W) represents the value of a statistical language model reflecting the probability of a given word in the recognition vocabulary occurring.
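The decision rule above can be made concrete with a toy example. The candidate word strings and their log-probabilities below are entirely made up for illustration; in a real recognizer the scores come from the acoustic models and the statistical language model.

```python
# Toy illustration of the decision rule: W_hat = argmax_W P(W) * P(A|W).
# Scores are hypothetical; products become sums in the log domain.
candidates = {
    "recognize speech":   {"log_p_a_given_w": -12.0, "log_p_w": -4.0},
    "wreck a nice beach": {"log_p_a_given_w": -11.5, "log_p_w": -9.0},
    "recognized peach":   {"log_p_a_given_w": -13.0, "log_p_w": -7.5},
}

def total_log_score(scores):
    # log P(W) + log P(A|W)
    return scores["log_p_w"] + scores["log_p_a_given_w"]

w_hat = max(candidates, key=lambda w: total_log_score(candidates[w]))
print(w_hat)  # "recognize speech" (total -16.0 beats -20.5 and -20.5)
```

Note how the language model term P(W) can overrule a slightly better acoustic match: "wreck a nice beach" has the best acoustic score here but loses on its language model score.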
The acoustic models are typically probabilistic state sequence models such as hidden Markov models (HMMs) that model speech sounds using mixtures of probability distribution functions (Gaussians). Acoustic models often represent phonemes in specific contexts, referred to as PELs (Phonetic Elements), e.g., triphones or phonemes with known left and/or right contexts. State sequence models can be scaled up to represent words as connected sequences of acoustically modeled phonemes, and phrases or sentences as connected sequences of words. When the models are organized together as words, phrases, and sentences, additional language-related information is also typically incorporated into the models in the form of a statistical language model.
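The Gaussian-mixture emission density used by such HMM states can be sketched as below. This is a one-dimensional illustration with assumed mixture weights, means, and variances; real acoustic models use many multi-dimensional Gaussians per state.

```python
# Sketch of a Gaussian-mixture emission density b(x) for one HMM state:
# b(x) = sum_k w_k * N(x; mu_k, sigma_k^2). Values are illustrative only.
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_likelihood(x, weights, means, variances):
    """Likelihood of a 1-D feature value x under a Gaussian mixture."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# A two-component mixture evaluated at one feature value.
b = mixture_likelihood(0.5, weights=[0.6, 0.4],
                       means=[0.0, 1.0], variances=[1.0, 0.5])
print(round(b, 4))
```

During recognition, such per-state likelihoods are combined with the HMM transition probabilities (e.g., by the Viterbi algorithm) to score how well a model sequence matches the observed feature vectors.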
The words or phrases associated with the best matching model structures are referred to as recognition candidates or hypotheses. A system may produce a single best recognition candidate—the recognition result—or multiple recognition hypotheses in various forms such as an N-best list, a recognition lattice, or a confusion network. Further details regarding continuous speech recognition are provided in U.S. Pat. No. 5,794,189, entitled “Continuous Speech Recognition,” and U.S. Pat. No. 6,167,377, entitled “Speech Recognition Language Models,” the contents of which are incorporated herein by reference.
Many current speech recognition applications can benefit from long-term speaker adaptation using speaker logs, and for that purpose, discriminative methods present a promising approach given their previous successes in acoustic model training. Large-vocabulary speech recognition experiments have investigated feature-space and model-space discriminative adaptation methods for long-term speaker adaptation. The experimental results suggest that although on average discriminative adaptation does not obtain a large gain over the maximum likelihood (ML)-based baseline, some test speakers still receive significant improvement. Speakers with high error rates under the speaker-independent model tend to have larger gains with discriminative adaptation. These findings reveal that using discriminative methods for long-term speaker adaptation can provide advantages for speech recognition systems. However, it is expensive to run adaptation for all speakers.