An automatic speech recognition (ASR) system tries to determine a representative meaning (e.g., text) corresponding to input speech. Typically, the input speech is processed into a sequence of digital frames. Each frame can be thought of as a multi-dimensional vector that represents various characteristics of the speech signal present during a short time window of the speech. In a continuous recognition system, variable numbers of frames are organized as “utterances,” each representing a period of speech followed by a pause, which in real life loosely corresponds to a spoken sentence or phrase.
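The framing step described above can be illustrated with a minimal sketch. The 25 ms window, 10 ms hop, 16 kHz sample rate, and the two toy features (log energy and zero-crossing rate) are assumptions chosen for illustration; practical systems generally use richer spectral features, and the names `frame_signal` and `frame_features` are hypothetical.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a sample sequence into overlapping frames of frame_len samples."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

def frame_features(frame):
    """Toy two-dimensional feature vector: log energy and zero-crossing rate."""
    energy = sum(x * x for x in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small offset avoids log(0) on silence
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zcr = crossings / (len(frame) - 1)
    return [log_energy, zcr]

# Assumed test input: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

# 400 samples = 25 ms window, 160 samples = 10 ms hop (assumed values).
frames = frame_signal(signal, frame_len=400, hop=160)
features = [frame_features(f) for f in frames]
```

Each entry of `features` plays the role of one frame vector; an utterance would then be a variable-length run of such vectors bounded by a pause.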
The system compares the input utterances against acoustic models to find those that best match the vector sequence characteristics, and determines the representative text associated with those acoustic models. Modern acoustic models typically use state sequence models such as Hidden Markov Models that model speech sounds (usually phonemes) using mixtures of probability distribution functions, typically Gaussians. Phoneme models often represent phonemes in specific contexts, referred to as context-dependent phonemes, e.g., triphones or phonemes with known left and/or right contexts.
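As a rough illustration of how a mixture of Gaussians can score a frame against a model state, the following sketch computes the log likelihood of a frame vector under a diagonal-covariance Gaussian mixture. The diagonal-covariance form, the function names, and the two one-dimensional example states are assumptions made for illustration only.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_gmm(x, weights, means, variances):
    """Log likelihood of x under a Gaussian mixture, computed with log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)  # subtracting the largest term keeps exp() from underflowing
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# Hypothetical one-dimensional models for two states.
state_a = {"weights": [0.5, 0.5], "means": [[0.0], [1.0]], "variances": [[1.0], [1.0]]}
state_b = {"weights": [1.0], "means": [[5.0]], "variances": [[1.0]]}

frame = [0.2]
score_a = log_gmm(frame, **state_a)
score_b = log_gmm(frame, **state_b)
best_state = "a" if score_a > score_b else "b"
```

In a Hidden Markov Model, per-frame scores of this kind would be combined with state transition probabilities over the whole frame sequence; that search is omitted here.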
State sequence models can be scaled up to represent words as connected sequences of acoustically modeled phonemes, and phrases or sentences as connected sequences of words. When the models are organized together as words, phrases, and sentences, additional language-related information is also typically incorporated into the models in the form of language modeling.
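The way words are built from phoneme models and then weighted by language-related information can be sketched as follows. The pronunciation lexicon, the phoneme symbols, the bigram probabilities, and the floor value for unseen word pairs are all hypothetical values invented for this example; a bigram model is only one simple form of language modeling.

```python
import math

# Hypothetical pronunciation lexicon: each word is a sequence of phonemes.
LEXICON = {
    "call": ["k", "ao", "l"],
    "home": ["hh", "ow", "m"],
    "hold": ["hh", "ow", "l", "d"],
}

# Hypothetical bigram probabilities P(word | previous word); "<s>" marks the start.
BIGRAM = {
    ("<s>", "call"): 0.6,
    ("call", "home"): 0.7,
    ("call", "hold"): 0.1,
}

def phoneme_sequence(words):
    """Concatenate the phoneme sequences of the words into one model sequence."""
    return [phoneme for word in words for phoneme in LEXICON[word]]

def lm_log_prob(words, floor=1e-4):
    """Bigram log probability of a word sequence; unseen pairs get a floor value."""
    history = "<s>"
    total = 0.0
    for word in words:
        total += math.log(BIGRAM.get((history, word), floor))
        history = word
    return total

hypothesis_1 = ["call", "home"]
hypothesis_2 = ["call", "hold"]
phones_1 = phoneme_sequence(hypothesis_1)
lm_1 = lm_log_prob(hypothesis_1)
lm_2 = lm_log_prob(hypothesis_2)
```

Here the two hypotheses share most of their phonemes, so their acoustic scores could be close; the language model term is what would favor the more probable word sequence.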
The words or phrases associated with the best matching model structures are referred to as recognition candidates or hypotheses. A system may produce a single best recognition candidate—the recognition result—or multiple recognition hypotheses in various forms such as an N-best list, a recognition lattice, or a confusion network. Further details regarding continuous speech recognition are provided in U.S. Pat. No. 5,794,189, entitled “Continuous Speech Recognition,” and U.S. Pat. No. 6,167,377, entitled “Speech Recognition Language Models,” the contents of which are incorporated herein by reference.
Some ASR arrangements are generic to a broad class of speakers, input channels (e.g., microphone type), and acoustic environments; this is referred to as speaker independent (SI) and channel/environment independent speech recognition. Such systems can be specialized to a specific speaker, speaker group, speaking style, channel, or environment by means of adaptation. For example, an SI arrangement can be adapted for a specific individual speaker to create a speaker dependent (SD) arrangement. Typically, the acoustic model component of the recognizer is adapted, but some language model adaptation techniques have also been proposed and successfully deployed.
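To make the idea of acoustic model adaptation concrete, the sketch below shows one well-known approach, a maximum a posteriori (MAP) style update of a single Gaussian mean. The text above does not name a particular adaptation technique, so this choice is an assumption for illustration, as are the function name `map_adapt_mean`, the prior weight `tau`, and the simplification that every adaptation frame is assigned to the one Gaussian being updated.

```python
def map_adapt_mean(prior_mean, frames, tau=10.0):
    """Interpolate an SI mean with speaker data: (tau*prior + sum(x)) / (tau + n).

    prior_mean: mean vector of the speaker independent Gaussian.
    frames:     adaptation frame vectors assigned to this Gaussian (assumed given).
    tau:        assumed prior weight; larger values keep the result nearer the SI mean.
    """
    n = len(frames)
    dim = len(prior_mean)
    sums = [sum(frame[d] for frame in frames) for d in range(dim)]
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]

si_mean = [0.0, 0.0]

# With little speaker data the adapted mean stays close to the SI mean;
# with more data it moves toward the speaker's own average.
few_frames = [[1.0, 2.0]] * 10
many_frames = [[1.0, 2.0]] * 90
adapted_few = map_adapt_mean(si_mean, few_frames)
adapted_many = map_adapt_mean(si_mean, many_frames)
```

The more adaptation data a speaker provides, the further the model moves away from the speaker independent starting point and toward that speaker.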
It is well known, however, that the performance of an adapted speech recognizer degrades whenever there is a mismatch between the conditions represented in the adaptation data (speaker, speaker group, channel, speaking style, acoustic environment) and the input speech data that the recognizer actually faces in application. An adapted SD recognizer, for example, degrades strongly in performance when applied to a speaker other than the one it was adapted to.
In many speech recognition applications such as mobile handset ASR or server-based speech recognition, a speaker often cannot be correctly identified. Still, such applications usually have a main or prime user, and adapting the application to the prime user is desirable to improve recognition accuracy. Hence, an adapted system should have some means of achieving only minor degradation for general speakers (“guest speakers”) despite the adaptation to the prime user.