A speech recognition system determines representative text corresponding to input speech. Typically, the input speech is processed into a sequence of digital frames. Each frame can be thought of as a multi-dimensional vector that represents various characteristics of the speech signal present during a short time window of the speech. In a continuous recognition system, variable numbers of frames are organized as “utterances” representing a period of speech followed by a pause, which in real life loosely corresponds to a spoken sentence or phrase.
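The framing step described above can be sketched as follows. This is a minimal illustration only: the function names, the window and hop parameters, and the two toy features (log energy and zero-crossing count) are assumptions made for this example, and practical systems typically compute richer features such as cepstral coefficients.

```python
import math
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Slice a sampled signal into fixed-length windows that start every `hop` samples."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    start = 0
    # Keep only complete windows; a trailing partial window is dropped.
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += hop
    return frames


def frame_features(frame: Sequence[float]) -> List[float]:
    """Toy 2-dimensional feature vector (hypothetical): log energy and zero-crossing count."""
    energy = sum(s * s for s in frame)
    log_energy = math.log(energy + 1e-10)  # small floor avoids log(0) on silence
    zero_crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [log_energy, float(zero_crossings)]
```

Each window yields one vector, so an utterance becomes a variable-length sequence of such vectors.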
The system compares the input utterances against acoustic models to find the models that best match the frame characteristics, and determines the representative text associated with those models. Modern acoustic models typically use Hidden Markov Models and model speech sounds (usually phonemes) using mixtures of Gaussians. Often these phoneme models represent phonemes in specific contexts, referred to as PELs (Phonetic Elements), e.g., triphones or phonemes with known left and/or right contexts.
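As an illustration of how a single frame might be scored against a mixture of Gaussians, the sketch below computes the log-likelihood of one feature vector. Diagonal covariances are an assumption of this example (the text above does not specify the covariance structure), and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def log_gaussian_diag(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of x under a Gaussian with diagonal covariance (assumed here)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def log_gmm(
    x: Sequence[float],
    weights: Sequence[float],
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
) -> float:
    """Log-likelihood of one frame under a Gaussian mixture, via log-sum-exp."""
    logs: List[float] = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(logs)  # subtract the largest term for numerical stability
    return peak + math.log(sum(math.exp(value - peak) for value in logs))
```

A frame is then associated with whichever model state gives it the highest score, with the Hidden Markov Model constraining which state sequences are allowed.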
State sequence models can be scaled up to represent words as connected sequences of acoustically modeled phonemes, and phrases or sentences as connected sequences of words. When the models are organized together as words, phrases, and sentences, additional language-related information is also typically incorporated into the models in the form of language modeling.
The words or phrases associated with the best matching model structures are referred to as recognition candidates or hypotheses. A system may produce a single best recognition candidate—the recognition result—or a list of several hypotheses, referred to as an N-best list. Further details regarding continuous speech recognition are provided in U.S. Pat. No. 5,794,189, entitled “Continuous Speech Recognition,” and U.S. Pat. No. 6,167,377, entitled “Speech Recognition Language Models,” the contents of which are incorporated herein by reference.
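The relationship between the recognition result and an N-best list can be sketched as below. The representation of a hypothesis as a (text, score) pair with higher scores being better is an assumption of this example, as are the function names.

```python
import heapq
from typing import List, Tuple

Hypothesis = Tuple[str, float]  # (candidate text, score); higher is better (assumed)


def n_best(hypotheses: List[Hypothesis], n: int) -> List[Hypothesis]:
    """Return the n highest-scoring hypotheses, best first."""
    return heapq.nlargest(n, hypotheses, key=lambda h: h[1])


def best_result(hypotheses: List[Hypothesis]) -> str:
    """The single best candidate is simply the head of a 1-best list."""
    return n_best(hypotheses, 1)[0][0]
```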
Speech recognition can be classified as being either speaker independent or speaker dependent. The models in a speaker dependent system are specific to an individual user. Known speech inputs from the user are used to adapt a set of initially generic recognition models to specific speech characteristics of that user. The speaker adapted models form the basis for a user profile to perform speaker dependent or speaker adapted speech recognition for that user.
Speaker dependent systems traditionally use an enrollment procedure to initially create a user profile and a corresponding set of adapted models before a new user can use the system to recognize unknown inputs. During the enrollment procedure, the new user inputs speech corresponding to a known source script that is provided, and the acoustic models are adapted to the specific speech characteristics of that user. These adapted models form the main portion of the user profile and are used to perform post-enrollment speech recognition for that user. Further details regarding speech recognition enrollment are provided in U.S. Pat. No. 6,424,943, entitled “Non-Interactive Enrollment in Speech Recognition,” which is incorporated herein by reference.
Speaker dependent speech recognition systems running on modern desktops use adaptation at many levels simultaneously to improve accuracy and recognition speed. Some of these techniques, such as cepstral normalization, histogram normalization, or speaker adaptive training (SAT), operate directly on the input speech feature stream. Others, such as maximum likelihood linear regression (MLLR) and maximum a posteriori parameter estimation (MAP), operate by transforming speech recognition models to better fit the incoming signal.
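As one example of a technique that operates on the feature stream, the sketch below shows a simple per-utterance cepstral mean normalization. It is an illustrative assumption that normalization is done per utterance and on the mean only; the function name is hypothetical.

```python
from typing import List, Sequence


def cepstral_mean_normalize(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Subtract the per-dimension mean over the utterance from every frame."""
    if not frames:
        return []
    count = len(frames)
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / count for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]
```

Because a constant offset added to every frame shifts the mean by the same amount, it cancels out after normalization, which is why this kind of processing can reduce sensitivity to a fixed channel or microphone characteristic.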
One typical use of MLLR adaptation in speech recognition has been to transform sets of Gaussian mixture models that share some property, such as the same center phoneme. These sets are referred to as “classes,” and MLLR in this context can be thought of as class-based MLLR.
One specific form of MLLR, constrained MLLR (cMLLR), has been used for several years in state of the art recognition systems. In contrast to generic MLLR, cMLLR constrains the linear transformations to modify both the model means and variances in a consistent fashion. The resulting transformation can then be inverted and applied in the feature space rather than in the model space. One specific example of the use of cMLLR is an online unsupervised feature space adaptation (OUFA) technique, described further in U.S. patent application Ser. No. 11/478,837, entitled “Non-Enrolled Continuous Dictation,” the contents of which are incorporated herein by reference.
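The equivalence between the model-space and feature-space views of a constrained transform can be checked numerically. The sketch below is a simplification under stated assumptions: it uses a diagonal transform (a per-dimension scale and offset) and diagonal-covariance Gaussians, whereas a practical cMLLR transform would generally be a full matrix, and it does not show how the transform is estimated. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def log_gaussian_diag(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of x under a diagonal-covariance Gaussian."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def adapt_features(x: Sequence[float], scale: Sequence[float], offset: Sequence[float]) -> List[float]:
    """Feature-space view: transform each incoming feature as a * x + b."""
    return [a * xi + b for xi, a, b in zip(x, scale, offset)]


def adapt_model(
    mean: Sequence[float], var: Sequence[float],
    scale: Sequence[float], offset: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Model-space view: apply the inverse transform consistently to mean and variance."""
    new_mean = [(m - b) / a for m, a, b in zip(mean, scale, offset)]
    new_var = [v / (a * a) for v, a in zip(var, scale)]
    return new_mean, new_var


def log_jacobian(scale: Sequence[float]) -> float:
    """Log |det| of the feature transform, needed to make the two views agree exactly."""
    return sum(math.log(abs(a)) for a in scale)
```

Scoring the original features against the adapted model gives the same log-likelihood as scoring the transformed features against the unadapted model plus the log-Jacobian term. In the feature-space view the model parameters are left untouched, so one transform per speaker can be applied to the incoming frames instead of rewriting every Gaussian.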
Most server-based speech recognition systems avoid most types of adaptation, and particularly model-space adaptation, but FIG. 1 shows an example of a multiple speaker application using OUFA and a single class cMLLR transform. During initial enrollment, an enrollment speech input from a new user is used to perform MLLR adaptation of an initial speaker independent acoustic model (SIAM) to generate a speaker dependent acoustic model (SDAM). The system thus has one such SDAM for each of the N speakers registered with the system, SDAM1-SDAMN. The enrollment speech input is also used to perform single class cMLLR adaptation, wherein the inverse cMLLR transform is used to adapt the feature processing blocks for that speaker: the speaker dependent SAT models (one of the blocks SAT1-SATN) and the OUFA models (one of the blocks OUFA1-OUFAN).
After enrollment, unknown input speech from a given user is initially processed by a speaker dependent front end (one of the blocks SDFE1-SDFEN) to produce a set of speech features representative of the speech input. These features are then processed by the speaker dependent blocks SAT and OUFA for that user, and input to the recognition engine. The recognition engine compares the input speech features to the SDAM for that user as constrained by the recognition language model and search algorithm to produce a representative recognition output.
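The per-speaker processing chain described above can be sketched structurally as follows. This is not the implementation of FIG. 1: the class, field, and function names are hypothetical, the processing blocks are represented as opaque callables, and applying SAT before OUFA is an assumption based on the order in which the blocks are named.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Features = List[List[float]]


@dataclass
class SpeakerProfile:
    """Hypothetical container for one registered speaker's blocks."""
    front_end: Callable[[List[float]], Features]  # speaker dependent front end (SDFE)
    sat: Callable[[Features], Features]           # speaker dependent SAT block
    oufa: Callable[[Features], Features]          # OUFA block
    acoustic_model: str                           # placeholder standing in for the SDAM


def recognize(
    speaker_id: str,
    samples: List[float],
    profiles: Dict[str, SpeakerProfile],
    engine: Callable[[Features, str], str],
) -> str:
    """Route the input through the blocks that belong to the given speaker."""
    profile = profiles[speaker_id]
    features = profile.front_end(samples)
    features = profile.sat(features)    # assumed order: SAT first, then OUFA
    features = profile.oufa(features)
    return engine(features, profile.acoustic_model)
```

The point of the sketch is that every block in the chain, as well as the acoustic model, is looked up per speaker, so a system serving N speakers holds N copies of each.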