A speech recognition system analyzes a user's speech to determine what the user said. Most speech recognition systems are frame-based. In a frame-based system, input speech is processed into a sequence of digital speech feature frames. Each speech feature frame can be thought of as a multi-dimensional vector that represents various characteristics of the speech signal present during a short time window of the input speech.
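The framing step can be illustrated with a minimal sketch. It makes assumptions the text does not state: a 25 ms window advanced in 10 ms steps, and a toy two-dimensional feature vector (log energy and zero-crossing rate) in place of the richer features a real recognizer would compute. The names `frame_signal` and `toy_feature_vector` are hypothetical.

```python
import math

def frame_signal(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a list of samples into overlapping short-time windows.

    The 25 ms window and 10 ms hop are assumed values, not taken from the text.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    frames = []
    start = 0
    # Keep only complete windows; a trailing partial window is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames

def toy_feature_vector(frame):
    """Map one window to a small feature vector: [log energy, zero-crossing rate]."""
    energy = sum(x * x for x in frame)
    log_energy = math.log(energy + 1e-10)  # small offset avoids log(0) on silence
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / (len(frame) - 1)
    return [log_energy, zcr]

# One second of a synthetic 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
samples = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
frames = frame_signal(samples, sample_rate)
features = [toy_feature_vector(f) for f in frames]
```

Each entry of `features` is one speech feature frame: a vector summarizing the signal within a single short window.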
The speech recognition system compares the input speech frames against statistical models to find the models that best match the speech feature characteristics, and then determines the representative text or semantic meaning associated with those best-matching models. Modern statistical models are state sequence models, such as Hidden Markov Models (HMMs), that model speech sounds (usually phonemes) using mixtures of Gaussian distributions.
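As a rough illustration of matching a frame against Gaussian mixture models, the sketch below scores a single feature vector under two mixtures and picks the better one. It is a simplification under stated assumptions: diagonal covariances, invented parameters, and hypothetical model labels `"aa"` and `"iy"`. It scores one frame in isolation, whereas an HMM-based recognizer would also account for state transitions across the whole frame sequence.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance (assumed form)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_gmm(x, weights, means, variances):
    """Log likelihood of x under a mixture of diagonal Gaussians."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    # Log-sum-exp keeps the sum numerically stable for very small likelihoods.
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def best_model(x, models):
    """Return the label of the mixture that gives x the highest log likelihood."""
    return max(models, key=lambda name: log_gmm(x, *models[name]))

# Hypothetical two-component mixtures; all parameter values are invented.
models = {
    "aa": ([0.6, 0.4], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "iy": ([0.5, 0.5], [[4.0, 4.0], [5.0, 5.0]], [[1.0, 1.0], [1.0, 1.0]]),
}
label_near_origin = best_model([0.2, 0.1], models)
label_far = best_model([4.5, 4.6], models)
```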
Many speech recognition systems use discriminative training techniques, which dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples of such discriminative training techniques are maximum mutual information (MMI), minimum classification error (MCE), and minimum phoneme error (MPE) techniques. Such techniques require the processing of numerous feature vectors of speech objects.
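To make the contrast concrete, the sketch below evaluates an MMI-style criterion next to a plain likelihood criterion on invented scores. It assumes that each training utterance comes with a combined acoustic and prior log score for every competing hypothesis. The data layout and function names are hypothetical, and only the objective is computed, not the parameter update that a training procedure would derive from it.

```python
import math

def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def likelihood_objective(utterances):
    """Sum of log scores of the correct hypotheses only (non-discriminative)."""
    return sum(scores[correct] for scores, correct in utterances)

def mmi_objective(utterances):
    """Sum of log posteriors of the correct hypotheses.

    Each utterance is (scores, correct), where scores maps a hypothesis to
    log p(X | hypothesis) + log P(hypothesis). The denominator ranges over all
    competing hypotheses, so raising a competitor's score lowers the objective.
    """
    return sum(
        scores[correct] - log_sum_exp(list(scores.values()))
        for scores, correct in utterances
    )

# Invented scores for two training utterances with two competing hypotheses each.
utterances = [
    ({"yes": -10.0, "no": -14.0}, "yes"),
    ({"yes": -12.0, "no": -11.5}, "no"),
]
ml_value = likelihood_objective(utterances)
mmi_value = mmi_objective(utterances)
```

The likelihood criterion depends only on how well the correct model fits, while the MMI-style value also falls whenever a competing model fits the same utterance well. That dependence on competitors is what ties the measure to classification.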