Hidden Markov Models (HMMs) are a well-established framework for a variety of pattern recognition applications, most prominently speech recognition. An HMM consists of interconnected states, where each state is represented by a Gaussian distribution or by a mixture of Gaussians. Speech units, such as phonemes, are associated with one or more HMM states. Typically, the means and variances of the distributions for the HMMs are learned from training data.
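To make the structure described above concrete, the following is a minimal sketch of how an HMM with one Gaussian per state assigns a likelihood to a sequence of observations, using the standard forward recursion in the log domain. The function names, the restriction to scalar observations, and the single-Gaussian (rather than mixture) states are assumptions made for illustration only and are not taken from any particular recognizer.

```python
import math

def log_gaussian(x, mean, var):
    # Log density of a one-dimensional Gaussian (illustrative: scalar features).
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def logsumexp(values):
    # Numerically stable log(sum(exp(v))) for a list of log values.
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log_likelihood(obs, log_init, log_trans, means, variances):
    # log_init[s]     : log probability of starting in state s
    # log_trans[p][s] : log probability of moving from state p to state s
    # means, variances: per-state Gaussian parameters (hypothetical names)
    n = len(means)
    alpha = [log_init[s] + log_gaussian(obs[0], means[s], variances[s])
             for s in range(n)]
    for x in obs[1:]:
        alpha = [
            logsumexp([alpha[p] + log_trans[p][s] for p in range(n)])
            + log_gaussian(x, means[s], variances[s])
            for s in range(n)
        ]
    # Summing over the final state gives log p(obs | model).
    return logsumexp(alpha)

# Example: a two-state model scoring a short observation sequence.
example_ll = forward_log_likelihood(
    [0.2, 0.1, 1.9, 2.1],
    [math.log(0.9), math.log(0.1)],
    [[math.log(0.7), math.log(0.3)], [math.log(0.1), math.log(0.9)]],
    [0.0, 2.0],
    [0.5, 0.5],
)
```

Training, in this picture, amounts to choosing `means` and `variances` (and, in general, the transition probabilities) from data.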
One technique for training HMM parameters is to use a maximum likelihood criterion based on an Expectation-Maximization algorithm. Under this technique, the parameters are adjusted to maximize the likelihood of a set of training data. However, because training data are sparse, maximum likelihood training does not produce HMM parameters that are ideal for data that is not well represented in the training set.
Another method of training HMM parameters is known as discriminative training. In discriminative training, the goal is to set the HMM parameters so that the HMM is able to discriminate between a correct word sequence and one or more incorrect word sequences.
One specific form of discriminative training is known as minimum classification error (MCE) training. In MCE training, the HMM parameters are trained by optimizing an objective function that is closely related to classification errors, where a classification error is the selection of an incorrect word sequence instead of a correct word sequence. Although MCE training has been performed before, conventional MCE optimization has been based on a sequential gradient-descent technique named Generalized Probabilistic Descent (GPD), which optimizes the MCE objective function as a highly complex function of the HMM parameters. Such gradient-based techniques often require delicate tuning of the parameter-dependent learning rate.
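As an illustration of an objective that is "closely related to classification errors", the sketch below follows one commonly used MCE formulation: a misclassification measure compares the score of the correct word sequence with a soft maximum over competing sequences, and a sigmoid turns that measure into a smooth approximation of a 0/1 error count. The text above does not specify this exact form, so the function names and the smoothing parameters `eta` and `gamma` are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def misclassification_measure(correct_score, competitor_scores, eta=1.0):
    # Scores are log-domain discriminant values (e.g. log-likelihoods).
    # The competitors are combined with a soft maximum; as eta grows,
    # the result approaches the score of the single best competitor.
    m = max(competitor_scores)
    avg = sum(math.exp(eta * (s - m)) for s in competitor_scores)
    avg /= len(competitor_scores)
    soft_max = m + math.log(avg) / eta
    # Positive when the competitors outscore the correct sequence.
    return soft_max - correct_score

def mce_loss(correct_score, competitor_scores, gamma=1.0, eta=1.0):
    # Smooth surrogate for the 0/1 classification error on one utterance:
    # close to 0 when the correct sequence wins by a wide margin,
    # close to 1 when it loses by a wide margin.
    d = misclassification_measure(correct_score, competitor_scores, eta)
    return sigmoid(gamma * d)

# Example: the correct sequence scores better than both competitors,
# so the loss falls below one half.
example_loss = mce_loss(-100.0, [-105.0, -110.0])
```

A gradient-based method such as GPD adjusts the HMM parameters along the gradient of the summed loss, which is why the step size (learning rate) has to be tuned with care; `gamma` additionally controls how sharply the sigmoid approximates a hard error count.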
Another form of discriminative training is known as maximization of mutual information (MMI). Under MMI, an objective function related to the mutual information is optimized using one of a set of optimization techniques. One of these techniques is known as Growth Transformation (GT) or Extended Baum-Welch (EBW). However, GT/EBW was developed for rational functions such as mutual information. Because MCE does not provide a rational function, growth transformation/extended Baum-Welch optimization has not been applied to minimum classification error training.
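For comparison, the MMI criterion is often written as the sum, over training utterances, of the log of a ratio: the joint score of the reference word sequence divided by the total joint score of all hypothesized word sequences. The sketch below computes that quantity from per-hypothesis log scores. It is a minimal illustration under the assumption that each utterance comes with a finite hypothesis list containing the reference; the data layout and function names are hypothetical.

```python
import math

def logsumexp(values):
    # Numerically stable log(sum(exp(v))) for a list of log values.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mmi_objective(utterances):
    # utterances: list of (ref_index, log_joint_scores) pairs, where
    # log_joint_scores[k] = log p(X | W_k) + log P(W_k) for hypothesis k
    # and ref_index points at the reference (correct) word sequence.
    total = 0.0
    for ref_index, log_joint in utterances:
        # log of the ratio: reference score over the sum of all scores.
        total += log_joint[ref_index] - logsumexp(log_joint)
    return total

# Example: two utterances, each with three hypotheses.
example_value = mmi_objective([
    (0, [-50.0, -52.0, -55.0]),
    (1, [-80.0, -78.5, -81.0]),
])
```

Before the logarithm is taken, each term is a ratio of two sums of likelihoods, which is the rational form that GT/EBW was designed to optimize. The conventional MCE objective, which passes the scores through a sigmoid, does not present itself in that form directly.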
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.