Pattern recognition and data classification techniques, often referred to as supervised learning, attempt to find an approximation of, or hypothesis for, a target concept that assigns objects (such as processes or events) to different categories or classes. Pattern recognition can normally be divided into two phases, namely, a training phase and a testing phase. The training phase applies a learning algorithm to training data. The training data typically comprises descriptions of objects (a set of feature variables) together with the correct classification for each object (the class variable).
The goal of the training phase is to find correlations between object descriptions in order to learn how to classify the objects. In speech recognition systems, for example, the goal of the training phase is to find the Hidden Markov Model (HMM) parameters that will result in a speech recognizer with the lowest possible recognition error rate. The training data is used to construct models that can predict the class variable for a record in which the observations are given but the class variable for each observation needs to be determined. Thus, the end result of the training phase is a model or hypothesis (e.g., a set of rules) that can be used to predict the class of new objects. The testing phase uses the model derived in the training phase to predict the class of testing objects. The classifications made by the model are compared to the true object classes to estimate the accuracy of the model.
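The two phases can be illustrated with a minimal sketch. It assumes a nearest-centroid classifier, which is chosen here only for brevity and is not a technique named in this description; the function names and the toy feature vectors are likewise hypothetical.

```python
from collections import defaultdict

def train(records):
    """Training phase: learn one centroid (mean feature vector) per class."""
    sums = {}
    counts = defaultdict(int)
    for features, label in records:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, x in enumerate(features):
            sums[label][i] += x
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}

def predict(model, features):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((x - c) ** 2 for x, c in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))

def accuracy(model, records):
    """Testing phase: compare predicted classes with the true classes."""
    correct = sum(1 for features, label in records
                  if predict(model, features) == label)
    return correct / len(records)

# Hypothetical toy data: (feature variables, class variable).
training_data = [([1.0, 1.2], "a"), ([0.8, 1.0], "a"),
                 ([3.0, 3.1], "b"), ([3.2, 2.9], "b")]
testing_data = [([0.9, 1.1], "a"), ([3.1, 3.0], "b"), ([1.9, 1.9], "b")]

model = train(training_data)
test_accuracy = accuracy(model, testing_data)
print(test_accuracy)
```

In this toy data the third testing object lies closer to the centroid of class "a" than to that of its true class "b", so the estimated accuracy falls below 1.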
More specifically, the training or adaptation is typically done by maximizing some objective function, F(λ). Maximum likelihood as an optimization criterion has been widely used in many aspects of speech recognition. One successful example is estimating the HMM parameters, λ, in such a way that the likelihood (probability) of the observation sequence, O = (o_1, o_2, ..., o_T), given the current model parameters, P(O|λ), is locally maximized using an iterative procedure such as the Baum-Welch method.
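As a rough illustration of maximum likelihood training, the sketch below computes P(O|λ) with the forward algorithm and applies Baum-Welch re-estimation to a small HMM. It assumes a discrete-output HMM with two states and a single observation sequence; the initial parameters (pi, A, B) and the observation sequence are made-up values, and practical speech recognizers generally use continuous densities and many training utterances.

```python
def forward(obs, pi, A, B):
    """alpha[t][i] = P(o_1..o_t, state at time t is i | lambda)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha

def backward(obs, A, B):
    """beta[t][i] = P(o_{t+1}..o_T | state at time t is i, lambda)."""
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta

def likelihood(obs, pi, A, B):
    """The objective P(O | lambda), summed over all final states."""
    return sum(forward(obs, pi, A, B)[-1])

def baum_welch_step(obs, pi, A, B):
    """One re-estimation step; returns updated (pi, A, B)."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    p_obs = sum(alpha[-1])
    # gamma[t][i]: posterior probability of being in state i at time t.
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(n)]
             for t in range(T)]
    # xi_sum[i][j]: expected number of transitions from state i to state j.
    xi_sum = [[sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                   for t in range(T - 1)) / p_obs
               for j in range(n)] for i in range(n)]
    new_pi = gamma[0][:]
    new_A = [[xi_sum[i][j] / sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k)
              / sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B

# Hypothetical initial parameters and observation sequence (symbols 0 and 1).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
obs = [0, 1, 1, 0, 1, 1, 1, 0]

history = [likelihood(obs, pi, A, B)]
for _ in range(5):
    pi, A, B = baum_welch_step(obs, pi, A, B)
    history.append(likelihood(obs, pi, A, B))
print(history)
```

Each re-estimation step leaves P(O|λ) no lower than before, so the values in history form a non-decreasing sequence that approaches a local maximum rather than a guaranteed global one.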
A meaningful objective function should satisfy the condition that, whenever F(λ̂) > F(λ), λ̂ results in a better classifier or decoder than λ. This is not always true, however, when the likelihood P(O|λ) is used as the objective function, because there is no direct relation between the likelihood and the recognition error rate. A need therefore exists for an improved objective function that not only maximizes the discrimination between classes of training data, but also moves the criterion used in parameter estimation of a speech recognition system closer to the decoding criterion, thereby reducing the recognition error rate.
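The gap between likelihood and error rate can be seen in a small constructed example. The log-likelihood values below are invented purely for illustration and are not measurements: they describe three training utterances of true class "a" scored by the models for classes "a" and "b" under two hypothetical parameter settings, called lam and lam_hat here.

```python
def ml_objective(scores, labels):
    """Sum of log-likelihoods of each utterance under its correct class model."""
    return sum(s[label] for s, label in zip(scores, labels))

def error_count(scores, labels):
    """Number of utterances whose highest-scoring class is not the correct one."""
    return sum(1 for s, label in zip(scores, labels)
               if max(s, key=s.get) != label)

labels = ["a", "a", "a"]

# Hypothetical log P(O_i | class) under the first parameter setting.
scores_lam = [{"a": -10.0, "b": -11.0},
              {"a": -10.0, "b": -11.0},
              {"a": -10.0, "b": -11.0}]

# Hypothetical log P(O_i | class) under the second parameter setting.
scores_lam_hat = [{"a": -2.0, "b": -5.0},
                  {"a": -12.0, "b": -11.0},
                  {"a": -12.0, "b": -11.0}]

f_lam = ml_objective(scores_lam, labels)
f_lam_hat = ml_objective(scores_lam_hat, labels)
errors_lam = error_count(scores_lam, labels)
errors_lam_hat = error_count(scores_lam_hat, labels)
print(f_lam, errors_lam)
print(f_lam_hat, errors_lam_hat)
```

Here the second setting has the higher likelihood objective, yet it misclassifies two of the three utterances while the first setting misclassifies none, because the likelihood criterion ignores how the competing class model scores the same data.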