The present invention relates to model estimation and, more particularly, to apparatus and methods for performing model estimation utilizing a discriminant measure which enables the design of classifiers to optimize classification accuracy.
In a general classification problem with N classes, the training procedure involves constructing a model for each of the N classes using samples of data points from that class. Subsequently, these models are used to classify an incoming test data point by evaluating its "closeness" to each of the N models. Several categories of models are commonly used, for instance, linear discriminants (as disclosed by R. O. Duda and P. E. Hart in "Pattern Classification and Scene Analysis", Wiley, New York, 1973), neural networks (as disclosed by R. P. Lippmann in "Pattern Classification Using Neural Networks", IEEE Communications Magazine, vol. 27, no. 11, pp. 47-64, 1989) and gaussian mixtures. The training procedure usually involves selecting a category of model (including its size), and then adjusting the parameters of the model to optimize some objective function on the training data samples. The first step of the training procedure, which involves choosing the type and size of the model, is generally done in an ad-hoc fashion, but more recently, some objective criteria have been introduced as an alternative (e.g., as disclosed by Y. Normandin in "Optimal Splitting of HMM Gaussian Mixture Components with MMIE Training", Proceedings of the ICASSP, pp. 449-452, 1995). The second step of the training procedure involves training the parameters of the model. Several objective functions have been developed in the past to do this, the most commonly used ones being (i) maximizing the likelihood of the data points given the correct class (as disclosed by A. P. Dempster, N. M. Laird and D. B. Rubin in "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society (B), vol. 39, no. 1, pp. 1-38, 1977) or (ii) maximizing the likelihood of the correct class given the data points (as disclosed by L. R. Bahl, P. F. Brown, P. V. de Souza and R. L. Mercer in "Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition", Proceedings of the ICASSP, pp. 49-52, 1986 and as disclosed by B. H. Juang, W. Chou and C. H. Lee in "Minimum Classification Error Rate Methods for Speech Recognition", IEEE Trans. Speech and Audio Processing, vol. 5, pp. 257-265, May 1997).
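By way of illustration only, the two-stage procedure described above (build one model per class, then classify a test point by its "closeness" to each model) may be sketched as follows. The sketch assumes one-dimensional data, a single gaussian per class fitted by maximum likelihood, and log-likelihood as the measure of closeness; the function names and the training values are hypothetical and do not come from any cited reference.

```python
import math

def fit_gaussian(samples):
    """Maximum-likelihood mean and variance of 1-D samples (assumed model)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return mean, max(var, 1e-6)  # small floor keeps the variance positive

def log_likelihood(x, model):
    """Log density of x under a (mean, variance) gaussian."""
    mean, var = model
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def train(data_by_class):
    """Build one model per class from that class's own samples."""
    return {label: fit_gaussian(samples) for label, samples in data_by_class.items()}

def classify(x, models):
    """Pick the class whose model is 'closest', here the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(x, models[label]))

# Hypothetical training data for N = 2 classes.
training = {"a": [0.0, 0.2, -0.1, 0.1], "b": [3.0, 3.2, 2.9, 3.1]}
models = train(training)
print(classify(0.3, models), classify(2.8, models))
```

Any other category of model could be substituted for `fit_gaussian` and `log_likelihood` without changing the structure of `train` and `classify`.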
When modelling data samples corresponding to a class with, for example, a mixture of gaussians, the parameters of the model are the number of mixture components and the means, variances and prior probabilities of these components. In general, the number of mixture components is chosen using some simple ad-hoc rule subject to very loose constraints; for instance, the number of components has to be sufficient to model the data reasonably well, but not so large as to over-model the data. A typical choice is to make the number of components proportional to the number of data samples. However, such methods may result in models that are sub-optimal as far as classification accuracy is concerned. For instance, if too few gaussians are chosen to model a class, the class may often be misclassified, and if too many gaussians are chosen to model a class, the model may encroach upon the space of other classes as well. These two conditions will be referred to as "non-aggressive" and "invasive" models, respectively.
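As a minimal sketch of the parameters just listed, a one-dimensional gaussian mixture and an ad-hoc sizing rule of the proportional kind might look as follows. The class name, the `samples_per_component` ratio and the `max_components` cap are hypothetical values chosen for illustration; no particular ratio or cap is prescribed here.

```python
import math
from dataclasses import dataclass

@dataclass
class GaussianMixture:
    priors: list      # component prior probabilities, expected to sum to 1
    means: list       # one mean per component
    variances: list   # one variance per component

    def log_likelihood(self, x):
        """Log of the prior-weighted sum of component densities at x."""
        terms = [
            math.log(p) - 0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
            for p, m, v in zip(self.priors, self.means, self.variances)
        ]
        top = max(terms)  # log-sum-exp for numerical stability
        return top + math.log(sum(math.exp(t - top) for t in terms))

def components_for(n_samples, samples_per_component=50, max_components=16):
    """Ad-hoc rule: component count proportional to sample count (assumed ratio and cap)."""
    return max(1, min(max_components, n_samples // samples_per_component))

mixture = GaussianMixture(priors=[0.5, 0.5], means=[0.0, 4.0], variances=[1.0, 1.0])
print(components_for(200), mixture.log_likelihood(0.0))
```

A rule of this kind looks only at the amount of data for the class itself, which is why it can yield either too few or too many components relative to the neighbouring classes.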
Once the number of mixture components has been decided, the next step is to estimate the means and variances of the components. This is often done so as to maximize the likelihood of the training data samples. Although this gives the parameters that best fit the model to the data of its own class, it takes no account of the other classes, so it may result in a model that encroaches on the sample space of a different class and hence leads to misclassifications. Alternatives to maximum likelihood are maximum mutual information or MMI (as disclosed in the L. R. Bahl article cited above) and minimum classification error or MCE (as disclosed in the B. H. Juang article cited above), in which the model parameters are estimated directly so as to reduce misclassification error. Methods that use such objective functions are called discriminant methods because they try to maximize the discrimination power of the models.
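The difference between the two kinds of objective can be illustrated with a small sketch. It assumes one-dimensional data, a single gaussian per class given as a (mean, variance) pair, and fixed class priors; the data, the model values and the function names are hypothetical. The first function sums the log-likelihood of each point under its correct class, and the second sums the log posterior probability of the correct class, which is the quantity a conditional objective of the MMI kind would raise.

```python
import math

def log_gauss(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_sum_exp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def likelihood_objective(data_by_class, models):
    """Sum of log p(x | correct class); depends only on the correct-class model."""
    return sum(log_gauss(x, *models[c]) for c, xs in data_by_class.items() for x in xs)

def posterior_objective(data_by_class, models, priors):
    """Sum of log P(correct class | x); depends on every class model."""
    total = 0.0
    for c, xs in data_by_class.items():
        for x in xs:
            joint = {k: math.log(priors[k]) + log_gauss(x, *models[k]) for k in models}
            total += joint[c] - log_sum_exp(list(joint.values()))
    return total

# Hypothetical class-b data, and two candidate models for the competing class a.
b_only = {"b": [3.0, 3.2, 2.9, 3.1]}
priors = {"a": 0.5, "b": 0.5}
compact = {"a": (0.0, 1.0), "b": (3.0, 1.0)}
invasive = {"a": (0.0, 25.0), "b": (3.0, 1.0)}  # class a inflated toward class b

for name, models in (("compact", compact), ("invasive", invasive)):
    print(name, likelihood_objective(b_only, models),
          posterior_objective(b_only, models, priors))
```

With these illustrative values, the likelihood objective on the class-b data is identical for both candidates, because it never consults the class-a model, whereas the posterior objective is lower when the class-a model spreads into the region occupied by class b.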
However, it would be advantageous to provide an objective function that can be used both to select an optimum size for the model and to train the parameters of the model, and that may further be applied to any category of classifier, gaussian classifiers being one example. It would also be advantageous to provide a measure that can be used to determine the number of mixture components so as to avoid models characterized as "non-aggressive" or "invasive". It would still further be advantageous to provide an objective function that falls into the category of discriminant objective functions, but differs from MMI and MCE, and that can be used to tune the size of the models as well as to estimate the parameters of, for example, the gaussians in the models.