1. Field of the Invention
The present invention relates generally to training and evaluating probability models and more particularly relates to a method and apparatus for determining the gain and training starting point of a feature function for maximum entropy/minimum divergence (MEMD) probability models, in applications such as language modeling for speech recognition systems, language translation systems, grammar checking systems and the like.
2. Description of the Related Art
In language modeling applications, probability models are used to predict the occurrence of a word based on some prior history. In the past, n-gram models have been frequently used to predict a current word based on n-1 previous words. While this model has been found to be generally useful, there are occasions when this model is unable to accurately predict the outcome based on the limited history of n-1 words. In order to improve the performance of language modeling applications, it is desirable to augment the performance of the n-gram base model during those occurrences where the n-gram model is inadequate.
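As an illustration of the n-gram idea described above, the following is a minimal sketch of a bigram (n = 2) model estimated by maximum likelihood from raw counts. The function name `train_bigram` and the toy token sequence are hypothetical and serve only as an example; a practical language model would also apply smoothing, which is omitted here.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count adjacent word pairs and return a function giving P(word | prev)."""
    pair_counts = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        pair_counts[prev][word] += 1

    def prob(word, prev):
        # Maximum-likelihood estimate; 0.0 when the history was never seen.
        total = sum(pair_counts[prev].values())
        return pair_counts[prev][word] / total if total else 0.0

    return prob

# Toy corpus (hypothetical data for illustration only).
tokens = "the cat sat on the mat the cat ran".split()
prob = train_bigram(tokens)
print(prob("cat", "the"))   # probability of "cat" following "the"
print(prob("cat", "zebra")) # unseen history: the limited-history weakness
```

The second call shows the kind of case in which the limited history of n-1 words gives the model nothing to work with.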
One powerful technique for constructing probability models is known as maximum entropy/minimum divergence (MEMD) modeling. An MEMD model is constructed from a base model and a set of feature functions (features), whose empirical expectations, determined from a given training corpus (i.e., a large body of text for a language model), are known.
In a language modeling application, the features are generally binary functions which express a relationship between a current word being predicted and a history. The utility of the feature as a predictive element is referred to as the gain of the feature.
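A minimal sketch of how a single binary feature may modify a base model is given below. It assumes the commonly used exponential form, in which the conditional probability is proportional to q(w|h) multiplied by exp(alpha * f(w, h)) and is normalized separately for each history h. This form, the exponent name `alpha`, the helper names, and the toy probabilities are assumptions made for illustration and are not taken from the text above.

```python
import math

def memd_conditional(q_h, feature, history, alpha):
    """Return p(w|h) proportional to q(w|h) * exp(alpha * f(w, h)).

    q_h maps each vocabulary word to its base probability q(w|h).
    The result is normalized over the vocabulary for this one history.
    """
    weights = {w: q * math.exp(alpha * feature(w, history)) for w, q in q_h.items()}
    z = sum(weights.values())  # per-history normalizer
    return {w: v / z for w, v in weights.items()}

def trigger_feature(word, history):
    """Hypothetical binary feature: fires when 'loan' is predicted after 'bank'."""
    return 1 if word == "loan" and "bank" in history else 0

# Hypothetical base model over a three-word vocabulary.
q_h = {"loan": 0.1, "river": 0.3, "the": 0.6}
p_h = memd_conditional(q_h, trigger_feature, ["the", "bank", "approved"], alpha=math.log(3.0))
print(p_h)
```

When the feature does not fire for any word in a given history, the normalizer equals one and the base model is returned unchanged.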
One way of finding potentially useful features is to inspect the training corpus and observe which words co-occur. A large corpus, consisting of tens or hundreds of millions of words, and possibly a grammatical parsing, will generally yield millions of potential features (feature candidates). In order to maximize the efficiency of an MEMD model, only those features which exhibit the highest predictive power (gain) should be used in constructing the model. Therefore, the fundamental task with MEMD modeling is to evaluate the gain of the candidate features, rank them according to their utility, and retain those features that exhibit the highest gain values.
Currently, several methods for determining the gain of a feature in an MEMD model have been suggested. For example, in the article entitled "Inducing Features of Random Fields" by Della Pietra et al., Technical Report CMU-CS 95-144, School of Computer Science, Carnegie Mellon University, May 1995, a closed form analytic expression for the gain of a feature is derived. The derived expression applies to the gain of a feature for a single joint probability model, p(w h), which is constructed from a single joint prior, q(w h). However, this expression is not applicable to systems having many conditional models, p(w|h), where each model is individually normalized. This is the case in language models and in many other applications of MEMD models.
Another method used in the prior art for determining the gain of a feature is the Newton-Raphson method, an iterative procedure. On each iteration, the first and second derivatives of a gain function are calculated in order to find the argument value at which the first derivative of the gain is equal to zero (maximum gain). The gain of the feature is then determined from that argument value. However, each iteration requires a pass through the corpus, and several iterations are needed to reach convergence. Because a useful corpus in a language modeling application is on the order of tens of millions of words, multiple passes through the corpus generally require a prohibitive amount of computer processing time.
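The iterative procedure described above can be sketched as follows. The sketch assumes a binary feature and a gain function of the form G(alpha) = alpha * p~[f] - sum over corpus positions of weight * log(1 - q_hf + q_hf * exp(alpha)), where q_hf denotes the base-model probability that the feature fires at a given position and p~[f] is the empirical expectation of the feature. This functional form, the function names, and the toy numbers are assumptions chosen for illustration; the text above does not state the gain function explicitly.

```python
import math

def gain_and_derivatives(alpha, p_tilde_f, positions):
    """Return (G, G', G'') at alpha under the ASSUMED gain form.

    positions: list of (weight, q_hf) pairs, one per corpus position
    with q_hf > 0; weight is the empirical probability of that position.
    """
    e = math.exp(alpha)
    g = alpha * p_tilde_f
    g1 = p_tilde_f
    g2 = 0.0
    for weight, q_hf in positions:  # one full pass through the corpus
        z = 1.0 - q_hf + q_hf * e
        g -= weight * math.log(z)
        g1 -= weight * q_hf * e / z
        g2 -= weight * q_hf * (1.0 - q_hf) * e / (z * z)
    return g, g1, g2

def newton_raphson_gain(p_tilde_f, positions, alpha=0.0, tol=1e-10, max_iter=50):
    """Find alpha where G'(alpha) = 0, then return (alpha, G(alpha))."""
    for _ in range(max_iter):
        _, g1, g2 = gain_and_derivatives(alpha, p_tilde_f, positions)
        if abs(g1) < tol or g2 == 0.0:
            break
        alpha -= g1 / g2  # Newton-Raphson update on the first derivative
    gain, _, _ = gain_and_derivatives(alpha, p_tilde_f, positions)
    return alpha, gain

# Hypothetical two-position corpus with equal weights.
positions = [(0.5, 0.1), (0.5, 0.1)]
alpha_star, gain = newton_raphson_gain(0.25, positions)
print(alpha_star, gain)
```

Each call to `gain_and_derivatives` loops over every stored position, which corresponds to one pass through the corpus per iteration and illustrates why the processing cost grows with the number of iterations.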
As an alternative, the Newton-Raphson method can theoretically be applied in a single pass through the corpus. To do this, the value of the conditional expectation of the base probability model, q_hf, needs to be solved for each position in the corpus where q_hf > 0. Each solution is then stored for subsequent use. However, such an operation requires vast amounts of computer storage capacity. For practical language modeling applications, the required storage capacity is on the order of hundreds of gigabytes, an amount not generally available in a typical computer system.
Accordingly, there is a need in the art for a method of determining the gain of a feature in a probability model in a single pass through the corpus that requires only a modest amount of memory and computer processing time.