Maximum Entropy (ME) modeling is a general statistical modeling paradigm that may be applied in language modeling and natural language processing to predict linguistic behavior by incorporating various informative features, each encoding a statistical event of linguistic interest, from a corpus of data into a common framework of conditional models. Such modeling, however, may be computationally intensive.
ME modeling may be separated into two main tasks: a feature selection process that chooses from a feature event space a subset of desired features to be included in the model; and a parameter estimation process that estimates the weighting factors for each selected feature. In many applications, however, it may not be clear which features are important for a particular task so that a large feature event space may be required to ensure that important features are not missed. Yet, including all or nearly all features may cause data overfitting, may slow the predictive process, and may make the resulting model too large for resource-constrained applications.
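For reference, a conditional ME model is commonly written in the log-linear form sketched below. The notation is an assumption made here for illustration and is not taken from the text above: S denotes the selected feature subset of the feature event space Ω, f_i a feature function, and λ_i its weighting factor.

```latex
p_{\Lambda}(y \mid x) = \frac{1}{Z_{\Lambda}(x)} \exp\Big( \sum_{f_i \in S} \lambda_i \, f_i(x, y) \Big),
\qquad
Z_{\Lambda}(x) = \sum_{y'} \exp\Big( \sum_{f_i \in S} \lambda_i \, f_i(x, y') \Big),
\qquad S \subseteq \Omega
```

In this notation, feature selection may be viewed as choosing S, and parameter estimation as fitting the weights λ_i for the features in S.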
It is believed that more of the effort in ME modeling may have been focused on parameter estimation, and that less effort has been made in feature selection, since feature selection may not be required for certain tasks when parameter estimation algorithms are sufficiently fast. However, when the feature event space is necessarily large and complex, it may be desirable to perform at least some form of feature selection to speed up the probability computation, to reduce memory requirements during runtime, and to shorten the cycle of model selection during training. Unfortunately, when the feature event space under investigation is large, feature selection itself may be difficult and slow, since the universe of all possible feature subsets to choose from may be exceedingly large. In particular, the universe of all possible feature subsets may have a size of 2^|Ω|, where |Ω| is the size of the feature event space.
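The exponential growth in the number of candidate feature subsets may be illustrated with a short calculation. The function name `subset_count` and the example sizes are chosen here only for illustration.

```python
def subset_count(num_features: int) -> int:
    """Number of distinct subsets of a feature event space of the given size."""
    if num_features < 0:
        raise ValueError("num_features must be non-negative")
    return 2 ** num_features


# Illustrative sizes only; real feature event spaces may be far larger.
for size in (10, 20, 1000):
    digits = len(str(subset_count(size)))
    print(f"|Omega| = {size}: subset count has {digits} decimal digit(s)")
```

Even a feature event space of only 1000 features yields a subset count with 302 decimal digits, so an exhaustive search over subsets may not be practical.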
Various techniques may be applied to facilitate and/or minimize the task of feature selection. As discussed in Ronald Rosenfeld, “Adaptive Statistical Language Modeling: A Maximum Entropy Approach”, Ph.D. thesis, Carnegie Mellon University, April 1994 (“Rosenfeld (1994)”); Adwait Ratnaparkhi, “Maximum Entropy Models for Natural Language Ambiguity Resolution”, Ph.D. thesis, University of Pennsylvania, 1998 (“Ratnaparkhi (1998)”); J. Reynar and A. Ratnaparkhi, “A Maximum Entropy Approach to Identifying Sentence Boundaries”, Proceedings of the Fifth Conference on Applied Natural Language Processing 1997, Washington D.C., 16-19 (“Reynar and Ratnaparkhi (1997)”); and Rob Koeling, “Chunking with Maximum Entropy Models”, Proceedings of CoNLL-2000 and LLL-2000, Lisbon, Portugal, 139-141 (“Koeling (2000)”), a simple count cutoff technique may be used, in which only the features that occur in a corpus more often than a pre-defined cutoff threshold are selected. As discussed in Ratnaparkhi (1998), the count cutoff technique may be fast and easy to implement, but the resulting feature set may contain a large number of redundant features. A more refined algorithm, the so-called incremental feature selection (IFS) algorithm referred to in Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra, “A Maximum Entropy Approach to Natural Language Processing”, Computational Linguistics, 22 (1): 39-71, 1996 (“Berger et al. (1996)”), requires that only one feature be added at each selection stage and that estimated parameter values be retained for the features selected in the previous stages. In this regard, for each selection stage, the IFS algorithm may be used to compute the feature gains for all the candidate features (a measure of the informative content of the features), select the feature with the maximum gain, and then adjust the model with the selected feature.
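The two techniques described above may be sketched as follows. This is a minimal illustration, not the implementation of any cited work: the function names, the toy corpus, and the `coverage_gain` function are hypothetical. In particular, `coverage_gain` is only a stand-in for the likelihood-based feature gain, and the model adjustment performed after each selection is omitted, so that only the greedy one-feature-per-stage control flow is shown.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, List, Sequence, Set


def count_cutoff(events: Iterable[Iterable[Hashable]], cutoff: int) -> Set[Hashable]:
    """Keep every feature that occurs in the corpus more than `cutoff` times."""
    counts = Counter(f for event in events for f in event)
    return {f for f, c in counts.items() if c > cutoff}


def incremental_select(
    candidates: Iterable[Hashable],
    gain: Callable[[Hashable, Sequence[Hashable]], float],
    max_features: int,
    min_gain: float = 0.0,
) -> List[Hashable]:
    """Greedy selection: add the single highest-gain candidate at each stage.

    `gain(f, selected)` is a caller-supplied (hypothetical) scoring function
    that rates candidate `f` given the features selected so far.
    """
    selected: List[Hashable] = []
    remaining = list(candidates)
    while remaining and len(selected) < max_features:
        # Every stage re-scores every remaining candidate (the costly step).
        scored = [(gain(f, selected), f) for f in remaining]
        best_gain, best = max(scored, key=lambda pair: pair[0])
        if best_gain <= min_gain:
            break  # no remaining candidate adds anything useful
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy corpus: each event is the list of features that fire on it.
corpus = [
    ["w=the", "pos=DT"],
    ["w=the", "pos=DT"],
    ["w=cat", "pos=NN"],
    ["w=the", "pos=DT"],
    ["w=dog", "pos=NN"],
]


def coverage_gain(feature: Hashable, selected: Sequence[Hashable]) -> float:
    """Stand-in gain: events that `feature` explains and `selected` does not."""
    return sum(
        1 for event in corpus
        if feature in event and not any(s in event for s in selected)
    )


kept = count_cutoff(corpus, cutoff=1)
chosen = incremental_select(sorted(kept), coverage_gain, max_features=10)
print(sorted(kept))  # count cutoff keeps both "w=the" and "pos=DT"
print(chosen)        # greedy selection drops the redundant one
```

In this toy corpus, "w=the" and "pos=DT" always fire together, so the count cutoff retains both, whereas the greedy loop keeps only one of them. The loop also re-scores every remaining candidate at every stage.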
As compared to the simple count cutoff technique, the IFS algorithm may remove the redundancy in the selected feature set, but the speed of the algorithm may be an issue for complex tasks. Recognizing this drawback of the IFS algorithm, Adam L. Berger and Harry Printz, in “A Comparison of Criteria for Maximum Entropy/Minimum Divergence Feature Selection”, Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing, Granada, Spain 1998 (“Berger and Printz (1998)”), proposed a φ-orthogonal condition for selecting k features at the same time without much affecting the quality of the selected features. While this technique may be applicable for certain feature sets, such as link features between words, the φ-orthogonal condition may not hold if part-of-speech (POS) tags are dominantly present in a feature subset.
Stanley Chen and Ronald Rosenfeld, in “Efficient Sampling and Feature Selection in Whole Sentence Maximum Entropy Language Models”, Proceedings of ICASSP-1999, Phoenix, Ariz. (“Chen and Rosenfeld (1999)”), experimented with a feature selection technique that uses a χ² test to determine whether a feature should be included in the ME model, where the χ² test is computed using the counts from a prior distribution and the counts from the real training data. This technique may be sufficient for some language modeling tasks. However, a relationship between the χ² test score and the likelihood gain, which may be required to optimize the ME model, may be absent.
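A test of this general kind may be sketched as follows. The sketch assumes a Pearson statistic with one degree of freedom that compares how often a feature fires in the training data with how often it would be expected to fire under the prior distribution. The exact statistic used in the cited work may differ, and the function names, the example counts, and the 0.05 significance level are assumptions made for illustration.

```python
# Chi-square critical value for 1 degree of freedom at significance level 0.05.
CRITICAL_95 = 3.841


def chi_square_feature_score(observed: int, expected: float, total: int) -> float:
    """Pearson chi-square over two cells: feature fires / feature does not fire.

    observed: number of training events on which the feature fires
    expected: number of events on which it would fire under the prior
    total:    total number of events
    """
    if not 0 < expected < total:
        raise ValueError("expected must lie strictly between 0 and total")
    if not 0 <= observed <= total:
        raise ValueError("observed must lie between 0 and total")
    fires = (observed - expected) ** 2 / expected
    # (total - observed) - (total - expected) has the same square as (observed - expected).
    silent = (observed - expected) ** 2 / (total - expected)
    return fires + silent


def keep_feature(observed: int, expected: float, total: int,
                 critical: float = CRITICAL_95) -> bool:
    """Include the feature only if its score exceeds the critical value."""
    return chi_square_feature_score(observed, expected, total) > critical


# Hypothetical counts: the prior predicts 20 firings in 1000 events.
print(keep_feature(observed=30, expected=20.0, total=1000))
print(keep_feature(observed=22, expected=20.0, total=1000))
```

The score measures only how far the training counts depart from the prior's expectation; it does not by itself indicate how much the likelihood of the ME model would improve if the feature were added.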
In sum, the existing feature selection algorithms may be slow, may select features with less than optimal quality, may involve a non-trivial amount of manual work, or may have a low reduction rate. Consequently, those who use existing feature selection algorithms may use a much smaller or constrained feature event space, which may miss important undiscovered features, or they may build a larger model, which may impose an extra demand on system memory requirements.