1. The Field of the Present Invention
The present invention relates generally to an apparatus, system and method for creating a general-purpose adaptive or static machine-learning classifier using prediction by partial matching (“PPM”) language modeling. This classifier can incorporate homogeneous or heterogeneous feature types, variable-size contexts, and sequential or non-sequential features. Features are ordered (linearized) by information saliency, and truncation of the least-informative context is used as the backoff for previously unseen events. Labels may be endogenous (drawn from within the group of feature types) or exogenous (drawn from outside that group).
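By way of illustration only, the ordering of features by information saliency and the backoff by truncation described above can be sketched as follows. The sketch assumes that saliency is measured as information gain with respect to the label, which the text does not specify; the function names (`information_gain`, `train`, `predict`) and the symbolic, count-based prediction are likewise hypothetical choices made for the example.

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    """Shannon entropy (bits) of a Counter of label counts."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

def information_gain(rows, labels, j):
    """Assumed saliency measure: reduction in label entropy given feature j."""
    base = entropy(Counter(labels))
    by_value = defaultdict(Counter)
    for row, label in zip(rows, labels):
        by_value[row[j]][label] += 1
    n = len(rows)
    conditional = sum(sum(c.values()) / n * entropy(c) for c in by_value.values())
    return base - conditional

def train(rows, labels):
    """Linearize features from most to least salient and count every prefix."""
    order = sorted(range(len(rows[0])),
                   key=lambda j: information_gain(rows, labels, j),
                   reverse=True)
    table = defaultdict(Counter)
    for row, label in zip(rows, labels):
        key = tuple(row[j] for j in order)
        for k in range(len(key) + 1):
            table[key[:k]][label] += 1
    return order, table

def predict(order, table, row):
    """Symbolic prediction: truncate the least salient features until a
    previously seen context is found, then return its most frequent label."""
    key = tuple(row[j] for j in order)
    for k in range(len(key), -1, -1):
        counts = table.get(key[:k])
        if counts:
            return counts.most_common(1)[0][0]

# Toy data: the label depends only on the second feature.
rows = [("x", "red"), ("y", "red"), ("x", "blue"), ("y", "blue")]
labels = ["A", "A", "B", "B"]
order, table = train(rows, labels)
prediction = predict(order, table, ("z", "red"))  # "z" was never seen
```

In this toy data the second feature carries all of the information about the label, so it is placed first. The unseen value "z" falls in the least salient position and is truncated away, leaving the context ("red",) to supply the prediction.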
2. General Background
The problem we are trying to solve is simple to state: can we exploit the known excellent modeling properties of the PPM language model approach for general-application machine learning? The PPM language models are easy to understand and implement; have a solid theoretical basis; and have been shown to construct state-of-the-art models for compression applications. Over a long period the entropy measures generated using the PPM language models have been the state of the art.
Furthermore, PPM language models were from the beginning adaptive because compression required them to be that way. Adaptive models learn from what they are exposed to over time. This can be helpful when compressing heterogeneous collections of documents, and different kinds of files. Poorly compressing language models can be ignored or discarded and new models started.
It is uncommon for a state-of-the-art machine-learning classifier to have both static and adaptive implementations. A few algorithms (e.g., naive Bayes and the non-parametric lazy learners such as k nearest neighbor) have this capability (nevertheless, the adaptive variants are not frequently used; many applications cannot supply accurate truth data for updates).
Adaptation for natural language tasks is incredibly valuable. It has been observed repeatedly that models that adapt to groups or individuals outperform, sometimes very substantially, generic models. Furthermore, models that can incorporate feedback are able to improve over time. The PPM algorithm and its PPM classifier embodiment permit adaptation.
Traditional sequential techniques such as HMMs have very large numbers of parameters to estimate. In order to build models from very large amounts of data, it is usually necessary to discard less frequent training data and to use small contexts (bigrams instead of trigrams). In addition, most language modeling approaches use only homogeneous features (usually words). Other techniques that allow heterogeneous feature types (e.g., maximum entropy or conditional random fields) estimate parameters using computationally expensive numerical methods. They are also not adaptive or updatable. In addition, for many machine-learning tasks they often require additional sequence computations (e.g., the Viterbi algorithm) to determine optimal results. Because PPM classifiers can incorporate preceding and succeeding contexts, point-wise classifications can be generated that do not require Viterbi processing to determine optimal predictions.
The University of Waikato research group's applications of the PPM compression scheme to natural language tasks have almost always handled classification very differently from the PPM classifier. Document categorization, genre determination, author identification and language identification have used very simple minimal cross-entropy measures computed with multiple (class-specific) language models. These are more or less straightforward applications of language modeling.
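By way of illustration only, minimal cross-entropy classification with class-specific language models can be sketched as follows. To keep the example short, an add-one-smoothed character bigram model stands in for a PPM model; the class name `CharBigramModel`, the function `classify`, and the toy training strings are hypothetical and are not taken from the cited work.

```python
import math
from collections import Counter

class CharBigramModel:
    """Add-one-smoothed character bigram model (a stand-in for a PPM model)."""

    def __init__(self, text, alphabet_size=256):
        self.pairs = Counter(zip(text, text[1:]))   # (previous, current) counts
        self.firsts = Counter(text[:-1])            # counts of each previous char
        self.alphabet_size = alphabet_size

    def cross_entropy(self, text):
        """Average bits per character needed to encode `text` with this model."""
        bits, n = 0.0, 0
        for prev, cur in zip(text, text[1:]):
            p = (self.pairs[(prev, cur)] + 1) / (self.firsts[prev] + self.alphabet_size)
            bits -= math.log2(p)
            n += 1
        return bits / max(n, 1)

def classify(text, models):
    """Return the label whose model yields the lowest cross-entropy."""
    return min(models, key=lambda label: models[label].cross_entropy(text))

# One model per class, each trained only on text from that class.
models = {
    "english": CharBigramModel("the cat sat on the mat and the dog ran to the house"),
    "spanish": CharBigramModel("el gato se sento en la alfombra y el perro corrio a la casa"),
}
guess = classify("the dog and the cat", models)
```

The label is chosen solely by comparing how well each class-specific model compresses the input; no single model is trained to discriminate among the classes.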
On the other hand, the approaches the research group used for such tasks as extraction of entities or word segmentation are quite different, often involving integration of multiple models and Viterbi (or Viterbi-like) computation of optimal sequences.
The PPM classifier approach proposed here uses either multiple or single PPM classifiers that are trained in a manner that is not much different from other classifiers. For example, a word segmentation task would be approached by creating a data set that identifies those points in a text where segments appear. The PPM classifier would then be trained on labeled instances of homogeneous sequences of characters before and after a given focus (and hence non-contiguous), with only two labels (i.e., the exogenous labels split and nosplit). For any given context, the PPM classifier supplies the probabilities of split and nosplit (or, in a symbolic variant, the most likely label). Note that more context can be used than in a traditional PPM model, which has only left (historical) context. Furthermore, other information can be included in the PPM classifier (e.g., the lexical class of characters, such as lexical, numeric or symbol; or their phonetic properties, such as consonant vs. vowel). Some languages place limitations on syllable types (e.g., Polynesian languages have only open syllables; that is, syllables must end in a vowel), and this could be exploited using very small amounts of training data if these phonetic properties were provided.
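By way of illustration only, the construction of labeled training instances for the word segmentation example can be sketched as follows. The function name `make_instances`, the context width of two characters on each side, and the padding symbol are assumptions made for the example; the text does not fix any of them.

```python
def make_instances(segmented, width=2, pad="#"):
    """Turn a space-segmented string into (left, right, label) instances,
    one for each gap between adjacent characters of the unsegmented text."""
    chars, splits = [], set()
    for ch in segmented:
        if ch == " ":
            splits.add(len(chars))   # a segment boundary precedes the next char
        else:
            chars.append(ch)
    text = "".join(chars)
    padded = pad * width + text + pad * width

    instances = []
    for gap in range(1, len(text)):
        i = gap + width              # index in `padded` of the char right of the gap
        left = padded[i - width:i]   # preceding context
        right = padded[i:i + width]  # succeeding context
        label = "split" if gap in splits else "nosplit"
        instances.append((left, right, label))
    return instances

instances = make_instances("the cat")
for left, right, label in instances:
    print(f"{left}|{right} -> {label}")
```

Each instance pairs the characters on either side of a candidate boundary with one of the two exogenous labels. Heterogeneous features, such as a character-class or consonant/vowel code for each context character, could be appended to the same tuple.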
U.S. Pat. No. 8,024,176 teaches a very specific sub-type of the PPM classifier. Minimal suffix prediction uses homogeneous sequential features found only at the ends of words (suffixes), exogenous labels, variable-length suffix contexts, symbol-only prediction and minimization. The '176 patent provides no method for incorporating prefixes or other non-sequential features such as the previous or following words into the predictor.
The Fisher text-to-phone prediction algorithm allows for preceding and succeeding contexts but uses an unmotivated backoff approach. (The author performed a few suggestive experiments that are indirectly supportive of his proposed backoff procedure.) The text-to-phone algorithm uses homogeneous but non-contiguous features (characters); exogenous labels (phones); fixed-length preceding and succeeding contexts; and statistical and symbolic prediction and minimization. The Fisher algorithm is static and, unlike the PPM language modeling approach, does not have a well-justified backoff strategy. Fisher applies his algorithm only to text-to-phone mapping.
The Bilmes and Kirchhoff factored language model (FLM) is a static, non-minimized language modeling approach for predicting the most likely next word, using sequential, non-homogeneous features; with fixed-length preceding contexts and a well-established Bayesian, Markov model approach (not PPM); and with endogenous labels (words). The FLM offers one or more backoff strategies. FLM implementations use either custom-designed or ad hoc backoff approaches, or use an optimization technique (the authors offer a genetic algorithm) to construct and rank backoff strategies. The authors do not use any form of information saliency to determine optimal backoff strategies. This modeling technique is used only for language modeling (predicting the next word). The authors do not suggest the use of this algorithm for any other machine-learning problems.
The classic PPM language model is equivalent to a hidden Markov model (HMM). The most important differences between them are (a) PPM employs its own version of backoff, using its “exception” mechanism to compute a mixture of ngram models to estimate probabilities; and (b) PPM is an adaptive language model. In other words, the PPM classifier, for sequential machine-learning tasks for which there is no change in feature ordering introduced by information salience values, will make predictions that are nearly identical to those of an HMM classifier. Small differences can arise because the two types of classifiers use different backoff methods.
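By way of illustration only, the way PPM blends ngram models of decreasing order through its backoff mechanism (usually called an "escape" in the PPM compression literature) can be sketched as follows. The sketch assumes the escape estimate commonly known as method C, in which a context that has been followed n times by d distinct symbols assigns probability d/(n+d) to backing off; the text does not state which estimate is intended. The class name `SimplePPM` is hypothetical, and the exclusion refinement used by practical PPM coders is omitted, so the resulting probabilities do not sum exactly to one.

```python
from collections import Counter, defaultdict

class SimplePPM:
    """Simplified PPM model: method-C escapes, no exclusion."""

    def __init__(self, max_order=2, alphabet_size=256):
        self.max_order = max_order
        self.alphabet_size = alphabet_size
        self.counts = defaultdict(Counter)   # context string -> next-symbol counts

    def update(self, history, symbol):
        """Adaptive step: count `symbol` in every context of order 0..max_order."""
        for k in range(min(self.max_order, len(history)) + 1):
            context = history[len(history) - k:]
            self.counts[context][symbol] += 1

    def train(self, text):
        for i, symbol in enumerate(text):
            self.update(text[:i], symbol)

    def prob(self, history, symbol):
        """Probability of `symbol`, backing off from the longest context."""
        p_escape = 1.0
        for k in range(min(self.max_order, len(history)), -1, -1):
            context = history[len(history) - k:]
            seen = self.counts.get(context)
            if not seen:
                continue                      # unseen context: back off at no cost
            total, distinct = sum(seen.values()), len(seen)
            if symbol in seen:
                return p_escape * seen[symbol] / (total + distinct)
            p_escape *= distinct / (total + distinct)   # pay for the escape
        return p_escape / self.alphabet_size  # order -1: uniform over the alphabet

model = SimplePPM(max_order=2, alphabet_size=4)
model.train("abab")
p_seen = model.prob("ab", "a")    # predicted directly by the order-2 context
p_novel = model.prob("ab", "c")   # never seen: escapes down to the uniform model
```

Because `update` only increments counts, the same model can continue to adapt after deployment by calling it on each newly observed symbol.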
In summary, these earlier publications describe technologies that embody some of the attributes of the PPM classifier. However, in no instance did their authors generalize the approach to tackle other machine-learning tasks.
The PPM classifier stands alone as a general-purpose machine-learning classifier with application to a wide range of classification tasks.
What is needed is a classifier that has a solid theoretical basis and validated excellent model-building performance, and that supports:
sequential or non-sequential features;
contiguous or non-contiguous features;
static or adaptive modeling;
homogeneous or heterogeneous feature types;
endogenous or exogenous labels;
statistical or symbolic classification;
variable-size contexts;
theoretically and empirically justified backoff to make optimal predictions for unseen events;
complete or minimized models;
efficient processing and low-memory footprint for training and prediction; and
simple training and prediction implementations using widely available programming components (hash tables, trees).