1. Field of Technology
The invention relates generally to machine learning and two-class classification systems.
Glossary
The following definitions are provided merely to help readers generally to understand commonly used terms in machine learning, statistics, and data mining. The definitions are not designed to be completely general but instead are aimed at the most common case. No limitation on the scope of the invention (see claims section, infra) is intended, nor should any be implied.
“Classification” shall mean mapping (e.g., via “feature” extraction, statistical modeling, model selection, parameter estimation, non-parametric modeling, or the like) from unlabeled records (typically represented by “feature” vectors) to discrete classes; “classifiers” have a form or model (e.g., a decision tree) plus an induction learning procedure and an interpretation procedure; some classifiers also provide scores or probability estimates which can be compared against a predetermined factor, such as a threshold value, to yield a discrete class decision; Support Vector Machines, Naïve Bayes, logistic regression, C4.5 decision trees, and the like, are examples of known classifiers.
“Data set” shall mean a schema and a set of “records” matching the schema (no ordering of “records” is assumed); a set of values of interest, “category” or “class;” often a schema of discrete “positives” and “negatives,” as in medical tests.
“F-measure” shall mean the harmonic mean of “Precision” and “Recall,” which may be represented by a relationship: 2PR/(P+R), where “P” is Precision and “R” is Recall.
“Feature value” is an attribute and its value for a given record; “feature vector” shall mean a list of feature values describing a “record;” also sometimes referred to as an “example,” a “case,” or a “tuple.”
“Induction algorithm” or “Inducer” shall mean an algorithm that takes as input specific feature vectors labeled with their class assignments and produces a model that generalizes beyond the data set; most induction algorithms generate “models” that can then be used as classifiers, regressors, patterns for human consumption, and input to subsequent stages of “knowledge discovery” and “data mining.”
“Record” shall mean each single object from which a model will be learned or on which a model will be used; generally described by “feature vectors;” also sometimes referred to as an “example,” or “case.”
“Knowledge discovery” shall mean the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
“Machine learning” (a sub-field of artificial intelligence) is the field of scientific study that concentrates on “induction algorithms” and other algorithms that can be said to learn; generally, it shall mean the application of “induction algorithms,” which is one step in the “knowledge discovery” process.
“Model” shall mean a structure and corresponding interpretation that summarizes or partially summarizes a data set for description or prediction.
“Precision” is the percentage of items classified as positive that are actually positive.
“Recall” is the percentage of actual positives that are classified as positive (see also, “tpr,” infra).
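By way of illustration only, the Precision, Recall, and F-measure definitions above may be sketched as follows. The function name, the argument names, and the example counts are hypothetical and are not taken from the specification; the zero-denominator handling is likewise an assumption.

```python
def precision_recall_f(true_pos, false_pos, false_neg):
    """Return (precision, recall, f_measure) as fractions in [0, 1].

    true_pos:  items classified positive that are actually positive
    false_pos: items classified positive that are actually negative
    false_neg: actual positives that were classified negative
    """
    classified_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    # Assumption: report 0.0 when a denominator would be zero.
    precision = true_pos / classified_pos if classified_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts chosen only for illustration.
p, r, f = precision_recall_f(true_pos=8, false_pos=2, false_neg=8)
print(f"precision={p:.3f} recall={r:.3f} F-measure={f:.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the F-measure in this example lies closer to the Recall value than to the Precision value.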
2. General Background
Machine learning encompasses a vast array of tasks and goals. Document categorization, news filtering, document routing, personalization, and the like constitute an area of endeavor where machine learning can greatly improve computer usage. As one example, when searching the World Wide Web (hereinafter “Web”), a user may develop a personalization profile, a positive class-of-interest for selecting news articles-of-interest from the millions of news articles available at any given moment in time. Machine learning for text classification is the cornerstone of document categorization, news filtering, document routing and personalization.
The potential is great for machine learning to categorize, route, filter and search for relevant text information. Good feature selection may improve classification accuracy or, equivalently, reduce the amount of training data needed to obtain a desired level of performance, and conserve computation, storage and network resources needed for training and all future use of the classifier. For example, to build and populate a Web portal or news directory, a data mining practitioner would identify a modest number of training examples for each relevant category, and then an induction algorithm can learn the pattern and identify additional matches to populate the portal or directory. In such text-based domains, effective feature selection is essential to make the learning task tractable and more accurate. However, problem sizes continue to scale up with the explosive growth of the Internet. Performance goals may be expressed as accuracy, F-measure, precision, or recall, each of which may be appropriate in different situations.
In text classification, a data mining practitioner typically uses a “bag-of-words” model; a sample model is shown in FIG. 3 (Prior Art), in tabular format which, in practice, may have many more rows and columns (represented in the table as “ . . . ”). Each position in the input feature vector corresponds to a given word, e.g., the occurrence of the word “free” may be a useful feature in classifying junk e-mail, also colloquially referred to as “spam.” The number of potential words often exceeds the number of training documents by an order of magnitude. Feature selection is necessary to make the problem tractable for a classifier. Well-chosen features can substantially improve the classification accuracy or, equivalently, reduce the amount of training data needed to obtain a desired level of performance. Eliminating insignificant features improves scalability, conserving computation, storage and network resources for the training phase and for every future use of the classifier. Conversely, poor feature selection limits performance, since no degree of clever induction can make up for a lack of predictive signal in the input features sent to the classifier. To partially compensate for poor feature selection heuristics, a larger number of features can be selected, but this harms scalability and performance.
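By way of illustration only, a bag-of-words representation of the kind described above may be sketched as follows. The sketch assumes Boolean word-presence features and whitespace tokenization, neither of which is specified here; the function name and the three example documents are hypothetical and do not reproduce FIG. 3.

```python
def bag_of_words(documents):
    """Build a vocabulary and one Boolean feature vector per document.

    Each position in a feature vector corresponds to one vocabulary word;
    a 1 means the word occurs in that document (assumption: presence only,
    not occurrence counts).
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for words in tokenized for word in words})
    position = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for words in tokenized:
        vector = [0] * len(vocabulary)
        for word in words:
            vector[position[word]] = 1
        vectors.append(vector)
    return vocabulary, vectors


# Hypothetical training documents.
docs = ["free money now", "meeting agenda attached", "free trial offer"]
vocab, vectors = bag_of_words(docs)
free_column = vocab.index("free")
print(vocab)
print([v[free_column] for v in vectors])  # presence of "free" per document
```

Even in this small example the vocabulary is larger than the number of documents, which suggests why feature selection is applied before training.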
It has been found that selecting features separately for each class, versus all together, extends the reach of induction algorithms to greater problem sizes having greater levels of class skew. High class skew, where there are, for example, many more negatives than positives, presents a particular challenge to induction algorithms, which are hard pressed to beat the high accuracy achieved by simply classifying everything as the negative majority class. High skew in the class distribution makes it much more important to supply the induction algorithm with well-chosen features. In text classification problems, there is typically a substantial skew which worsens as the problem size scales upwardly. Returning to an earlier example, in selecting news articles that best match one's personalization profile, the positive class of interest contains many fewer articles on the Web than the negative class background, especially if the background class is, e.g., “all new articles posted on the Web.” For multi-class problems, the skew increases with the number of classes. It would seem that the future presents classification tasks with ever increasing skews.
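By way of illustration only, the effect of class skew on a trivial classifier that labels every record negative may be sketched as follows. The function name and the class sizes are hypothetical and are chosen only to show the trend.

```python
def all_negative_baseline(pos, neg):
    """Score a classifier that labels every record as negative.

    pos, neg: number of actual positive and negative records.
    Returns (accuracy, recall) as fractions in [0, 1].
    """
    accuracy = neg / (pos + neg)  # every negative is right, every positive wrong
    recall = 0.0                  # no actual positive is classified as positive
    return accuracy, recall


# Hypothetical class distributions with increasing skew.
for pos, neg in [(500, 500), (100, 900), (10, 990)]:
    accuracy, recall = all_negative_baseline(pos, neg)
    print(f"pos={pos:4d} neg={neg:4d} accuracy={accuracy:.2f} recall={recall:.2f}")
```

As the skew grows, the accuracy of this baseline approaches one while its Recall stays at zero, which is why accuracy alone can be a misleading goal for highly skewed problems.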
Prior art methods for feature selection (i.e., deciding which features are the most predictive indicators to use for training a classifier) include, e.g., Information Gain (IG), Odds Ratio, the Chi-Squared Test, and the like, as would be known to practitioners skilled in the art. Each uses a specific formulaic method for selecting features discriminatively for training a classifier. Each begins by counting the number of feature occurrences of each word in the positive class (“tp”) and in the negative class (“fp”). For example, in FIG. 3, the feature “free” occurs in two of the three positive training examples; tp=2, pos=3. These counts are sufficient statistics for computing each method's score. Improved feature selection is highly important for classification tasks, both to make them tractable for machine learning and to improve classifier performance.
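By way of illustration only, the counting step and two of the named prior art scores may be sketched as follows. The sketch uses the standard textbook forms of Information Gain and the two-by-two Chi-Squared statistic, which are assumed here because no formulas are given above. The function names and the training documents are hypothetical; only tp=2 and pos=3 for the word “free” follow the FIG. 3 example, and the negative-class counts are invented for the illustration.

```python
import math
from collections import Counter


def count_features(docs, labels):
    """Count, per word, the positive (tp) and negative (fp) documents containing it."""
    tp, fp = Counter(), Counter()
    pos = sum(1 for is_positive in labels if is_positive)
    neg = len(labels) - pos
    for words, is_positive in zip(docs, labels):
        (tp if is_positive else fp).update(set(words))  # one count per document
    return tp, fp, pos, neg


def entropy(a, b):
    """Binary entropy, in bits, of a split with counts a and b."""
    total = a + b
    return -sum((c / total) * math.log2(c / total) for c in (a, b) if c)


def information_gain(tp, fp, pos, neg):
    """Reduction in class entropy from knowing whether the word is present."""
    fn, tn = pos - tp, neg - fp
    n = pos + neg
    return (entropy(pos, neg)
            - ((tp + fp) / n) * entropy(tp, fp)
            - ((fn + tn) / n) * entropy(fn, tn))


def chi_squared(tp, fp, pos, neg):
    """Chi-squared statistic for the 2x2 table of word presence versus class."""
    fn, tn = pos - tp, neg - fp
    n = pos + neg
    denom = (tp + fp) * (fn + tn) * pos * neg
    return n * (tp * tn - fp * fn) ** 2 / denom if denom else 0.0


# Hypothetical training set: three positive and four negative documents.
docs = [["free", "offer"], ["free", "prize"], ["hello", "prize"],
        ["meeting", "agenda"], ["free", "lunch"], ["project", "agenda"],
        ["hello", "team"]]
labels = [True, True, True, False, False, False, False]

tp, fp, pos, neg = count_features(docs, labels)
for word in ("free", "hello"):
    ig = information_gain(tp[word], fp[word], pos, neg)
    chi = chi_squared(tp[word], fp[word], pos, neg)
    print(f"{word}: tp={tp[word]} fp={fp[word]} IG={ig:.4f} chi2={chi:.4f}")
```

Both scores are computed from tp, fp, pos, and neg alone, consistent with the statement that these counts are sufficient statistics. In this hypothetical data, “free” scores higher than “hello” under both measures.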