Text classification techniques are known in the art. Attention is directed, for example, to U.S. Pat. No. 6,182,058 to Kohavi; U.S. Pat. No. 6,278,464 to Kohavi et al.; U.S. Pat. No. 6,212,532 to Johnson et al.; U.S. Pat. No. 6,192,360 to Dumais et al.; and U.S. Pat. No. 6,038,527 to Renz, all of which are incorporated herein by reference. These patents discuss classification systems and methods.
Many data mining or electronic commerce tasks involve classification of data (e.g., text or other data) into classes. For example, e-mail could be classified as spam (junk e-mail) or non-spam. Mortgage applications could be classified as approved or denied. An item to be auctioned could be classified into one of multiple possible categories on an on-line auction web site such as eBay™. Text classification may be used by an information service to route textual news items covering business, sports teams, stocks, or companies to people having specific interests.
There are several types of classification methods. Some classification methods are rule-based methods. A rule of the form IF {condition} THEN {classification} could be used. However, rule-based systems can become unwieldy when the number of input features becomes large or the logic for the rules becomes complex.
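By way of illustration only, a rule of the form IF {condition} THEN {classification} might be sketched as follows for the spam example given above. The function name, the input fields, and the particular conditions are hypothetical and are not drawn from any of the referenced patents; they merely show how each additional condition must be written and maintained by hand.

```python
def classify_email(subject: str, sender_known: bool) -> str:
    """Hypothetical hand-written rules: IF {condition} THEN {classification}."""
    text = subject.lower()
    # IF the subject contains a suspicious phrase AND the sender is unknown THEN spam
    if "free money" in text and not sender_known:
        return "spam"
    # IF the subject is written entirely in capital letters THEN spam
    if subject.isupper():
        return "spam"
    # Otherwise, default classification
    return "non-spam"


print(classify_email("FREE MONEY now", sender_known=False))
print(classify_email("Meeting at noon", sender_known=True))
```

Each new input feature tends to add further conditions and interactions among them, which is one way such a rule set can become unwieldy.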
Some classification methods involve use of classifiers that have learning abilities. A classifier is typically constructed using an inducer. An inducer is an algorithm that builds the classifier using a training set comprising records with labels. After the classifier is built, it can be used to classify unlabeled records. A record is also known as a “feature vector,” “example,” or “case.”
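The relationship between an inducer, a training set, and the resulting classifier might be sketched as follows. This is a minimal illustration only, assuming a simple multinomial Naïve Bayes model with add-one smoothing; the function names and the four-record training set are hypothetical, and practical inducers are considerably more elaborate.

```python
import math
from collections import Counter, defaultdict


def induce_naive_bayes(training_set):
    """Inducer: builds a classifier from labeled records of the form (words, label)."""
    label_counts = Counter()            # number of training records per label
    word_counts = defaultdict(Counter)  # word frequencies per label
    vocabulary = set()
    for words, label in training_set:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    total_records = sum(label_counts.values())

    def classify(words):
        """Classifier: returns the most probable label for an unlabeled record."""
        best_label, best_score = None, float("-inf")
        for label, count in label_counts.items():
            score = math.log(count / total_records)  # log prior
            denom = sum(word_counts[label].values()) + len(vocabulary)
            for w in words:
                if w in vocabulary:  # words never seen in training are ignored
                    score += math.log((word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return classify


# Hypothetical training set comprising records with labels
training_set = [
    (["cheap", "pills", "offer"], "spam"),
    (["offer", "expires", "cheap"], "spam"),
    (["meeting", "agenda", "monday"], "non-spam"),
    (["project", "meeting", "notes"], "non-spam"),
]
classifier = induce_naive_bayes(training_set)
print(classifier(["cheap", "offer"]))
print(classifier(["meeting", "notes"]))
```

Here the inducer runs once over the labeled records, and the classifier it returns can then be applied to any number of unlabeled records.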
The term “feature selection” refers to deciding which features are the most predictive indicators to use for training a classifier. In some embodiments, “feature selection” refers to selecting the words or features that score highest on Information Gain, Odds Ratio, Chi-Squared, or another statistic. Well-chosen features can substantially improve classification accuracy or reduce the amount of training data needed to obtain a desired level of performance. Avoiding insignificant features improves scalability and conserves computation and storage resources for the training phase and post-training use of the classifier. Conversely, poor feature selection limits performance, since no degree of clever induction can make up for a lack of prediction signal in the input features sent to the classifier.
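As a rough sketch of scoring-based feature selection, the following example ranks word features by Information Gain, computed here as the reduction in class entropy obtained by knowing whether a word is present in a document. The function names, the binary present/absent treatment of words, and the four-document data set are assumptions made for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(documents, word):
    """Class entropy minus the expected class entropy given presence/absence of word."""
    labels = [label for _, label in documents]
    present = [label for words, label in documents if word in words]
    absent = [label for words, label in documents if word not in words]
    n = len(documents)
    conditional = (len(present) / n) * entropy(present) + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def select_features(documents, k):
    """Return the k highest-scoring words; ties are broken alphabetically."""
    vocabulary = {w for words, _ in documents for w in words}
    ranked = sorted(vocabulary, key=lambda w: (-information_gain(documents, w), w))
    return ranked[:k]


# Hypothetical labeled documents, each represented as a set of words
documents = [
    ({"cheap", "offer", "the"}, "spam"),
    ({"cheap", "pills", "the"}, "spam"),
    ({"meeting", "agenda", "the"}, "non-spam"),
    ({"meeting", "notes", "the"}, "non-spam"),
]
print(select_features(documents, 2))
```

In this data set a word such as "the", which appears in every document, carries no prediction signal and scores zero, whereas "cheap" and "meeting" separate the two classes perfectly and are selected.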
Known methods for feature selection include Information Gain (IG), Odds Ratio, Chi-Squared, and others. Known classifiers include Support Vector Machines and Naïve Bayes.
The problem of selecting which features to use in a text classification problem where there are multiple categories can be more difficult than in two-category classification systems. For example, given a machine learning problem of classifying each new item to be auctioned into the most appropriate category in the large set of an auction web site's categories, it is desirable to decide which word features are the most predictive indicators to use for training a classifier.