§ 1.1 Field of the Invention
The present invention concerns determining whether an object, such as a textual information object for example, belongs to a particular category or categories. This determination is made by a classifier, such as a text classifier for example. The present invention also concerns building a (text) classifier by determining appropriate parameters for the (text) classifier.
§ 1.2 Related Art
§ 1.2.1 THE NEED FOR TEXT CLASSIFICATION
To increase their utility and intelligence, machines, such as computers for example, are called upon to classify (or recognize) objects to an ever increasing extent. For example, computers may use optical character recognition to classify handwritten or scanned numbers and letters, pattern recognition to classify an image, such as a face, a fingerprint, a fighter plane, etc., or speech recognition to classify a sound, a voice, etc.
Machines have also been called upon to classify textual information objects, such as a textual computer file or document for example. The applications for text classification are diverse and important. For example, text classification may be used to organize textual information objects into a hierarchy of predetermined classes or categories. In this way, finding (or navigating to) textual information objects related to a particular subject matter is simplified. Text classification may be used to route appropriate textual information objects to appropriate people or locations. In this way, an information service can route textual information objects covering diverse subject matters (e.g., business, sports, the stock market, football, a particular company, a particular football team) to people having diverse interests. Text classification may be used to filter textual information objects so that a person is not annoyed by unwanted textual content (such as unwanted and unsolicited e-mail, also referred to as junk e-mail, or "spam"). As can be appreciated from these few examples, there are many exciting and important applications for text classification.
§ 1.2.2 KNOWN TEXT CLASSIFICATION METHODS
In this section, some known classification methods are introduced. Further, acknowledged or suspected limitations of these classification methods are introduced. First, rule-based classification is discussed in § 1.2.2.1 below. Then, classification systems which use both learning elements and performance elements are discussed in § 1.2.2.2 below.
§ 1.2.2.1 RULE BASED CLASSIFICATION
In some instances, textual content must be classified with absolute certainty, based on certain accepted logic. A rule-based system may be used to effect such types of classification. Basically, rule-based systems use production rules of the form:
IF condition, THEN fact.
The conditions may include whether the textual information includes certain words or phrases, has a certain syntax, or has certain attributes. For example, if the textual content has the word "close", the phrase "nasdaq" and a number, then it is classified as "stock market" text.
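The production rule in the example above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the regular expressions, and the "unclassified" fallback label are assumptions made here and do not come from any particular rule-based system.

```python
import re


def is_stock_market_text(text: str) -> bool:
    """IF the text has the word 'close', the phrase 'nasdaq' and a number,
    THEN it is 'stock market' text (hypothetical rule from the example)."""
    lowered = text.lower()
    has_close = re.search(r"\bclose\b", lowered) is not None
    has_nasdaq = re.search(r"\bnasdaq\b", lowered) is not None
    has_number = re.search(r"\d", lowered) is not None
    return has_close and has_nasdaq and has_number


def classify(text: str) -> str:
    # A single rule; a real system would chain many such rules.
    return "stock market" if is_stock_market_text(text) else "unclassified"


print(classify("The Nasdaq composite is expected to close at 1850 today."))
print(classify("The shop will close early."))
```

Each additional class or feature requires further hand-written conditions, which hints at why such systems grow unwieldy.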
Unfortunately, in many instances, rule-based systems become unwieldy, particularly in instances where the number of measured or input values (or features or characteristics) becomes large, logic for combining conditions or rules becomes complex, and/or the number of possible classes becomes large. Since textual information may have many features and complex semantics, these limitations of rule-based systems make them inappropriate for classifying text in all but the simplest applications.
Over the last decade or so, other types of classifiers have been used increasingly. Although these classifiers do not use static, predefined logic, as do rule-based classifiers, they have outperformed rule-based classifiers in many applications. Such classifiers are discussed in § 1.2.2.2 below and typically include a learning element and a performance element. Such classifiers may include neural networks, Bayesian networks, and support vector machines. Although each of these classifiers is known, each is briefly introduced below for the reader's convenience.
§ 1.2.2.2 CLASSIFIERS HAVING LEARNING AND PERFORMANCE ELEMENTS
As just mentioned at the end of the previous section, classifiers having learning and performance elements outperform rule-based classifiers in many applications. To reiterate, these classifiers may include neural networks (introduced in § 1.2.2.2.1 below for the reader's convenience), Bayesian networks (introduced in § 1.2.2.2.2 below for the reader's convenience), and support vector machines (introduced in § 1.2.2.2.3 below for the reader's convenience).
§ 1.2.2.2.1 NEURAL NETWORKS
A neural network is basically a multilayered, hierarchical arrangement of identical processing elements, also referred to as neurons. Each neuron can have one or more inputs but only one output. Each neuron input is weighted by a coefficient. The output of a neuron is typically a function of the sum of its weighted inputs and a bias value. This function, also referred to as an activation function, is typically a sigmoid function. That is, the activation function may be S-shaped, monotonically increasing and asymptotically approaching fixed values (e.g., +1, 0, -1) as its input(s) respectively approaches positive or negative infinity. The sigmoid function and the individual neural weight and bias values determine the response or "excitability" of the neuron to input signals.
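The behavior of a single neuron described above may be sketched as follows. The logistic function is assumed here as the sigmoid; it approaches 0 and 1 at the extremes, whereas other S-shaped choices (e.g., the hyperbolic tangent) approach -1 and +1. The function and variable names are illustrative only.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic activation: S-shaped, monotonically increasing, in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias):
    """Output of one neuron: activation of the weighted input sum plus bias."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(weighted_sum)


# The weighted sum here is 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0.
print(neuron_output([1.0, 2.0], [0.5, -0.25], 0.0))
```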
In the hierarchical arrangement of neurons, the output of a neuron in one layer may be distributed as an input to one or more neurons in a next layer. A typical neural network may include three (3) distinct layers; namely, an input layer, an intermediate neuron layer, and an output neuron layer. Note that the nodes of the input layer are not neurons. Rather, the nodes of the input layer have only one input and basically provide the input, unprocessed, to the inputs of the next layer. If, for example, the neural network were to be used for recognizing a numerical digit character in a 20 by 15 pixel array, the input layer could have 300 nodes (i.e., one for each pixel of the input) and the output layer could have 10 neurons (i.e., one for each of the ten digits).
The use of neural networks generally involves two (2) successive steps. First, the neural network is initialized and trained on known inputs having known output values (or classifications). Once the neural network is trained, it can then be used to classify unknown inputs. The neural network may be initialized by setting the weights and biases of the neurons to random values, typically generated from a Gaussian distribution. The neural network is then trained using a succession of inputs having known outputs (or classes). As the training inputs are fed to the neural network, the values of the neural weights and biases are adjusted (e.g., in accordance with the known back-propagation technique) such that the output of the neural network for each individual training pattern approaches or matches the known output. Basically, a gradient descent in weight space is used to minimize the output error. In this way, learning using successive training inputs converges towards a locally optimal solution for the weights and biases. That is, the weights and biases are adjusted to minimize an error.
In practice, the system is not trained to the point where it converges to an optimal solution. Otherwise, the system would be "over trained" such that it would be too specialized to the training data and might not be good at classifying inputs which differ, in some way, from those in the training set. Thus, at various times during its training, the system is tested on a set of validation data. Training is halted when the system's performance on the validation set no longer improves.
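The training procedure described above (Gaussian initialization, gradient descent on the output error, and halting when validation performance stops improving) may be sketched for a single sigmoid neuron as follows. This is a minimal sketch under stated assumptions, not a full back-propagation implementation: the learning rate, the `patience` parameter, the squared-error measure, and the toy data are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def mean_squared_error(weights, bias, examples):
    return sum((predict(weights, bias, x) - y) ** 2 for x, y in examples) / len(examples)


def train(train_set, validation_set, learning_rate=0.5, max_epochs=1000,
          patience=10, seed=0):
    rng = random.Random(seed)
    # Initialize weights and bias to small random Gaussian values.
    weights = [rng.gauss(0.0, 0.1) for _ in train_set[0][0]]
    bias = rng.gauss(0.0, 0.1)
    best_error = mean_squared_error(weights, bias, validation_set)
    best_weights, best_bias = list(weights), bias
    stale_epochs = 0
    for _ in range(max_epochs):
        for x, y in train_set:
            out = predict(weights, bias, x)
            # Gradient of 0.5 * (out - y)^2 with respect to the weighted sum.
            delta = (out - y) * out * (1.0 - out)
            weights = [w - learning_rate * delta * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * delta
        error = mean_squared_error(weights, bias, validation_set)
        if error < best_error - 1e-9:
            best_error = error
            best_weights, best_bias = list(weights), bias
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break  # Validation performance no longer improves: halt.
    return best_weights, best_bias


# Toy one-feature data (hypothetical): small inputs are class 0, large are class 1.
train_set = [([0.0], 0), ([1.0], 1), ([0.2], 0), ([0.9], 1)]
validation_set = [([0.1], 0), ([0.8], 1)]
weights, bias = train(train_set, validation_set)
print(predict(weights, bias, [0.0]), predict(weights, bias, [1.0]))
```

Returning the parameters that scored best on the validation set, rather than the last ones seen, is one common way to realize the halting rule.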
Once training is complete, the neural network can be used to classify unknown inputs in accordance with the weights and biases determined during training. If the neural network can classify the unknown input with confidence, one of the outputs of the neurons in the output layer will be much higher than the others.
To ensure that the weight and bias terms do not diverge, the algorithm uses small steps. Furthermore, the back propagation (gradient descent) technique used to train neural networks is relatively slow. (See, e.g., the article: Schutze, et al., "A Comparison of Classifiers and Document Representations for the Routing Problem", International ACM SIGIR Conference on Research and Development in Information Retrieval, Section 5 (1995); hereafter referred to as "the Schutze article".) Consequently, convergence is slow. Also, the number of neurons in the hidden layer cannot easily be determined a priori. Consequently, multiple time-consuming experiments are often run to determine the optimal number of hidden neurons.
§ 1.2.2.2.2 BAYESIAN NETWORKS
Having introduced neural networks above, Bayesian networks are now briefly introduced. Typically, Bayesian networks use hypotheses as intermediaries between data (e.g., input feature vectors) and predictions (e.g., classifications). The probability of each hypothesis, given the data ($P(\text{hypo} \mid \text{data})$), may be estimated. A prediction is made from the hypotheses using posterior probabilities of the hypotheses to weight the individual predictions of each of the hypotheses. The probability of a prediction $X$ given data $D$ may be expressed as:
$$P(X \mid D) = \sum_i P(X \mid H_i)\, P(H_i \mid D)$$
where $H_i$ is the $i$-th hypothesis. A most probable hypothesis $H_i$ that maximizes the probability of $H_i$ given $D$ ($P(H_i \mid D)$) is referred to as a maximum a posteriori hypothesis (or "$H_{MAP}$") and may be used to approximate the prediction as follows:
$$P(X \mid D) \approx P(X \mid H_{MAP})$$
Using Bayes' rule, the probability of a hypothesis $H_i$ given data $D$ may be expressed as:
$$P(H_i \mid D) = \frac{P(D \mid H_i)\, P(H_i)}{P(D)}$$
The probability of the data D remains fixed. Therefore, to find H.sub.MAP, the numerator must be maximized.
The first term of the numerator represents the probability that the data would have been observed given the hypothesis i. The second term represents the prior probability assigned to the given hypothesis i.
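The relationship between the full weighted prediction and its maximum a posteriori approximation may be illustrated numerically. In the sketch below, the three hypotheses and all probability values are hypothetical numbers chosen only for illustration.

```python
def posterior(priors, likelihoods):
    """Bayes' rule: P(H|D) = P(D|H) * P(H) / P(D), for each hypothesis H."""
    numerators = {h: likelihoods[h] * priors[h] for h in priors}
    evidence = sum(numerators.values())  # P(D), the same for every hypothesis
    return {h: value / evidence for h, value in numerators.items()}


def map_hypothesis(priors, likelihoods):
    """P(D) is fixed, so maximizing the numerator alone finds the MAP hypothesis."""
    return max(priors, key=lambda h: likelihoods[h] * priors[h])


priors = {"h1": 0.5, "h2": 0.3, "h3": 0.2}        # P(H_i), hypothetical
likelihoods = {"h1": 0.1, "h2": 0.4, "h3": 0.2}   # P(D | H_i), hypothetical
predictions = {"h1": 0.9, "h2": 0.2, "h3": 0.5}   # P(X | H_i), hypothetical

post = posterior(priors, likelihoods)
h_map = map_hypothesis(priors, likelihoods)

# Full prediction: each hypothesis's prediction weighted by its posterior.
full_prediction = sum(predictions[h] * post[h] for h in post)
# MAP approximation: use only the most probable hypothesis.
map_prediction = predictions[h_map]

print(h_map, full_prediction, map_prediction)
```

With these numbers the two predictions differ noticeably, which shows that the approximation trades accuracy for simplicity when the posterior is not sharply peaked.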
A Bayesian network includes variables and directed edges between the variables, thereby defining a directed acyclic graph (or "DAG"). Each variable can assume any of a finite number of mutually exclusive states. For each variable $A$, having parent variables $B_1, \ldots, B_n$, there is an attached probability table $P(A \mid B_1, \ldots, B_n)$. The structure of the Bayesian network encodes the assumption that each variable is conditionally independent of its non-descendants, given its parent variables.
Assuming that the structure of the Bayesian network is known and the variables are observable, only the set of conditional probability tables need be learned. These tables can be estimated directly using statistics from a set of learning examples. If the structure is known but some variables are hidden, learning is analogous to neural network learning discussed above.
An example of a simple Bayesian network is introduced below. A variable "MML" may represent a "moisture of my lawn" and may have states "wet" and "dry". The MML variable may have "rain" and "my sprinkler on" parent variables each having "Yes" and "No" states. Another variable, "MNL" may represent a "moisture of my neighbor's lawn" and may have states "wet" and "dry". The MNL variable may share the "rain" parent variable. In this example, a prediction may be whether my lawn is "wet" or "dry". This prediction may depend on the hypotheses (i) if it rains, my lawn will be wet with probability ($x_1$) and (ii) if my sprinkler was on, my lawn will be wet with probability ($x_2$). The probability that it has rained or that my sprinkler was on may depend on other variables. For example, if my neighbor's lawn is wet and they don't have a sprinkler, it is more likely that it has rained. An example of a Bayesian network associated with the "wet grass" problem is presented in the text: Jensen, An Introduction to Bayesian Networks, pp. 22-25, Springer-Verlag, New York (1997).
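The lawn example can be made concrete with a small enumeration over the four variables. All probability values in the sketch below are hypothetical, chosen only to illustrate how observing the neighbor's wet lawn raises the probability of rain; they are not taken from the Jensen text.

```python
from itertools import product

# Hypothetical prior probabilities of the parent variables.
P_RAIN = 0.2
P_SPRINKLER = 0.1
# Hypothetical conditional tables: P(MML = wet | rain, sprinkler) and P(MNL = wet | rain).
P_MML_WET = {(True, True): 0.99, (True, False): 0.9,
             (False, True): 0.9, (False, False): 0.0}
P_MNL_WET = {True: 0.9, False: 0.05}


def joint(rain, sprinkler, mml_wet, mnl_wet):
    """Joint probability as the product of each variable given its parents."""
    p = P_RAIN if rain else 1.0 - P_RAIN
    p *= P_SPRINKLER if sprinkler else 1.0 - P_SPRINKLER
    p_mml = P_MML_WET[(rain, sprinkler)]
    p *= p_mml if mml_wet else 1.0 - p_mml
    p_mnl = P_MNL_WET[rain]
    p *= p_mnl if mnl_wet else 1.0 - p_mnl
    return p


def prob_rain_given(evidence):
    """P(rain | evidence); evidence may fix 'mml_wet' and/or 'mnl_wet'."""
    rain_mass = total_mass = 0.0
    for rain, sprinkler, mml_wet, mnl_wet in product([True, False], repeat=4):
        state = {"mml_wet": mml_wet, "mnl_wet": mnl_wet}
        if any(state[name] != value for name, value in evidence.items()):
            continue  # Skip states inconsistent with the observations.
        p = joint(rain, sprinkler, mml_wet, mnl_wet)
        total_mass += p
        if rain:
            rain_mass += p
    return rain_mass / total_mass


only_mine = prob_rain_given({"mml_wet": True})
both_lawns = prob_rain_given({"mml_wet": True, "mnl_wet": True})
print(only_mine, both_lawns)
```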
As discussed above, the conditional probability tables in Bayesian networks may be trained, as was the case with neural networks. Advantageously, by allowing prior knowledge to be provided for, the learning process may be shortened. Unfortunately, however, prior probabilities for the conditional probabilities are usually unknown, in which case a uniform prior is used.
§ 1.2.2.2.3 SUPPORT VECTOR MACHINES
Support vector machines (or "SVMs") are another type of trainable classifier. SVMs are reportedly more accurate at classification than naive Bayesian networks in certain applications, such as text classification for example. (See, e.g., the article, Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", LS-8, Report 23, University of Dortmund Computer Science Department (November 1997).) They are also reportedly more accurate than neural networks in certain applications, such as reading handwritten characters for example. (See, e.g., the article, LeCun et al., "Learning Algorithms for Classification: A Comparison on Handwritten Digit Recognition," Neural Networks: The Statistical Mechanics Perspective, Oh et al. (Eds.), pp. 261-276, World Scientific (1995).) Unfortunately, however, SVMs reportedly take longer to train than naive Bayesian classifiers. A new method and apparatus to build (or train) SVMs in an efficient manner is disclosed in U.S. patent application Ser. No. 09/055,477, by John Platt, entitled "Methods and Apparatus for Building a Support Vector Classifier", filed on Apr. 6, 1998 and incorporated by reference.
Although SVMs are known to those skilled in the art, their theory and operation will be introduced for the reader's convenience.
An object to be classified may be represented by a number of features. If, for example, the object to be classified is represented by two (2) features, it may be represented by a point in two (2) dimensional space. Similarly, if the object to be classified is represented by n features, also referred to as a "feature vector", it may be represented by a point in n-dimensional space. The simplest form of an SVM defines a plane in the n-dimensional space (also referred to as a hyperplane) which separates feature vector points associated with objects "in a class" and feature vector points associated with objects "not in the class". A number of classes can be defined by defining a number of hyperplanes. The hyperplane defined by a trained SVM maximizes a distance (also referred to as a Euclidean distance) from it to the closest points (also referred to as "support vectors") "in the class" and "not in the class". A hyperplane is sought which maximizes the distances between the support vectors and the hyperplane, so that the SVM defined by the hyperplane is robust to input noise. The hyperplane (or hypersurface) is defined by a training process; some such processes are discussed in § 4.2.1.4.1 below.
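The geometric idea above may be illustrated by comparing two candidate hyperplanes that both separate the same points. The sketch below does not train an SVM; it only measures the distance from each hyperplane to its closest point. The hyperplane form w.x - b = 0, the sample points, and the function names are assumptions made for illustration.

```python
import math


def signed_distance(w, b, x):
    """Signed Euclidean distance from point x to the hyperplane w.x - b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) - b) / norm


def in_class(w, b, x):
    """Points on the positive side of the hyperplane are taken as 'in the class'."""
    return signed_distance(w, b, x) > 0


def margin(w, b, positives, negatives):
    """Distance from the hyperplane to its closest point
    (negative if some point lies on the wrong side)."""
    distances = [signed_distance(w, b, x) for x in positives]
    distances += [-signed_distance(w, b, x) for x in negatives]
    return min(distances)


positives = [(3.0, 3.0), (4.0, 2.0)]   # hypothetical "in the class" points
negatives = [(0.0, 0.0), (1.0, 0.0)]   # hypothetical "not in the class" points

margin_a = margin((1.0, 1.0), 3.5, positives, negatives)
margin_b = margin((1.0, 0.0), 2.0, positives, negatives)
print(margin_a, margin_b)
```

Both hyperplanes classify all four points correctly, but the first leaves more room around its closest points and so would tolerate more input noise.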
§ 1.2.2.2.4 CHALLENGES TO USING CLASSIFIERS HAVING LEARNING AND PERFORMANCE ELEMENTS FOR TEXT CLASSIFICATION
Although, as discussed above, rule-based classifiers are feasible in only the simplest text classification applications, some significant challenges exist when using systems having learning and performance elements (also referred to as "learning machines") for text classification. Some of the more significant challenges will be introduced in §§ 1.2.2.2.4.1 through 1.2.2.2.4.3 below.
§ 1.2.2.2.4.1 FEATURE VECTOR SIZE
When training learning machines for text classification, a set of learning examples is used. Each learning example includes a vector of features associated with a textual information object. In some applications, such feature vectors may have on the order of $10^8$ features. A large number of features can easily be generated by considering the presence or absence of a word in a document to be a feature. If all of the words in a corpus are considered as possible features, then there can be millions of unique features. For example, web pages have many unique strings and can generate millions of features. An even larger number of features are possible if pairs or more general combinations of words or phrases are considered, or if the frequency of occurrence of words is considered. The number of features in a feature vector may be reduced by so-called "feature reduction" or "feature selection" methods such that a reduced feature vector, having a subset of the features of the original feature vector, is produced. Indeed, some believe that for learning machine text classifiers to be feasible, feature selection is needed.
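The word-presence features and the reduced feature vector described above may be sketched as follows. The whitespace tokenizer, the tiny corpus, and the hand-picked set of retained words are simplifying assumptions made for illustration.

```python
def tokenize(text):
    # Simplifying assumption: lowercase the text and split on whitespace.
    return text.lower().split()


def build_vocabulary(corpus):
    """Every distinct word in the corpus becomes one possible feature."""
    return sorted({word for document in corpus for word in tokenize(document)})


def feature_vector(document, vocabulary):
    """1 if the word is present in the document, 0 if it is absent."""
    present = set(tokenize(document))
    return [1 if word in present else 0 for word in vocabulary]


def reduced_feature_vector(vector, vocabulary, selected_words):
    """Keep only the features whose words were selected."""
    return [value for word, value in zip(vocabulary, vector) if word in selected_words]


corpus = ["stocks close higher", "team wins football match", "football stocks"]
vocabulary = build_vocabulary(corpus)
full = feature_vector("football team wins", vocabulary)
reduced = reduced_feature_vector(full, vocabulary, {"football", "stocks"})
print(vocabulary)
print(full, reduced)
```

Even this three-document corpus yields seven features, which suggests how quickly the vocabulary grows for a realistic corpus.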
Known feature selection techniques include: DF-Thresholding (See, e.g., the paper: Yang and Peterson, "A Comparative Study on Feature Selection in Text Categorization," International Conference on Machine Learning (1997); hereafter referred to as "the Yang-Peterson article"); the Chi-Squared Test (See, e.g., the Schutze article); the Term Strength Criterion (See, e.g., the article: Yang and Wilbur, "Using Corpus Statistics to Remove Redundant Words in Text Categorization," Journal of the American Society for Information Science, Vol. 47, No. 5, pp. 357-369 (1996); hereafter referred to as "the Yang-Wilbur article"); Information Gain Criteria (See, e.g., the Yang-Peterson article); the correlation coefficient, which is the square root of the Chi-Squared measure (See, e.g., the article: Ng, et al., "Feature Selection, Perceptron Learning, and a Usability Case Study for Text Categorization", Proceedings of SIGIR'97, pp. 67-73 (1997)); and Latent Semantic Indexing (or "LSI") using Singular Value Decomposition (or "SVD"), which is a technique that represents features by a low-dimensional linear combination of orthogonal indexing variables (See, e.g., the Schutze article).
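Of the techniques listed above, document frequency (DF) thresholding is the simplest to sketch: a word is kept as a feature only if it appears in at least a minimum number of documents. The sketch below is illustrative only; the threshold value, the whitespace tokenizer, and the sample corpus are assumptions.

```python
from collections import Counter


def document_frequencies(corpus):
    """Count, for each word, the number of documents that contain it."""
    frequencies = Counter()
    for document in corpus:
        # A set ensures each word is counted at most once per document.
        frequencies.update(set(document.lower().split()))
    return frequencies


def select_by_document_frequency(corpus, min_df):
    """Keep the words whose document frequency is at least min_df."""
    frequencies = document_frequencies(corpus)
    return sorted(word for word, count in frequencies.items() if count >= min_df)


corpus = ["stocks close higher", "football stocks rally",
          "football team wins", "stocks fall"]
selected = select_by_document_frequency(corpus, min_df=2)
print(selected)
```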
However, some have concluded that feature selection should be avoided. For example, the Joachims article, which espouses the use of SVMs for text classification, concludes that feature selection is likely to hurt performance due to a loss of information. (See, e.g., the Joachims article, page 3.) The Joachims article further concludes that since SVMs can generalize well in high dimensional feature spaces, they eliminate the need for feature selection thereby making text categorization easier. (See, e.g., the Joachims article, page 11.) Thus, while some in the art have found feature selection useful, when SVMs are used for text classification, feature selection has been avoided in some instances.
§ 1.2.2.2.4.2 OVERFITTING
When a learning machine is trained, it is trained based on training examples from a set of feature vectors. In general, performance of a learning machine will depend, to some extent, on the number of training examples used to train it. Even if there are a large number of training examples, there may be a relatively low number of training examples which belong to certain categories.
Many learning machines utilize sigmoid functions (that is, s-shaped, monotonic functions) to determine an output (e.g., a determination as to whether an object is in a category or not) based on an input (e.g., a feature vector of an unknown object). Unfortunately, when there is little training data available for a given category, the parameters of the sigmoid function are not well defined. Consequently, when there is little training data available for a given category, the sigmoid function can be overfit to past data. Such overfitting can lead to unwarranted confidence in the output of the learning machine and should be avoided.
§ 1.2.2.2.4.3 CLASSIFICATION SPEED
Training time and response (e.g., classification) time are two (2) important characteristics of a learning machine. For example, even those that espouse the use of learning machines, and SVMs in particular, for classifying text concede that the training time of SVMs is longer than that of other methods. This challenge is addressed in the U.S. patent application Ser. No. 09/055,477, by John Platt, entitled "Methods and Apparatus for Building a Support Vector Classifier", filed on Apr. 6, 1998 and assigned to a common assignee (incorporated by reference) and will not be addressed in detail here.
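There remains the issue of response (or classification) time. The present inventors believe that the SVM classifier discussed in the Joachims paper classifies unknown objects in accordance with the following expression:
$$O = \sum_{i=1}^{nte} \alpha_i\, y_i\, (\mathbf{x}_i \cdot \mathbf{x}_j) \qquad (1)$$
where:
$O$ = the classification output;
$nte$ = the number of training examples;
$\alpha_i$ = Lagrange multiplier of training example $i$;
$\mathbf{x}_i$ = feature vector of training example $i$;
$\mathbf{x}_j$ = feature vector of unknown object $j$; and
$y_i$ = known output of training example $i$.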
Using expression (1) to determine an output is relatively slow. Naturally, a faster classifier would be desirable.
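The cost of evaluating such an expression may be illustrated as follows, assuming expression (1) sums, over all training examples, the product of each example's Lagrange multiplier, its known output, and the dot product of its feature vector with that of the unknown object. Under that assumption, every classification requires one dot product per training example. For a plain dot product, the same output can be obtained from a single precomputed weight vector; this is a general algebraic identity, shown here only to illustrate where the time goes. All names and values below are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def classify_by_sum(alphas, ys, xs, x_unknown):
    """One dot product per training example: cost grows with nte."""
    return sum(alpha * y * dot(x, x_unknown) for alpha, y, x in zip(alphas, ys, xs))


def fold_weights(alphas, ys, xs):
    """Precompute w = sum_i alpha_i * y_i * x_i once, after training."""
    weights = [0.0] * len(xs[0])
    for alpha, y, x in zip(alphas, ys, xs):
        for k, value in enumerate(x):
            weights[k] += alpha * y * value
    return weights


def classify_by_weights(weights, x_unknown):
    """One dot product per classification, independent of nte."""
    return dot(weights, x_unknown)


alphas = [0.5, 0.0, 0.5]                   # hypothetical Lagrange multipliers
ys = [1, 1, -1]                            # known outputs of the training examples
xs = [[1.0, 2.0], [3.0, 1.0], [0.0, 1.0]]  # training feature vectors
x_unknown = [2.0, 1.0]

slow = classify_by_sum(alphas, ys, xs, x_unknown)
fast = classify_by_weights(fold_weights(alphas, ys, xs), x_unknown)
print(slow, fast)
```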
The Schutze article discusses a logistic regression classifier which employs a weight vector ($\beta$) derived using maximum likelihood and the Newton-Raphson method of numerical optimization. Although it is believed that this classifier is able to classify objects faster than the SVM classifier discussed in the Joachims paper, it is believed that the weight vector ($\beta$) determined does not provide the best classification results.