Given the vast amount of information accessible by computer systems, particularly on distributed databases, more efficient methods of managing information are continually needed. One general type of information management is exemplified by information filtering. In information filtering, a system retrieves or presents to a user a limited number of documents (or information) that satisfy the user's requirements. In one form of information filtering, documents are processed in a stream or in batches as they are delivered to the system, perhaps continuously. This contrasts with database searching, in which all the documents in the database are available at once and can be evaluated and ranked against one another before being presented to the user. In this form of filtering, documents are evaluated by the system one at a time or in small sets, and a binary decision to accept (i.e., present the document to the user) or reject (i.e., do not present the document to the user) is made for each individual document. In document filtering it is often the case that an information need is not expressed directly, but rather is presented to the system as a set of example documents that either satisfy or do not satisfy the information need. This set can be regarded as representing a topic, category, or class. The system must process this set of documents to derive an information need model. Typically, information needs can be represented as a binary document classifier model that can be learned from both the positive and negative examples. When making a filtering decision, the system compares the information in a document with the information contained in the derived (also referred to as the learnt) information model, which serves to represent the user's need or topic. In the literature, information filtering is also commonly known as text classification.
In this disclosure we use the term information need to characterize both information filtering and text classification, where the user of the system can be a human or another computer system.
Information needs can be represented computationally using a variety of representations known as models. Generally, the information management system computationally characterizes a document and an information need model using a vector space representation. Additional background relating to the generation of vector space models is disclosed in U.S. Pat. No. 5,619,709, incorporated herein by reference, and by Salton et al. in “A Vector Space Model For Information Retrieval”, Journal of the ASIS, 18:11, 613-620, November 1975.
Under the vector-space representation, documents can be conceptually viewed as a list of tuples (also known as a vector of features). The document vector D consists of a list of tuples of the form <fi, wi>, where fi denotes a feature (such as a word, a noun phrase, or any linguistically or otherwise derived feature) and wi denotes the weight associated with feature fi. Here, the weight value wi can simply be a binary value, where a value of 1 denotes that the feature occurred in the document and a value of 0 denotes that the feature did not occur in the document. Alternative weighting schemes, such as TF-IDF, which will be presented below, can also be used. Information needs can also be represented using a vector space representation.
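The binary weighting scheme described above can be illustrated with a minimal Python sketch. The function name, the whitespace tokenization, and the four-word vocabulary are assumptions made only for this illustration; they are not part of the disclosure.

```python
def binary_document_vector(tokens, vocabulary):
    """Build a document vector D as a list of <fi, wi> tuples.

    wi is 1 if feature fi occurs in the document and 0 otherwise.
    """
    present = set(tokens)
    return [(feature, 1 if feature in present else 0) for feature in vocabulary]


# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["filter", "document", "margin", "threshold"]
tokens = "the filter scores each document and the filter rejects some".split()

D = binary_document_vector(tokens, vocabulary)
print(D)
```

A feature that occurs more than once, such as "filter" here, still receives a weight of 1 under the binary scheme; a frequency-based scheme would distinguish such cases.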
Modeling information needs using automatic techniques is a very active area of research and application in information management. Recently, support vector machines (SVMs), which provide excellent performance on a variety of learning problems including handwritten character recognition and face detection, have been applied to modeling information needs. Their performance for modeling information needs, while competitive with other approaches (e.g., Rocchio, naïve Bayes) from a precision perspective, is not competitive from a recall perspective.
An SVM model can be viewed geometrically as a hyperplane (or hypersurface) that partitions two classes of objects in a multi-dimensional feature space into two disjoint subsets; in our case the hyperplane partitions documents into a positive set corresponding to documents that satisfy an information need and into a negative set corresponding to documents that do not satisfy an information need. Mathematically, a linear SVM (non-linear SVMs will be presented subsequently) can be represented in the following two equivalent forms: using a weight vector representation; or using a support vector representation. The weight vector representation mathematically represents an SVM (the separating hyperplane) as a pair of parameters <W, b>, where W denotes a weight vector and b represents a threshold or bias term. The weight vector W consists of a list of tuples of the form <fi, wi>, where fi denotes a feature and wi denotes the weight associated with feature fi. This corresponds to a vector space representation of the weight vector W. Here, the weight value wi associated with each feature fi and the threshold value b are learned from examples using standard SVM learning algorithms. This weight vector representation is also known as the primal representation. The support vector representation of an SVM model, also known as the dual representation, mathematically represents an SVM (the separating hyperplane) as a pair of parameters <SV, b>, where SV denotes a list of example tuples, known as support vectors, and b represents a threshold. The support vector list consists of tuples of the form <SVi, αi>, where SVi denotes an example document with known classification and αi denotes the weight associated with example SVi. The Euclidean (perpendicular) distance from the hyperplane to the support vectors is known as the margin of the support vector machine.
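The equivalence of the two representations can be sketched in Python as follows. The sketch assumes the standard textbook relation in which the weight vector is the sum of the support vectors, each scaled by its weight αi and its class label (+1 or -1). The feature names, the αi values, and the threshold are invented for illustration.

```python
def dot(u, v):
    """Dot product of two sparse vectors stored as {feature: weight} dicts."""
    return sum(w * v.get(f, 0.0) for f, w in u.items())


def primal_score(W, b, doc):
    """Score a document with the weight vector representation <W, b>."""
    return dot(W, doc) + b


def dual_score(support_vectors, b, doc):
    """Score a document with the support vector representation <SV, b>.

    Each entry is (SVi, label, alpha_i), with label +1 or -1.
    """
    return sum(alpha * label * dot(sv, doc)
               for sv, label, alpha in support_vectors) + b


def weights_from_support_vectors(support_vectors):
    """Collapse the dual form into a primal weight vector W."""
    W = {}
    for sv, label, alpha in support_vectors:
        for f, w in sv.items():
            W[f] = W.get(f, 0.0) + alpha * label * w
    return W


# Hypothetical model: two support vectors, one per class.
support_vectors = [
    ({"svm": 1.0, "margin": 1.0}, +1, 0.5),
    ({"recipe": 1.0, "margin": 1.0}, -1, 0.5),
]
b = -0.25
W = weights_from_support_vectors(support_vectors)

doc = {"svm": 1.0, "threshold": 1.0}
print(primal_score(W, b, doc), dual_score(support_vectors, b, doc))
```

Both calls yield the same score for any document, which is why a linear SVM can be stored in either form. The primal form is typically cheaper to evaluate because it needs one dot product rather than one per support vector.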
The parameters of the support vector machine model are determined using a learning algorithm in conjunction with a training data set that characterizes the information need, i.e., a list of documents that have been labeled as positive or negative. Some of these training algorithms are discussed below. Abstractly, learning a linear SVM model reduces to determining the position and orientation of the hyperplane that separates the positive examples from the negative examples used during learning. This in turn reduces to determining the parameters of the weight vector representation or the support vector representation. Learning a support vector machine can be viewed as both a constraint satisfaction problem and an optimization problem, where the first objective is to determine a hyperplane that classifies each labeled training example correctly, and the second objective is to determine the hyperplane that is furthest from the training data. An SVM learnt under both objectives is known as a hard SVM. Classifying an example using an SVM model reduces to determining on which side of the hyperplane the example falls. If the example falls on the positive side of the hyperplane then the example is assigned a positive label; otherwise it is assigned a negative label. Other flavors of SVM exist which relax the first objective. For example, not requiring all training examples to be classified correctly by the SVM leads to a flavor known as soft SVMs. In this case the SVM learning algorithm trades off the accuracy of the model against the margin of the model. Other types of SVMs and SVM learning algorithms also exist, some of which are discussed below.
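The classification rule and the hard/soft distinction can be made concrete with a short Python sketch. It assumes the common textbook soft-margin objective, 0.5·||W||² + C·Σ slack, in which each slack value measures by how much a training example violates the margin; the parameter C, the hyperplane, and the four examples are hypothetical.

```python
def score(W, b, x):
    """Signed score of example x relative to the hyperplane <W, b>."""
    return sum(w * xi for w, xi in zip(W, x)) + b


def classify(W, b, x):
    """Positive label on the positive side of the hyperplane, else negative."""
    return +1 if score(W, b, x) > 0 else -1


def slacks(W, b, examples):
    """Margin violation per example; all zeros means the hard constraints hold."""
    return [max(0.0, 1.0 - y * score(W, b, x)) for x, y in examples]


def soft_objective(W, b, examples, C):
    """Soft-margin objective: small ||W|| (large margin) versus few violations."""
    return 0.5 * sum(w * w for w in W) + C * sum(slacks(W, b, examples))


# Hypothetical hyperplane and labeled training examples (x, y).
W, b = [1.0, 0.0], 0.0
examples = [
    ([2.0, 1.0], +1),
    ([-2.0, 0.5], -1),
    ([0.5, 0.0], +1),   # correct side, but inside the margin
    ([0.5, 1.0], -1),   # wrong side of the hyperplane
]

print(slacks(W, b, examples))
print(soft_objective(W, b, examples, C=1.0))
```

A hard SVM would reject this hyperplane outright because two slack values are nonzero. A soft SVM accepts it at a cost, and a larger C makes such violations more expensive relative to a wider margin.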
Theoretically, the generalization error of SVMs (the error of a model on unseen data) can be upper-bounded as a function of the margin, i.e., the distance from the nearest data points (known as the support vectors) to the separating decision surface (hyperplane). As a result, SVM learning algorithms focus on finding the model that maximizes the margin. This approach to model selection is entirely data driven and provides state-of-the-art performance in dense data problems. However, in domains such as modeling information needs, which are sparse and in which classes are unevenly distributed and poorly represented, this approach to learning can lead to overfitting of the more frequent class. In other words, these learning algorithms lead to models with excellent precision but poor recall. Various approaches, known as relaxation approaches, have been proposed to counter this problem. These approaches can be characterized as follows: those that embed the relaxation strategy within the learning algorithm; those that treat relaxation as a wrapper around the learning step; and those that view relaxation as a post-learning thresholding step. We examine each category here, highlighting its main characteristics, its strengths and its weaknesses.
Embedded relaxation strategies build on soft SVMs (SVM models that balance low error rate with large margin), whereby different or asymmetric costs are associated with false negatives and false positives. To counter the imbalance of positive training documents to negative training documents, a higher cost is associated with the misclassification of positive documents than with that of negative documents, thereby resulting in an SVM model that balances the tradeoff between large margin and few errors (in particular, few positive misclassifications). This generally results in a hyperplane that differs in orientation and position from the separating hyperplane resulting from a hard SVM formulation. Tuning the misclassification cost asymmetrically can provide significant improvement, though this process can be prohibitively expensive.
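The effect of asymmetric costs can be sketched in Python. The sketch assumes a hinge-style error term weighted by a per-class cost; the names c_pos and c_neg, the scores, and the cost values are hypothetical and serve only to show how positive misclassifications come to dominate the penalty.

```python
def hinge(label, score):
    """Margin violation of one example with label +1 or -1."""
    return max(0.0, 1.0 - label * score)


def weighted_error(scored_examples, c_pos, c_neg):
    """Sum of margin violations, weighted by class-dependent cost."""
    total = 0.0
    for score, label in scored_examples:
        cost = c_pos if label > 0 else c_neg
        total += cost * hinge(label, score)
    return total


# Hypothetical (score, label) pairs produced by some candidate hyperplane.
scored_examples = [(0.5, +1), (-0.5, +1), (0.5, -1), (-2.0, -1)]

symmetric = weighted_error(scored_examples, c_pos=1.0, c_neg=1.0)
asymmetric = weighted_error(scored_examples, c_pos=4.0, c_neg=1.0)
print(symmetric, asymmetric)
```

With equal costs, the positive and negative violations contribute comparably. With c_pos set higher, a learner minimizing this term is pushed toward hyperplanes that misclassify fewer positive documents. Selecting the cost ratio usually requires repeated training, which is the source of the expense noted above.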
A second generic approach to relaxation is applied outside the learning step but uses the learning algorithm as the evaluation function to choose between different levels of relaxation. This category subsumes uneven margin-based learning, e.g., using positive and negative margins. The degree of the positive and negative margin is empirically determined using cross validation from a set of possible margins. Uneven margin-based learning results in separating hyperplanes that differ in orientation and position from the hyperplane resulting from a standard hard or soft SVM learning algorithm. Uneven margin-based learning has, to date, been limited to text document routing/ranking problems. Within this application field it has yielded state-of-the-art performance. This approach to relaxation can also prove to be expensive, as training is composed of cross validation, where each cross validation step requires the training of a separate SVM, and a final SVM training step (that uses the margin settings determined by cross validation).
The final relaxation category involves strategies that are applied after the SVM model has been learned. This category of approaches is based on thresholding the output value (or margin) of the learnt SVM. This is generally an inexpensive one-dimensional optimization problem that can lead to significant improvement in performance measures. This approach results in changing the position (or a translation) of the separating hyperplane but does not change the orientation of the hyperplane (that results from hard or soft SVM learning). It should be noted that this thresholding step is generic in nature. Therefore, we can replace our SVM, as a core model, with any of a number of popular types of information model, such as Rocchio models, or probabilistic models, whose outputs are document scores and probabilities, respectively.
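A minimal Python sketch of such a post-learning thresholding step follows. It assumes a linear utility that rewards each accepted positive document and penalizes each accepted negative document; the gain and cost values, the function names, and the scored examples are hypothetical.

```python
def utility(scored_examples, threshold, gain=2.0, cost=1.0):
    """Linear utility of accepting every document scoring at or above threshold."""
    accepted_pos = sum(1 for s, y in scored_examples if s >= threshold and y > 0)
    accepted_neg = sum(1 for s, y in scored_examples if s >= threshold and y <= 0)
    return gain * accepted_pos - cost * accepted_neg


def best_threshold(scored_examples, gain=2.0, cost=1.0):
    """One-dimensional search over the observed scores.

    Candidates are visited from highest to lowest, so ties go to the
    highest (most conservative) threshold.
    """
    candidates = sorted({s for s, _ in scored_examples}, reverse=True)
    return max(candidates, key=lambda t: utility(scored_examples, t, gain, cost))


# Hypothetical (score, label) pairs output by an already-learnt model.
scored_examples = [
    (1.2, +1), (0.4, -1), (0.1, +1), (-0.3, +1),
    (-0.6, -1), (-0.9, -1), (-1.5, -1),
]

t = best_threshold(scored_examples)
print(t, utility(scored_examples, 0.0), utility(scored_examples, t))
```

In this example the default threshold of 0 rejects a positive document that scores -0.3; lowering the threshold to -0.3 recovers it and raises the utility. Accepting when the score is at least t is the same as subtracting t from the bias term, which translates the hyperplane without rotating it. Because only scores are used, the same search applies to any model that outputs a score.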
The critical step in thresholding is to determine the value, known as the threshold, at which a decision changes from labeling a document as positive to labeling a document as negative with respect to an information need. Many of the approaches to thresholding that have been developed in other fields and for other purposes, such as information retrieval (IR), can be applied directly in thresholding the score output of SVMs. Though thresholding has received a lot of attention in the IR sub-field of adaptive filtering, optimizing thresholds remains a challenging problem. The main challenge arises from a lack of labeled training data. Due to limited amounts of training data, standard approaches to IR modeling use the same data for both model fitting (learning) and threshold optimization. Consequently, this often biases the threshold to high precision, i.e., the threshold overfits the training data. A beta-gamma thresholding algorithm, proposed to counter this problem, and its background are described in greater detail in U.S. Pat. Nos. 6,430,559, 6,463,434, 6,535,876, and 6,587,850, which are herein incorporated by reference. Other approaches have also been used. Yang proposed and presented various score-based thresholding algorithms for text categorization (See, e.g., the article, Yang Y., A study on thresholding strategies for text categorization, Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'01), 2001, 137-145). Zhang and Callan presented a maximum likelihood estimation of filtering thresholds (Y. Zhang and J. Callan. “YFilter at TREC-9”. In Proceedings of the Ninth Text REtrieval Conference (TREC-9), (pp. 135-140). National Institute of Standards and Technology, 2001, special publication 500-249).
Arampatzis proposed a score-distributional-based threshold optimization algorithm (See, e.g., the article, Arampatzis A., Unbiased S-D Threshold Optimization, Initial Query Degradation, Decay, and Incrementality, for Adaptive Document Filtering, Tenth Text Retrieval Conference (TREC-2001), 2002, 596-605). Ault and Yang proposed a margin-based local regression for predicting optimal thresholds for adaptive filtering (See, e.g., the article, Ault T., Yang Y., kNN, Rocchio and Metrics for Information Filtering at TREC-10, Tenth Text Retrieval Conference (TREC-2001), 2002, 84-93).
Some of these IR approaches have already been adapted for adjusting the threshold of an SVM. For example, Cancedda et al. (See, e.g., the article, Cancedda N. et al., Kernel Methods for Document Filtering, Eleventh Text Retrieval Conference (TREC-11), 2003) report one such approach to adjusting the threshold of SVMs based upon a Gaussian modeling process of the SVM scores (output values) for positive and negative documents for each category. This Gaussian model is then used to generate sample document scores, and an optimal threshold is set to the score corresponding to optimal utility. This approach, combined with asymmetric learning, has led to large improvements in recall and precision, though it is hard to discern how much of the improvement can be attributed to the asymmetric cost learning strategy and how much to the thresholding strategy.
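The general idea of score-distribution-based thresholding can be sketched in Python. This is not the procedure of Cancedda et al.: instead of generating sample scores, the sketch fits one Gaussian per class and evaluates the expected utility of each candidate threshold in closed form. The linear utility, the gain and cost values, the search grid, and the scores are all assumptions made for illustration.

```python
from statistics import NormalDist


def expected_utility(t, pos, neg, n_pos, n_neg, gain=2.0, cost=1.0):
    """Expected linear utility of accepting documents scoring above t.

    pos and neg are Gaussian models of the positive and negative scores.
    """
    expected_accepted_pos = n_pos * (1.0 - pos.cdf(t))
    expected_accepted_neg = n_neg * (1.0 - neg.cdf(t))
    return gain * expected_accepted_pos - cost * expected_accepted_neg


def gaussian_threshold(pos_scores, neg_scores, gain=2.0, cost=1.0):
    """Pick the grid threshold with the highest expected utility."""
    pos = NormalDist.from_samples(pos_scores)
    neg = NormalDist.from_samples(neg_scores)
    grid = [i / 100.0 for i in range(-200, 201)]  # assumed search range
    return max(grid, key=lambda t: expected_utility(
        t, pos, neg, len(pos_scores), len(neg_scores), gain, cost))


# Hypothetical SVM output scores for one category.
pos_scores = [0.9, 0.3, -0.1, 0.5]
neg_scores = [-1.2, -0.8, -1.0, -0.4, -1.6, -1.0]

t = gaussian_threshold(pos_scores, neg_scores)
print(t)
```

For these scores the selected threshold falls below zero, between the two class means, so the decision boundary shifts toward the negative class and recall improves. A smoothed model of this kind is less tied to the individual training scores than a direct search over them, at the price of assuming that the scores are approximately Gaussian.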
The advantage of using thresholding in combination with SVMs is that a user utility model can be naturally incorporated into the learning phase. Being able to incorporate a utility model into learning has been demonstrated to improve system performance as measured in terms of standard information retrieval measures (See, e.g., the article, Shanahan J. G., Roma N., Improving SVM Text Classification Performance through Threshold Adjustment, European Conference on Machine Learning (ECML) 2003).
It would therefore be advantageous to develop alternative threshold adjustment schemes that incorporate utility models for boosting the performance of SVMs.