Given the vast amount of information accessible by computer systems, particularly on distributed databases, more efficient methods of information retrieval are continually needed. One general type of retrieval is exemplified by information filtering. In information filtering, a system retrieves or presents to a user only the limited number of documents (or items of information) that satisfy the user's requirements. In one form of information filtering, documents are processed in a stream or in batches as they are delivered to the system, perhaps continuously. In database searching, all the documents in the database are available at once and can be evaluated and ranked against one another before being presented to the user. In this form of filtering, by contrast, documents are evaluated by the system one at a time or in small sets, and a binary decision to accept or reject is made for each individual document. In making such a decision, the system compares the information in a document with the information contained in an information profile, which serves to represent the user's need or topic.
Traditionally, an information profile (used as the basis for a binary document classifier) consists of three key elements: (1) a set of features, often in the form of a term vector, (2) a set of one or more weights or statistics associated with each feature, and (3) a scoring threshold. The first two elements are used by the system with a scoring function to assign a score to the document, and the third is used to decide whether to accept the document as relevant to the user's information need or reject it as not relevant.
The basic approach to profile-based filtering involves a two-step procedure. First, for each document-profile pair, a relevance score is computed. Conventionally, the computation is based on a weighted combination of the features or terms in the profile as they appear in the document. Second, that score is compared to the profile score threshold to make the binary decision to accept or reject the document. It is important that the profile score threshold be low enough to allow a sufficient number of relevant documents to be returned to the user. However, if the profile score threshold is set too low, a large number of documents will be returned, potentially inundating the user with redundant or irrelevant information. For any information profile, the optimal threshold should represent the best tradeoff between accepting more relevant documents and avoiding the acceptance of non-relevant documents. Such an optimal tradeoff is typically based on the user's stated or assumed utility preference.
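The two-step procedure described above can be illustrated with a minimal sketch. The names `Profile`, `score`, and `accept` are hypothetical, and the scoring function (a weighted sum of term counts over whitespace-separated tokens) is only one assumed instance of a "weighted combination"; the text does not prescribe a particular scoring function.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Profile:
    """Hypothetical profile: term weights plus a scoring threshold."""
    weights: Dict[str, float]  # feature (term) -> weight
    threshold: float           # minimum score required to accept a document


def score(profile: Profile, document: str) -> float:
    """Step 1: weighted sum of the profile terms as they occur in the document.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    counts: Dict[str, int] = {}
    for token in document.lower().split():
        counts[token] = counts.get(token, 0) + 1
    return sum(w * counts.get(term, 0) for term, w in profile.weights.items())


def accept(profile: Profile, document: str) -> bool:
    """Step 2: binary decision made by comparing the score to the threshold."""
    return score(profile, document) >= profile.threshold


if __name__ == "__main__":
    profile = Profile(weights={"filter": 2.0, "profile": 1.0}, threshold=3.0)
    for doc in ["The filter uses a profile", "Nothing relevant here"]:
        print(doc, "->", "accept" if accept(profile, doc) else "reject")
```

In this sketch, lowering `threshold` accepts more documents (relevant and non-relevant alike), while raising it rejects more of both, which is the tradeoff the paragraph above describes.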
In document filtering it is often the case that the user's information need is not expressed directly, as would be the case in a query to an information-retrieval system, but rather is presented to the system as a set of example documents. The set can be regarded as representing a topic or category or class. The system must process the set of documents to derive an information profile. Such a situation presents two problems. First, features must be extracted from the documents and associated with weights. Second, a threshold must be set, based on a scoring function, to establish the basis for accepting or rejecting new documents that are scored against the features in the profile.
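As a rough illustration of the first problem, the sketch below derives candidate features and weights from a set of example documents. The function name `build_partial_profile`, the `top_k` parameter, and the weighting scheme (the fraction of example documents that contain a term) are assumptions made for illustration only; practical systems typically use more refined feature selection and weighting.

```python
from collections import Counter
from typing import Dict, List


def build_partial_profile(examples: List[str], top_k: int = 5) -> Dict[str, float]:
    """Select the top_k most widely shared terms and weight them.

    Assumption: a term's weight is the fraction of example documents in
    which it appears at least once. No threshold is set here; that is the
    second problem and needs scored documents.
    """
    if not examples:
        return {}
    doc_freq: Counter = Counter()
    for doc in examples:
        # Use a set so each term counts once per document.
        doc_freq.update(set(doc.lower().split()))
    # Rank by document frequency (descending), breaking ties alphabetically.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term: count / len(examples) for term, count in ranked[:top_k]}


if __name__ == "__main__":
    examples = [
        "solar panel efficiency",
        "solar cell cost",
        "solar panel install",
    ]
    print(build_partial_profile(examples, top_k=2))
```

The result is a partial profile: features and weights without a threshold, which still has to be set against a scoring function.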
Using a set of example documents to find appropriate features and weights and to set a scoring threshold is called the “training” problem. Training typically proceeds in several steps. In the first step, the available example documents are divided into two or more subsets. One of the subsets is processed to find candidate features and weights, forming an initial, partial profile. Another subset is mixed with additional examples of documents believed not to be on the example topic. In a second step, this set of mixed documents is scored against the profile created in the first step, so that each document in the mixed set is given a score. Since the correct (on-topic) documents in the mixed set are known, the performance of the profile can be evaluated.
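The second training step can be sketched as follows, assuming the mixed set has already been scored and each score is paired with its known on-topic label. The linear utility (a fixed `gain` for each accepted on-topic document and a fixed `cost` for each accepted off-topic document) and the names `utility` and `best_threshold` are assumptions for illustration; the user's actual utility preference may take another form.

```python
from typing import List, Tuple

Scored = List[Tuple[float, bool]]  # (score, is_on_topic) for each mixed-set document


def utility(scored: Scored, threshold: float,
            gain: float = 2.0, cost: float = 1.0) -> float:
    """Assumed linear utility of accepting every document scoring >= threshold."""
    total = 0.0
    for s, on_topic in scored:
        if s >= threshold:
            total += gain if on_topic else -cost
    return total


def best_threshold(scored: Scored, gain: float = 2.0, cost: float = 1.0) -> float:
    """Return the observed score that maximizes utility when used as the threshold.

    Candidates are tried from highest to lowest, so ties resolve to the
    highest (most selective) threshold.
    """
    if not scored:
        raise ValueError("at least one scored document is required")
    candidates = sorted({s for s, _ in scored}, reverse=True)
    return max(candidates, key=lambda t: utility(scored, t, gain, cost))


if __name__ == "__main__":
    mixed = [(0.9, True), (0.8, True), (0.6, False),
             (0.5, True), (0.2, False), (0.1, False)]
    t = best_threshold(mixed)
    print("threshold:", t, "utility:", utility(mixed, t))
```

With the sample data shown, the selected threshold admits one off-topic document in order to capture a third on-topic one, because under the assumed gain and cost that trade increases total utility.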
Successful training of the type outlined above depends on several factors. One important factor is related to the identification of features in the sample documents. It is desirable that the sample documents be highly representative of the topic or category and that the features extracted from the documents be highly likely to occur in all and only the documents on the same topic or in the same category. Difficulties in training arise especially when the example documents do not accurately represent the documents that the system will have to process in the future. This may happen when there are too few examples. This may also happen when the topic or category itself is not well focused.
In general, it is difficult to use example documents for optimal training, to select representative features, to set score thresholds, and to validate the performance of the profile. It is highly desirable to have a principled and automatic method for deciding how best to create a profile and configure a filter based on it.