The present invention relates to the field of computerized information search and retrieval systems. More specifically, this invention relates to a method and apparatus for setting and updating the score threshold of a user profile.
Given the vast amount of information accessible by computer systems, particularly on distributed databases, more efficient methods of information retrieval are continually needed. Search tools often return a large volume of data, much of which may not be relevant to the user's ultimate needs. The user is then forced to sift through that volume to find what is relevant. It is therefore desirable to develop a system in which a corpus or a dynamic stream of documents is filtered sufficiently that only relevant information is returned to the user.
Profile-based filtering involves the interaction of a document or group of documents with a user profile. A stream of incoming documents is compared with certain criteria, contained in a user profile, and then either rejected or ultimately provided to the user. Conceptually, a user profile (i.e., a binary document classifier) consists of three key elements: a term vector, inverse document frequency or "IDF" statistics, and a score threshold. The first two elements are used to assign a score to the document, and the third is used to make the decision of whether to accept or reject the document as relevant or not relevant to the user's search parameters. The process of profiling is distinct from database searching in that profiling evaluates and selects or rejects individual documents as they stream in rather than evaluating all documents of a database and then selecting the best scoring ones as in traditional database searching.
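By way of illustration only, the three elements of a user profile can be sketched as a small data structure. The scoring rule used here (a sum of term weight times term frequency times IDF) is an assumption made for the example, since this section does not specify how the term vector and IDF statistics are combined. The names `Profile`, `score`, and `accept`, and all numeric values, are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Profile:
    term_weights: dict  # term vector: term -> weight (hypothetical values)
    idf: dict           # IDF statistics: term -> inverse document frequency
    threshold: float    # score threshold for the accept/reject decision


def score(profile, text):
    """Assumed scoring rule: sum of weight * term frequency * IDF."""
    tf = Counter(text.lower().split())
    return sum(
        weight * tf[term] * profile.idf.get(term, 0.0)
        for term, weight in profile.term_weights.items()
    )


def accept(profile, text):
    """Binary decision: accept when the score reaches the threshold."""
    return score(profile, text) >= profile.threshold


# Hypothetical profile about solar panels.
profile = Profile(
    term_weights={"solar": 1.0, "panel": 0.5},
    idf={"solar": 2.0, "panel": 1.0},
    threshold=2.0,
)

print(accept(profile, "Solar panel prices fall"))
print(accept(profile, "Stock market news"))
```

In this sketch the first two fields produce the score and the third field alone determines the decision, mirroring the division of roles described above.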
The basic approach to profile-based filtering involves a two-step procedure. For each document-profile pair, a relevance score is computed. That score is then compared against a profile score threshold to make the binary decision to accept or reject the document for the profile. It is important that the profile score threshold be low enough to allow a sufficient number of relevant documents to be returned to the user. However, if the profile score threshold is set too low, a large number of documents will be returned, necessitating further filtering. For any user profile, the optimal threshold should represent the best tradeoff between accepting more relevant documents and avoiding accepting non-relevant documents, where the best tradeoff is determined by the user's utility preference.
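The effect of the threshold on this tradeoff can be illustrated with a minimal sketch. It assumes the first step (scoring) has already been performed, so that each incoming document arrives as an identifier paired with its relevance score. The function name `filter_stream`, the scores, and the relevance labels are all hypothetical.

```python
def filter_stream(scored_docs, threshold):
    """Yield the IDs of accepted documents one at a time, as they arrive.

    scored_docs: iterable of (doc_id, relevance_score) pairs.
    """
    for doc_id, relevance_score in scored_docs:
        if relevance_score >= threshold:  # binary accept/reject decision
            yield doc_id


# Hypothetical stream of scored documents and made-up relevance labels.
stream = [("d1", 0.91), ("d2", 0.15), ("d3", 0.62), ("d4", 0.40), ("d5", 0.08)]
relevant = {"d1", "d3"}

for threshold in (0.8, 0.5, 0.1):
    accepted = list(filter_stream(stream, threshold))
    hits = sum(doc_id in relevant for doc_id in accepted)
    misses = len(accepted) - hits
    print(threshold, accepted, f"{hits} relevant, {misses} non-relevant")
```

With these made-up values, the highest threshold misses one relevant document, while the lowest threshold returns both relevant documents together with two non-relevant ones.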
Setting the profile score threshold can be divided into two separate parts: (a) an initial score threshold setting, before there are any relevance judgments from the user, and (b) updating the score threshold, at any point when relevance judgments are fed back into the system. Updating the profile score threshold adapts the filtering process to the user's specific requirements and thus provides a more effective means of information retrieval.
Consequently, in view of the need for more efficient searching techniques and filtering methods, a method by which the profile score threshold may be initially set and then updated during use is highly desirable. A properly set profile score threshold enables the user to search a group of documents comprehensively, so that fewer relevant documents are missed, while also helping to prevent the user from being inundated with a large number of documents.
An approach for initially setting the profile score threshold and updating the profile score threshold during use in a profile-based filtering system is described. The initial threshold is set based on an expected acceptance ratio of documents specified by the user. To set an initial threshold, a set of reference documents (i.e., a reference database) is selected. Each reference document is scored against the profile and all the reference documents are sorted by their scores. The initial threshold is then set to such a score that the ratio of reference documents with a score above it and those with a score below it equals the expected acceptance ratio. When user relevance feedback is available, the threshold can be updated based on a specific utility function specified by the user. To update a threshold, a set of historical example documents is first identified for the profile. Each example document is scored against the profile and all the example documents are sorted by their scores. Treating each example document score as a candidate threshold, a utility value is computed for that candidate. Using the utilities at each candidate threshold, the point of highest utility and the point of zero utility are then determined. An updated utility threshold is then calculated by interpolating between the threshold at the point of highest utility and the threshold at the point of zero utility, according to the formulas disclosed herein. The updated utility threshold is then used for subsequent information retrieval.
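The two procedures summarized above can be sketched as follows, under several stated assumptions. The acceptance ratio is interpreted as the fraction of all reference documents that would be accepted. The utility function is assumed to be linear, with a hypothetical credit `gain` for each accepted relevant document and a hypothetical `penalty` for each accepted non-relevant one. The zero-utility point is taken to be the first candidate below the optimum at which utility is no longer positive. The interpolation uses a fixed hypothetical coefficient `alpha` and is not the formula of the disclosure, which is not reproduced in this section. Example scores are assumed to be distinct.

```python
def initial_threshold(reference_scores, acceptance_ratio):
    """Return the score at which roughly acceptance_ratio of the reference
    documents would be accepted (assumed reading of the ratio)."""
    ranked = sorted(reference_scores, reverse=True)
    k = max(1, round(acceptance_ratio * len(ranked)))
    return ranked[k - 1]


def updated_threshold(examples, gain=3.0, penalty=2.0, alpha=0.5):
    """examples: list of (score, is_relevant) pairs with user judgments.

    gain, penalty, and alpha are hypothetical parameters.
    """
    ranked = sorted(examples, key=lambda e: e[0], reverse=True)

    # Utility of each candidate threshold: total over all examples that
    # would be accepted if the threshold were set at that example's score.
    utilities = []
    running = 0.0
    for _, is_relevant in ranked:
        running += gain if is_relevant else -penalty
        utilities.append(running)

    # Point of highest utility.
    best = max(range(len(ranked)), key=utilities.__getitem__)
    theta_opt = ranked[best][0]

    # Point of zero utility: first lower candidate with utility <= 0,
    # falling back to the lowest-scoring example if none exists.
    zero = next(
        (i for i in range(best + 1, len(ranked)) if utilities[i] <= 0),
        len(ranked) - 1,
    )
    theta_zero = ranked[zero][0]

    # Simple linear interpolation between the two thresholds.
    return alpha * theta_zero + (1 - alpha) * theta_opt


# Hypothetical reference scores and judged examples.
reference_scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.0]
examples = [
    (0.9, True), (0.8, True), (0.7, False), (0.6, True),
    (0.5, False), (0.4, False), (0.3, False), (0.2, False),
]

print(initial_threshold(reference_scores, 0.2))
print(updated_threshold(examples))
```

With these made-up values, the highest utility occurs at a candidate threshold of 0.6 and utility first becomes non-positive at 0.2, so an `alpha` of 0.5 places the updated threshold midway, at about 0.4. Setting the threshold somewhat below the empirical optimum in this way lets more documents through for further feedback, at a modest cost in utility on the examples seen so far.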