The task of electronic document filtering involves determining which documents in a stream of incoming documents match which queries in a collection of user queries. The user queries generally consist of one or more search terms (or document properties) connected by one or more Boolean operators. This task is alternatively referred to as "selective dissemination of information," or as the "inverted query problem."
Two main types of retrieval models are used in filtering systems. The first type is a simple and intuitive Boolean retrieval model, also called the "exact match" model, in which the user queries are expressed in Boolean logic, and a document either does or does not match a given query. The second type is the probabilistic model, in which the retrieval function produces a ranking of the documents, in an order based on the estimated relevance of each document to the query.
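The difference between the two model types can be illustrated with a minimal sketch. The nested-tuple query representation, the function names, and the sample documents are assumptions made for illustration only. The `overlap_score` function is a toy stand-in for a relevance estimate, not an actual probabilistic retrieval function.

```python
def matches(query, words):
    """Exact-match evaluation: the document either satisfies the query or it does not.

    A query is a bare term (str) or a tuple such as ("AND", q1, q2),
    ("OR", q1, q2) or ("NOT", q). This representation is hypothetical.
    """
    if isinstance(query, str):
        return query in words
    op, *args = query
    if op == "AND":
        return all(matches(a, words) for a in args)
    if op == "OR":
        return any(matches(a, words) for a in args)
    if op == "NOT":
        return not matches(args[0], words)
    raise ValueError(f"unknown operator: {op}")


def overlap_score(terms, words):
    """Toy relevance score: the fraction of query terms present in the document."""
    if not terms:
        return 0.0
    return sum(t in words for t in terms) / len(terms)


docs = {
    "d1": set("stocks fell as oil prices rose".split()),
    "d2": set("oil exports rose sharply".split()),
    "d3": set("local team wins final".split()),
}

# Boolean model: a yes/no decision per document.
boolean_query = ("AND", "oil", ("NOT", "stocks"))
exact = [doc_id for doc_id, words in docs.items() if matches(boolean_query, words)]

# Ranking model: every document receives a score and is ordered by it.
ranked = sorted(
    docs,
    key=lambda doc_id: overlap_score(["oil", "prices", "rose"], docs[doc_id]),
    reverse=True,
)
```

In this sketch the Boolean query selects a subset of the documents, whereas the scoring function orders all of them, including documents with a score of zero.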
The primary difficulties in document filtering arise from the very large number of queries that must be evaluated against a high-rate stream of incoming documents. For example, news filtering on the Internet may involve dealing with a stream of one or more documents per second, with each document being filtered against millions of user queries. Comparing each document against each query is impractical, as providing hardware capable of delivering acceptable throughput is cost-prohibitive.
As a result, known filtering systems reduce the time taken to examine a document by first eliminating all queries which are irrelevant to a given document. For example, a relatively rapid test can be performed to eliminate any monotone queries (queries which do not contain non-monotone operators such as negation) which consist solely of words which are not in the document. Moreover, statistical techniques can speed up the detection of a query's relevance or irrelevance to a particular document, such as by searching the document for the least frequent trigram (three-character string) of each query and eliminating those queries whose trigram does not appear in the document.
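The two elimination tests described above might be sketched as follows. All function names, the sample queries, and the background texts used to estimate trigram frequencies are hypothetical. The trigram part further assumes that each query is a single phrase of at least three characters that must appear verbatim in the document.

```python
from collections import Counter, defaultdict


def build_word_index(queries):
    """Map each word to the ids of the monotone queries that mention it."""
    index = defaultdict(set)
    for qid, words in queries.items():
        for word in words:
            index[word].add(qid)
    return index


def candidates(index, keys):
    """Ids of queries reachable from any key (word or trigram) found in the document."""
    found = set()
    for key in keys:
        found |= index.get(key, set())
    return found


def trigrams(text):
    """All three-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_trigram_index(phrases, background):
    """Index each phrase query under its least frequent trigram only."""
    freq = Counter()
    for text in background:
        freq.update(trigrams(text))
    index = defaultdict(set)
    for qid, phrase in phrases.items():
        # Sorting first makes tie-breaking deterministic (alphabetical).
        rarest = min(sorted(trigrams(phrase)), key=lambda t: freq[t])
        index[rarest].add(qid)
    return index


doc = "oil prices rose again"

# Word-level test: only the vocabulary of each monotone query matters here.
queries = {"q1": {"oil", "prices"}, "q2": {"football"}, "q3": {"oil", "exports"}}
word_index = build_word_index(queries)
word_candidates = candidates(word_index, doc.split())

# Trigram-level test over phrase queries.
phrases = {"p1": "oil prices", "p2": "quiz night"}
background = ["oil prices rose", "oil exports fell", "prices of oil", "night prices"]
trigram_index = build_trigram_index(phrases, background)
trigram_candidates = candidates(trigram_index, trigrams(doc))
```

Both tests only narrow the set of queries. A query that survives, such as `q3` here, has not been shown to match and must still be evaluated in full against the document.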
However, after the irrelevant queries have been eliminated, the document still must be tested against the (possibly large) number of remaining queries, and such testing is relatively slow. Moreover, if non-monotone operators are allowed, the pre-filtering elimination process becomes significantly more complicated, because a query containing negation can match a document that contains none of the query's words. Lastly, probabilistic filtering approaches are even slower than exact-match approaches, and thus even more difficult to scale to large systems.
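The complication introduced by non-monotone operators can be seen in a short sketch. The query representation and the function names are assumptions made for illustration. The pre-filter shown is a simple word-overlap test and is not taken from any particular known system.

```python
def matches(query, words):
    """Evaluate a nested-tuple Boolean query against a set of document words."""
    if isinstance(query, str):
        return query in words
    op, *args = query
    if op == "AND":
        return all(matches(a, words) for a in args)
    if op == "OR":
        return any(matches(a, words) for a in args)
    if op == "NOT":
        return not matches(args[0], words)
    raise ValueError(f"unknown operator: {op}")


def terms(query):
    """Collect every word mentioned anywhere in the query."""
    if isinstance(query, str):
        return {query}
    return set().union(*(terms(a) for a in query[1:]))


def survives_prefilter(query, words):
    """Word-overlap pre-filter: keep the query only if it shares a word with the document."""
    return bool(terms(query) & words)


doc = {"oil", "prices", "rose"}

# Monotone query with no word in the document: safely eliminated.
irrelevant = ("OR", "football", "tennis")

# Non-monotone query with no word in the document: eliminated, yet it matches.
negated = ("NOT", "football")

results = {
    "irrelevant": (survives_prefilter(irrelevant, doc), matches(irrelevant, doc)),
    "negated": (survives_prefilter(negated, doc), matches(negated, doc)),
}
```

For the monotone query, the pre-filter and the full evaluation agree. For the negated query, the pre-filter discards a query that the document satisfies, so a system that allows negation cannot rely on this test alone.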