Although statistical retrieval models are now widely accepted, there has been little research on how to adapt them to the demands of high-speed document filtering. The problems of document retrieval and document filtering are similar at an abstract level, but the architectures required, the optimizations that are possible, and the quality of the information available are all different.
Retrieval of documents from an archival collection (retrospective retrieval) and filtering documents from an incoming stream of documents (document filtering or selective dissemination of information) have been described as two sides of the same coin. Both tasks consist of determining quickly how well a document matches an information need. Many of the underlying issues are the same; for example, deciding how to represent each document, how to describe the information need in a query language, what words to ignore (e.g., stop words), whether or not to stem words, and how to interpret evidence of relevance.
Many document filtering techniques are based on the assumption that effective document retrieval techniques are also effective document filtering techniques. However, when filtering research is conducted with a retrieval system, important issues can be overlooked. Different architectures are possible, and perhaps required, to rapidly compare persistent information needs to transient documents. A filtering algorithm must make decisions based upon incomplete information; it may know what has happened in the past, but it generally cannot know, nor can it generally wait to know, what documents will be seen in the near future. Traditional corpus statistics, such as inverse document frequency (idf), have different characteristics when documents are encountered one at a time. These issues are important because they determine how efficient and effective statistical document filtering systems will be in "real world" environments.
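The point about corpus statistics can be illustrated with a minimal sketch of idf maintained over a stream. The class name `StreamingIdf` and the smoothed formula log((N + 1) / (df + 1)) are assumptions chosen for illustration, not a formula taken from this paper; the only property that matters here is that the estimate is computed from documents already seen and therefore drifts as the stream advances.

```python
import math
from collections import Counter


class StreamingIdf:
    """Hypothetical helper: idf estimated only from documents seen so far."""

    def __init__(self):
        self.num_docs = 0          # N: documents observed so far
        self.doc_freq = Counter()  # df: number of observed documents containing each term

    def idf(self, term):
        # Smoothed variant (an assumption): defined even before any document arrives.
        return math.log((self.num_docs + 1) / (self.doc_freq[term] + 1))

    def observe(self, tokens):
        # Update the statistics after the filtering decision for this document.
        self.num_docs += 1
        self.doc_freq.update(set(tokens))


stats = StreamingIdf()
for tokens in (["filter", "stream"], ["stream"], ["stream", "query"]):
    # A filtering decision for this document could only use stats.idf() as it stands now.
    stats.observe(tokens)
```

A retrieval system would compute the same statistic once over a complete collection; in this sketch the value for a given term differs depending on when in the stream it is requested.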
Document filtering, also known as selective dissemination of information (SDI), is generally based on an unranked Boolean retrieval model. A user's information need is expressed by a query, also called a profile, in a query language. Sometimes a profile is actually a set of queries for one user; in this discussion, query and profile are considered synonymous. Queries are typically expressed using Boolean logic. A query either matches or does not match a document. There is no ability to partially satisfy a query, or to determine how well a document matches or satisfies a query. Instead, the emphasis is on speed, and on indexing methods that enable very fast processing of documents against profiles.
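As a rough illustration of the all-or-nothing behavior described above, the following sketch evaluates a Boolean profile against the set of words in a document. The nested-tuple query representation and the function name `matches` are hypothetical, chosen only to keep the example short; no particular SDI query language is implied.

```python
def matches(query, words):
    """Return True or False; there is no notion of a partial match."""
    if isinstance(query, str):
        return query in words            # a bare term matches if the word occurs
    op, *args = query
    if op == "and":
        return all(matches(arg, words) for arg in args)
    if op == "or":
        return any(matches(arg, words) for arg in args)
    if op == "not":
        return not matches(args[0], words)
    raise ValueError(f"unknown operator: {op!r}")


# Hypothetical profile: filtering AND (boolean OR statistical) AND NOT image
profile = ("and", "filtering", ("or", "boolean", "statistical"), ("not", "image"))
document_words = set("boolean filtering of news stories".split())
is_match = matches(profile, document_words)
```

The result is a single bit per profile and document, so two matching documents cannot be ordered by how well they satisfy the profile.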
In one example of this class of systems, each Boolean profile is analyzed to identify the least frequent trigram (LFT) that must occur whenever the profile matches a document (a necessary, but not sufficient, condition for matching). Documents are converted into a restricted alphabet and represented as a sequence of trigrams. For each profile, a table lookup determines whether its LFT is present. If not, the profile cannot possibly match the document. This first stage is designed to eliminate more than 95% of the profiles in just a few instructions each. If a profile's LFT is present, a slower second stage determines whether the document actually satisfies the Boolean query.
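The two-stage scheme can be sketched as follows. This is a simplified illustration under stated assumptions, not the system described above: profiles are restricted to conjunctions of single words, the restricted alphabet is taken to be lowercase letters plus the space, and trigram frequencies come from a small background sample. All function names are hypothetical.

```python
import re
from collections import Counter


def normalize(text):
    """Map text onto a restricted alphabet: lowercase letters and single spaces."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()


def trigrams(text):
    """Return the set of character trigrams in already-normalized text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def least_frequent_trigram(terms, trigram_counts):
    """Pick the rarest trigram among the terms of an AND-only profile.

    Every term must occur for the profile to match, so each trigram inside
    any term is a necessary condition for a match.
    """
    candidates = set()
    for term in terms:
        candidates |= trigrams(normalize(term))
    if not candidates:
        return None  # no usable trigram: the profile always goes to stage two
    return min(candidates, key=lambda t: (trigram_counts[t], t))


def filter_document(text, profiles, lfts):
    """Return the names of the profiles that the document satisfies."""
    norm = normalize(text)
    doc_trigrams = trigrams(norm)
    doc_words = set(norm.split())
    matched = []
    for name, terms in profiles.items():
        lft = lfts[name]
        if lft is not None and lft not in doc_trigrams:
            continue  # stage one: a cheap lookup rules the profile out
        if all(normalize(term) in doc_words for term in terms):
            matched.append(name)  # stage two: full (slower) Boolean check
    return matched


sample = normalize("the theory of the thermal bath and the quantum zone")
counts = Counter(sample[i:i + 3] for i in range(len(sample) - 2))
profiles = {"physics": ["quantum", "theory"], "pets": ["cat", "the"]}
lfts = {name: least_frequent_trigram(terms, counts) for name, terms in profiles.items()}
hits = filter_document("Quantum theory explained", profiles, lfts)
```

In this sketch the LFT for each profile is computed once, when the profile is registered, so the per-document cost of rejecting a profile is a single set lookup.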
It is generally accepted that statistical systems provide better precision and recall for document retrieval than do unranked Boolean systems. The growing power of computer hardware has made statistical systems increasingly practical for even large-scale document filtering environments. A common approach has been to simulate document filtering with an existing vector-space or probabilistic document retrieval system on a collection of new or recent documents. This approach is simple, effective, and has the advantage of a corpus from which to gather statistics like idf. However, it is not well-suited to immediate dissemination of new information, and it adds index creation, storage, and maintenance to the cost of document filtering.
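A minimal sketch of this batch-style simulation is given below. It assumes a plain tf-idf sum as the scoring function and a fixed score threshold; both are stand-ins chosen for brevity rather than the scoring used by any particular retrieval system, and the name `batch_filter` is hypothetical. The sketch makes the drawback visible: no profile can be evaluated until the whole batch has been collected and its statistics computed.

```python
import math
from collections import Counter


def batch_filter(batch, profiles, threshold):
    """Score every profile against every document in a collected batch."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in batch.items()}
    num_docs = len(tokenized)

    # Corpus statistics are gathered from the batch itself.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    results = {name: [] for name in profiles}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        for name, terms in profiles.items():
            # Assumed scoring: sum of tf * smoothed idf over the profile terms.
            score = sum(
                tf[t] * math.log((num_docs + 1) / (doc_freq[t] + 1)) for t in terms
            )
            if score >= threshold:
                results[name].append((doc_id, score))

    for hits in results.values():
        hits.sort(key=lambda pair: pair[1], reverse=True)  # best match first
    return results


batch = {
    "d1": "stock market news",
    "d2": "weather news",
    "d3": "stock prices fall stock",
}
ranked = batch_filter(batch, {"finance": ["stock", "market"]}, threshold=0.5)
```

Unlike the Boolean case, each delivered document carries a score, so the documents selected for a profile can be ranked.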