The present invention relates to information retrieval. In particular, the present invention relates to evaluating the relevance of transmitted documents that may cover a variety of topics.
The primary purpose of the invention is to help people deal with information overload. With the increasing development of communications technology, it is possible for people to feel the opposing forces of being, on the one hand, highly dependent on critical information, and on the other hand, overloaded with information to the extent that there is a need to reduce exposure to the flood of information. As a result of this conflict, people may find themselves needing to examine large numbers of documents quickly, with a significant penalty for missing critical information contained in those documents.
Various established tools exist for measuring the importance of documents to an individual. This technology, often referred to as relevance technology, allows a computer to make judgments about the importance to an individual of news articles, technical articles, mail messages or the like. This technology has proven useful for categorization and prioritization of presentation, both of which are necessary to help a user deal with a flood of information. But because of the inherent uncertainty of the relevance measure, the user who needs information prioritized still must spend time perusing many documents. Documents that are rated as highly relevant must be perused to see what, if any, useful information they contain. Documents that are rated as mildly relevant or less must be perused to make sure that nothing important is missed. Thus, nearly every article needs to be examined in some depth.
Existing relevance technology assumes that documents are homogeneous in content and relevance, and so a single relevance value is calculated for an entire document. This is because the technology was developed initially for relatively short documents such as wire-service items. As documents become longer and more varied in content, a single relevance value may be affected by separate sections of the document that contain references to unrelated topics, including some that are highly relevant and others of little or no interest to the user. This variability in content means that a single relevance number may result in either false-negative or false-positive evaluations. The only safe strategy for a reader of larger documents is to read most of the document, regardless of relevance evaluations.
Currently, either an entire document or a selected subset of it is evaluated for relevance. Either approach can cause the relevance to be misjudged, and the misjudgment can take various forms. For example, the relevance evaluation can be diluted if two unrelated sections of the document are evaluated together, because one section may be highly relevant while the other section contains material that results in a negative evaluation. In general, the user would want to be apprised of the relevant material, even when it is surrounded by irrelevant material. An example of this is the “What's News” section in the Wall Street Journal. This article typically contains several unrelated items that should, logically, be evaluated separately. For example, the first paragraph of the “What's News” section might focus on the topic “Endangered Species,” while the second paragraph might focus on the topic “Gulf War Syndrome,” and subsequent paragraphs might focus on topics entirely unrelated to any others. Therefore, while one might find the “Endangered Species” discussion highly relevant to one's needs, the entire document might not receive a high relevance value due to dilution from other topics.
In many cases, rather than evaluating the entire document, known relevance algorithms may evaluate only the first paragraph of each document. This is justified by the general understanding that news material is usually written in a particular style that ensures that the relevant material is near the beginning of the document. Of course, this is not true of articles like “What's News.” Therefore, again using “What's News” as an example, if only the first paragraph of a document is examined for relevance to “Endangered Species,” then the document discussed above would receive a maximal relevance value even though only one paragraph discusses “Endangered Species.” If only the first paragraph of the article discussed above is examined for relevance to “Gulf War Syndrome,” the document will receive a minimal relevance value even though the article does, in fact, discuss this topic.
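The two misjudgments described above can be illustrated with a minimal sketch. It assumes a toy relevance measure (the fraction of words in a text that belong to a topic's term set), which is chosen only for illustration and is not any particular known relevance algorithm. The function name `term_overlap_score`, the sample paragraphs, and the topic term sets are all hypothetical.

```python
import re

def term_overlap_score(text, topic_terms):
    """Toy relevance measure (an assumption): share of words that are topic terms."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in topic_terms)
    return hits / len(words)

# Hypothetical multi-topic article in the style of a news digest.
paragraphs = [
    "Endangered species protections were extended to more endangered wetlands habitat.",
    "Veterans groups asked for new studies of Gulf War syndrome.",
    "Stock markets closed mixed after a quiet trading session.",
]

# Hypothetical topic profiles expressed as sets of lowercase terms.
endangered_species = {"endangered", "species", "habitat"}
gulf_war_syndrome = {"gulf", "war", "syndrome"}

whole_document = " ".join(paragraphs)
first_paragraph = paragraphs[0]

for name, topic in [("Endangered Species", endangered_species),
                    ("Gulf War Syndrome", gulf_war_syndrome)]:
    whole = term_overlap_score(whole_document, topic)
    first = term_overlap_score(first_paragraph, topic)
    print(f"{name}: whole document = {whole:.3f}, first paragraph only = {first:.3f}")
```

Under this toy measure, scoring the whole document lowers the “Endangered Species” value relative to the paragraph that actually discusses it, while scoring only the first paragraph gives “Gulf War Syndrome” a value of zero even though the second paragraph covers that topic.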
The present invention introduces systems and methods for evaluating the relevance of transmitted data. In one embodiment of the present invention, a topic and a document are received, and the document is divided into various pieces. The relevance of each piece is evaluated with respect to the received topic, and these individual evaluations are combined into a surrogate representation of the relevance.
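The embodiment summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the document is assumed to be divided at blank lines, each piece is scored with a toy term-overlap measure, and the surrogate representation is assumed to be the list of per-piece scores together with the best-scoring piece. The names `split_into_pieces`, `score_piece`, `evaluate_document`, and `RelevanceSummary` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Set

@dataclass
class RelevanceSummary:
    """Assumed surrogate representation: per-piece scores plus the best piece."""
    piece_scores: List[float]
    best_index: int      # -1 when the document has no pieces
    best_score: float

def split_into_pieces(document: str) -> List[str]:
    """Divide a document into pieces; splitting at blank lines is an assumption."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

def score_piece(piece: str, topic_terms: Set[str]) -> float:
    """Toy relevance measure (an assumption): share of words that are topic terms."""
    words = re.findall(r"[a-z]+", piece.lower())
    if not words:
        return 0.0
    return sum(1 for word in words if word in topic_terms) / len(words)

def evaluate_document(document: str, topic_terms: Set[str]) -> RelevanceSummary:
    """Score every piece against the topic and combine the individual scores."""
    scores = [score_piece(p, topic_terms) for p in split_into_pieces(document)]
    if not scores:
        return RelevanceSummary([], -1, 0.0)
    best = max(range(len(scores)), key=scores.__getitem__)
    return RelevanceSummary(scores, best, scores[best])

# Hypothetical three-item digest; pieces are separated by blank lines.
sample_document = (
    "Endangered species protections were extended to more wetlands habitat.\n\n"
    "Veterans groups asked for new studies of Gulf War syndrome.\n\n"
    "Stock markets closed mixed after a quiet trading session."
)

summary = evaluate_document(sample_document, {"gulf", "war", "syndrome"})
print(summary)
```

Because each piece is scored separately, the second piece can be identified as relevant to the topic regardless of its position in the document or of the unrelated pieces around it. Other ways of combining the individual evaluations are possible; taking the best-scoring piece is only one choice.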