Users who wish to find relevant and up-to-date information from sources of data such as the Internet face a continuous deluge of new content. Grouping like content together can simplify the task of sorting through this large amount of data.
Existing technology has been used to automatically separate the content of a web-based original document. An article to Lin et al. entitled “Discovering Informative Content Blocks from Web Documents” describes a process for automatically separating redundant data from meaningful content in web text. The goal of the article is to separate meaningful data from the redundant, repetitive, and usually uninteresting data appearing on web pages.
Once the redundant data has been stripped from the page, the text content of the web page can be classified using known indexing techniques. The indexed web pages can then be evaluated by existing web search engines such as the GOOGLE, MSN or YAHOO search engines. The Lin et al. article discards as irrelevant those portions of the web pages deemed to contain redundant data, but does not change the indexing or evaluation of text pages found to have meaningful information.
A publication to Watters et al. entitled “Rating News Documents for Similarity” concerns a personalized delivery system for news documents. This publication discusses a methodology of associating news documents based on the extraction of feature phrases, where feature phrases identify dates, locations, people, and organizations. A news representation is created from these feature phrases to define news objects that can then be compared and ranked to find related news items.
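The comparison and ranking of news objects described above can be illustrated with a minimal sketch. It assumes the feature phrases have already been extracted, and it scores two news objects by the mean Jaccard overlap of their phrase sets across the four categories. The `NewsObject` class, the `similarity` and `rank_related` functions, the scoring formula, and the sample articles are all hypothetical choices made for illustration; the Watters et al. publication may use a different representation and measure.

```python
from dataclasses import dataclass, field

# The four feature-phrase categories named in the text.
CATEGORIES = ("dates", "locations", "people", "organizations")


@dataclass
class NewsObject:
    """Hypothetical news representation: feature phrases grouped by category."""
    title: str
    features: dict = field(default_factory=dict)

    def phrases(self, category):
        # Lower-case the phrases so that "Geneva" and "geneva" match.
        return {p.lower() for p in self.features.get(category, ())}


def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(x, y):
    """Mean Jaccard overlap across the four categories (an assumed measure)."""
    total = sum(jaccard(x.phrases(c), y.phrases(c)) for c in CATEGORIES)
    return total / len(CATEGORIES)


def rank_related(query, candidates):
    """Return the candidates ordered from most to least similar to the query."""
    return sorted(candidates, key=lambda c: similarity(query, c), reverse=True)


# Made-up sample data, for illustration only.
summit_day1 = NewsObject("Summit opens", {
    "dates": ["2004-06-01"], "locations": ["Geneva"],
    "people": ["A. Smith"], "organizations": ["UN"]})
summit_day2 = NewsObject("Summit continues", {
    "dates": ["2004-06-02"], "locations": ["Geneva"],
    "people": ["A. Smith"], "organizations": ["UN"]})
election = NewsObject("Local election", {
    "dates": ["2004-06-01"], "locations": ["Halifax"],
    "people": ["B. Jones"], "organizations": []})

ranked = rank_related(summit_day1, [election, summit_day2])
print([n.title for n in ranked])
```

Averaging per category keeps a long list of people from outweighing a single shared location. A weighted average would be an equally plausible choice if some categories were considered more indicative of relatedness than others.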
In the context of the larger search problem, the current invention presents only the content that is conceptually distinct, so that users can quickly browse through a large collection of information and spot those items that are of interest to them.