The addition of non-textual content such as images, audio clips and video clips complements textual descriptions by engaging a user's senses in the presentation of the information described in the text. Displaying such non-textual content with links to the text also serves to attract the attention of users who may have an interest in information of the type described in the text. A content publisher may receive non-textual content of this type from media sources independently of related textual information.
Users who wish to find relevant and up-to-date information from sources of data such as the Internet face a continuous deluge of new content. Grouping like content together simplifies the task of sorting through this large amount of data.
Existing technology has been used to automatically separate the content of a web-based original document. An article by Lin et al. entitled “Discovering Informative Content Blocks from Web Documents” (SIGKDD '02, Jul. 23-26, 2002, Copyright 2002 ACM) describes a process for automatically separating redundant data from meaningful content in web text. The goal of this article is to separate meaningful data from the redundant, repetitive and usually uninteresting data appearing on web pages.
Once the redundant data has been stripped from the page, the text content of the web page can be classified using known indexing techniques. The indexed web pages can then be evaluated by existing web search engines such as Google, MSN or Yahoo. The Lin et al. article discards, as irrelevant, those portions of the web pages deemed to contain redundant data, but does not change the indexing or evaluation of text pages found to have meaningful information.
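The separation of redundant from meaningful page content described above can be illustrated with a minimal sketch. It is a deliberately simplified stand-in and not the method of the Lin et al. article: it assumes that each page has already been split into text blocks, and it treats a block as redundant when the identical block appears on more than a chosen fraction of the pages. The function name `informative_blocks`, the `threshold` parameter and the sample pages are hypothetical and chosen only for illustration.

```python
from collections import Counter


def informative_blocks(pages, threshold=0.5):
    """Keep, for each page, only the blocks that are not widely repeated.

    pages: dict mapping a page id to a list of text blocks (assumed to be
    pre-segmented; segmentation itself is outside this sketch).
    threshold: hypothetical cutoff; a block seen on more than this fraction
    of all pages is treated as redundant.
    """
    # Count the number of pages on which each block occurs (once per page).
    page_count = Counter()
    for blocks in pages.values():
        page_count.update(set(blocks))

    cutoff = threshold * len(pages)
    return {
        page_id: [b for b in blocks if page_count[b] <= cutoff]
        for page_id, blocks in pages.items()
    }


# Illustrative input: a shared navigation bar and footer surround one
# distinct headline per page.
sample_pages = {
    "a": ["Home | About", "Storm hits coast", "Copyright 2002"],
    "b": ["Home | About", "Election results", "Copyright 2002"],
    "c": ["Home | About", "Markets rally", "Copyright 2002"],
}
kept = informative_blocks(sample_pages)
```

Under these assumptions the navigation bar and footer are dropped from every page and only the per-page headline remains, which is the text that would then be passed on for indexing.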
A publication by Watters et al. entitled “Rating News Documents for Similarity” (Journal of the American Society for Information Science, 51(9): 793-804, 2000) concerns a personalized delivery system for news documents. This publication discusses a methodology of associating news documents based on the extraction of feature phrases, where feature phrases identify dates, locations, people, and organizations. A news representation is created from these feature phrases to define news objects that can then be compared and ranked to find related news items.
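The comparison and ranking of news objects built from feature phrases can be sketched as follows. This is an illustration under stated assumptions and not the scoring used in the Watters et al. publication: it assumes the feature phrases have already been extracted, represents each news object as a mapping from the four categories to sets of phrases, and scores a pair by the average Jaccard overlap across categories. The names `jaccard`, `similarity` and `rank_related`, and the sample data, are hypothetical.

```python
CATEGORIES = ("dates", "locations", "people", "organizations")


def jaccard(a, b):
    """Overlap of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0  # assumed convention: two empty sets share nothing
    return len(a & b) / len(a | b)


def similarity(x, y):
    """Average Jaccard overlap of two news objects across all categories."""
    total = sum(jaccard(x.get(c, set()), y.get(c, set())) for c in CATEGORIES)
    return total / len(CATEGORIES)


def rank_related(query, candidates):
    """Return candidate names ordered from most to least similar to query."""
    return sorted(
        candidates,
        key=lambda name: similarity(query, candidates[name]),
        reverse=True,
    )


# Illustrative news objects with pre-extracted feature phrases.
query = {
    "dates": {"2000-05-01"},
    "locations": {"Halifax"},
    "people": {"J. Smith"},
    "organizations": {"Acme"},
}
candidates = {
    "other": {"dates": {"1999-01-01"}, "locations": {"Toronto"}},
    "same_story": {
        "dates": {"2000-05-01"},
        "locations": {"Halifax"},
        "people": {"J. Smith"},
    },
}
ranking = rank_related(query, candidates)
```

Averaging per category keeps a document with many organization names from dominating the score; a weighted sum would be an equally plausible choice if some categories are considered more indicative of relatedness than others.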
In the context of the larger search problem, the current invention provides a means whereby users can quickly browse through a large collection of information and spot the items that are of interest to them, because only content that is conceptually distinct is presented.