The World Wide Web presents users with an overwhelming volume of information. Current search engines do a good job of finding pages that are relevant to a keyword search, but suffer from major drawbacks, including: (1) they do not work for content that does not have keywords or for which keywords are a poor description; (2) they do not work well for timely or brand new content; and (3) they are typically not personalized, in that they treat every Web user identically. A solution that addresses these problems would be extremely useful.
Search engines index hundreds of millions or more web pages to make content accessible to users. Some search engines have maintained indices of web sites that were developed and maintained by individuals to make content more accessible to persons seeking the content. However, the ability of any given set of individuals to maintain comprehensive and up-to-date indices is limited. As the number of sources of digital content available on the Internet or in other networks and databases continues to grow at an increasing rate, indices of sites managed and maintained by persons who evaluate and classify each site have become unworkable.
The well-known search engine Google employs an iterative algorithm to calculate a “PageRank” based, in part, on the link structure and anchor text of web links, to identify those pages that are most highly linked and to prioritize search results. Details regarding this algorithm and its operation are discussed in the paper by Sergey Brin and Lawrence Page titled “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” which is incorporated by reference herein. The PageRank algorithm assigns a greater value to pages that are linked to more frequently than to pages that are linked to less frequently, thereby employing the structure of the Internet itself to prioritize the results of keyword searches. This approach has enjoyed considerable success at indexing the Internet. However, the PageRank algorithm has several shortcomings. In particular, it is not well suited to finding content that has recently been added to the Internet but has not persisted long enough for many others to link to it. In addition, to achieve high quality results, the PageRank algorithm depends on generating and maintaining a citation (link) graph of a substantial portion of the entire Internet, which, even in an automated system, can require considerable time and enormous computing resources. While theoretical models have been proposed to improve the efficiency of PageRank, indexing the entire Internet remains an inherently resource-intensive task. See, e.g., Haveliwala, T., “Efficient Computation of PageRank.” Finally, the prioritization approach used by the PageRank algorithm, while very useful for prioritizing pages that have the greatest number of links in the Internet as a whole, and that are presumably of greater interest to the population as a whole, is not personalized.
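For illustration, an iterative link-based ranking of the kind described above can be sketched as follows. This is a minimal sketch of a commonly published damping-factor formulation, not the production algorithm of any search engine. The function name `pagerank`, the damping value of 0.85, the iteration count, and the four-page example graph are all assumptions made here, and anchor text is not modeled.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate a rank for every page in a link graph.

    `links` maps each page to the list of pages it links to.
    All names and default values here are illustrative assumptions.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution

    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        # Each page passes an equal share of its rank along each outgoing link.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph: page "d" is new and nothing links to it yet.
example_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(example_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this hypothetical graph, page "d" has no inbound links and therefore receives the lowest rank regardless of its content, which mirrors the shortcoming noted above for recently added pages.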
Another drawback of the PageRank algorithm is that it fails to discriminate among links that are referential, endorsing, or criticizing. The PageRank algorithm considers every link into a page to be a vote for that page, even though many links to a given page may, in fact, be critical of the page or otherwise suggest that its content is untrustworthy. Extensions to the linking language of the Internet to accommodate such notions of trusted and untrusted links have been proposed by Hayes, C., in his article “Page-reRank: using trusted links to re-rank authority,” which is incorporated herein by reference. Such approaches would require the wide-scale adoption of changed practices, and fail to address the situation in which given users or communities of interest have different criteria for determining which content is more trustworthy than other content.
There has been research in the area of rating and recommendation systems pertaining to the use and propagation of trust as a device to develop improved ratings and recommendations. See e.g. Guha, R. et al., “Propagation of Trust and Distrust”; see also Guha, R., “Open Rating Systems.” However, such technologies typically have not been applied to processing sources of content or generating sources of content.
A system has been proposed in which persons give explicit ratings to each element in a network to improve the quality of content delivered to users. See Josang, A., and Ismail, R., “The Beta Reputation System.” While such a system is useful for some applications, many users, particularly consumers of information, will not take the time to offer explicit ratings by which elements of content may be scored. A system that scores the elements without requiring user ratings of all the elements is desirable.
Syndication formats or protocols allow content providers to publish, and users to subscribe to, content that is posted to web sites. RSS, or “Really Simple Syndication,” is an example of one such protocol that uses an XML-based system to allow users to receive content automatically. Such syndication formats may provide selected content, links to content, and metadata about the linked content. User programs, including browsers, feed readers, aggregators, and the like, enable users to download the content that has been added to a feed, either on demand or periodically. The content delivered by web feeds is typically webpage content, but may also include links to other webpages, image, audio, and video content, or other kinds of digital information.
RSS 2.0 organizes the data in its feeds into channels, which typically comprise one or more items. The items, in turn, comprise links and metadata, which may include such information as title, description, date of publication, and the author or source of the content. Other syndication formats, such as Atom 1.0, are evolving and contain similar types of structure for syndicating collections of links to content and metadata about those links. These various syndication formats, including not only the versions of RSS but also versions of the Atom format and other syndication formats or protocols, are commonly referred to as RSS feeds. The term RSS feed, as used in this specification, should be understood to broadly encompass these and other syndication formats or protocols.
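The channel and item structure described above can be illustrated with a short sketch that reads an RSS 2.0 document using Python's standard `xml.etree.ElementTree` module. The feed text, the example.com addresses, and the function name `parse_feed` are hypothetical and are provided only to show how links and metadata might be extracted from a feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed: one channel containing two items.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Channel</title>
    <link>http://example.com/</link>
    <description>A hypothetical feed used for illustration</description>
    <item>
      <title>First item</title>
      <link>http://example.com/first</link>
      <description>Summary of the first item</description>
      <pubDate>Mon, 06 Sep 2004 00:00:00 GMT</pubDate>
      <author>editor@example.com (Example Editor)</author>
    </item>
    <item>
      <title>Second item</title>
      <link>http://example.com/second</link>
      <description>Summary of the second item</description>
    </item>
  </channel>
</rss>"""

ITEM_FIELDS = ("title", "link", "description", "pubDate", "author")


def parse_feed(xml_text):
    """Return the channel title and a list of per-item metadata dicts."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    items = []
    for item in channel.findall("item"):
        # findtext returns None when an optional element is absent.
        items.append({field: item.findtext(field) for field in ITEM_FIELDS})
    return {"title": channel.findtext("title"), "items": items}


feed = parse_feed(SAMPLE_FEED)
print(feed["title"])
for entry in feed["items"]:
    print(entry["link"], "|", entry["title"], "|", entry["pubDate"])
```

Optional elements such as the publication date and author are reported as `None` when a feed omits them, so a consumer of the parsed items should not assume that every metadata field is present.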
Subscribing to a syndicated web publication or feed has the advantage that content that has recently been added by the publisher, blogger, portal operator, or other feed provider is immediately accessible to subscribers. Moreover, since the subscriber presumably has chosen to subscribe to a given feed due to an interest in its content, feeds are a useful technology for receiving desired content, and, depending on the process by which a particular feed was assembled, might reflect a degree of personalization on the part of those who edit the feed. However, because of the vast and increasing number of sources of feeds and related content, monitoring a large set of feeds to identify relevant content remains a burdensome problem.
Tags are a mechanism that can be used to store metadata about the elements of a network, such as the digital data content to which the links of web pages, RSS and other feeds, queries, databases, email messages and the like may refer. See e.g. Golder, S. et al., “The Structure of Collaborative Tagging Systems.” In recent years, tagging sites that provide a capability for users to manage a collection of bookmarks, together with keywords or other information relating to Internet sites, have become popular. These sites can be a useful source of content because they record that an individual has designated a specific set of links as trustworthy, and because the links are stored with metadata about the items to which the links refer. Tags or entries are also employed by feeds, which typically contain header information and several tags or entries, each of which includes one or more uniform resource locators (URLs) or links and predefined fields of metadata corresponding to those URLs.