The present invention relates in general to information delivery systems and methods, and in particular to systems and methods for presenting information based at least in part on publisher-selected labels. The labels are applied to content items by publishers and used to identify related content items in various situations.
The World Wide Web (Web), as its name suggests, is a decentralized global collection of interlinked information—generally in the form of “pages” that may contain text, images, and/or media content—related to virtually every topic imaginable. A user who knows or finds a uniform resource locator (URL) for a page can provide that URL to a Web client (generally referred to as a browser) and view the page almost instantly. Since Web pages typically include links (also referred to as “hyperlinks”) to other pages, finding URLs is generally not difficult.
What is difficult for most users is finding URLs for pages that are of interest to them. The sheer volume of content available on the Web has turned the task of finding a page relevant to a particular interest into what may be the ultimate needle-in-a-haystack problem. To address this problem, an industry of search providers (e.g., Yahoo!, MSN, Google) has evolved. A search provider typically maintains a database of Web pages in which the URL of each page is associated with information (e.g., keywords, category data, etc.) reflecting its content. The search provider also maintains a search server that hosts a search page (or site) on the Web. The search page provides a form into which a user can enter a query that usually includes one or more terms indicative of the user's interest. Once a query is entered, the search server accesses the database and generates a list of “hits,” typically URLs for pages whose content matches keywords derived from the user's query. This list is provided to the user. Since queries can often return hundreds, thousands, or in some cases millions of hits, search providers have developed sophisticated algorithms for ranking the hits (i.e., determining an order for displaying hits to the user) such that the pages most relevant to a given query are likely to appear near the top of the list. Typical ranking algorithms take into account not only the keywords and their frequency of occurrence but also other information such as the number of other pages that link to the hit page, popularity of the hit page among users, and so on.
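By way of illustration only, the keyword database and ranking step described above can be sketched as follows. The page URLs, page texts, inbound-link counts, the `link_weight` parameter, and the scoring formula are all hypothetical assumptions introduced for this sketch; actual search providers use far more sophisticated, and generally undisclosed, ranking algorithms.

```python
import re
from collections import Counter, defaultdict


def terms(text):
    """Split text into lowercase keywords (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Hypothetical database: URL -> (page text, number of other pages linking to it).
PAGES = {
    "http://example.com/a": ("needle crafts and needle cases", 2),
    "http://example.com/b": ("finding a needle in a haystack", 40),
    "http://example.com/c": ("haystack storage for farms", 5),
}


def build_index(pages):
    """Associate each keyword with the URLs containing it and its frequency."""
    index = defaultdict(dict)
    for url, (text, _links) in pages.items():
        for term, count in Counter(terms(text)).items():
            index[term][url] = count
    return index


def search(query, pages, index, link_weight=0.1):
    """Return hit URLs ordered by an assumed score:
    keyword frequency plus a weighted inbound-link count."""
    scores = defaultdict(float)
    for term in terms(query):
        for url, count in index.get(term, {}).items():
            scores[url] += count
    for url in scores:
        scores[url] += link_weight * pages[url][1]
    # Highest score first; ties broken by URL for a stable order.
    return sorted(scores, key=lambda u: (-scores[u], u))


INDEX = build_index(PAGES)
hits = search("needle haystack", PAGES, INDEX)
```

In this sketch, the page with many inbound links is ranked ahead of a page that merely repeats a query keyword, which reflects the general idea that ranking considers more than keyword frequency alone.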
To further facilitate use of their services, some search providers now offer “search toolbar” add-ons for Web browser programs. A search toolbar typically provides a text box into which the user can type a query and a “Submit” button for submitting the query to the search provider's server. Once installed by the user, the search toolbar is generally visible no matter what page the user is viewing, enabling the user to enter a query at any time without first navigating to the search provider's Web site. Searches initiated via the toolbar are processed in the same way as searches initiated at the provider's site; the only difference is that the user is spared the step of navigating to the search provider's site.
One technique for helping a user find content is to provide an interface via which the user can request “related” pages. Pages can be identified as related based on similarity of their content to that of the currently viewed page and/or on whether the pages are published by the same entity. As implemented in existing systems, neither approach to identifying related pages is very reliable.
Existing algorithms for identifying related pages based on similarity of content generally rely on overlap of textual elements (words, phrases, etc.) between the current page and the related page. The “best” matches according to such algorithms have the most overlap with the current page; however, the pages with the most overlap are often the least interesting to the user, who typically wants to find pages with different information on the same subject. Determining whether two pages relate to the same subject is a difficult task, as it requires determining the subject of each page, which might or might not be evident from the words used.
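The limitation of overlap-based matching can be illustrated with a minimal sketch. The Jaccard measure used here (shared words divided by total distinct words) is an assumption chosen for simplicity, not a description of any particular existing system, and the three page texts are hypothetical.

```python
import re


def tokens(text):
    """Return the set of lowercase words in a page's text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(page_a, page_b):
    """Jaccard overlap: shared words divided by total distinct words."""
    a, b = tokens(page_a), tokens(page_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical page texts.
current = "Apple pie recipe with cinnamon and butter"
near_copy = "Apple pie recipe with cinnamon and butter and sugar"
same_subject = "How to bake a classic fruit tart dessert"

score_copy = overlap_score(current, near_copy)        # nearly identical wording
score_subject = overlap_score(current, same_subject)  # same subject, no shared words
```

Under these assumptions, the near-duplicate page scores far higher than the page that offers different information on a related subject, even though the latter is more likely to be what the user wants.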
Identifying pages published by the same entity is sometimes easier but is of limited help to the user. The publisher's own pages can sometimes be identified by URL, on the assumption that URLs beginning with the same domain name are commonly owned, but this assumption is not always reliable. For instance, some domains host content created by multiple independent publishers, and some publishers use multiple domains. Further, domain-name matching does not provide a way to identify affiliates of a publisher, since the affiliates typically use different domains.
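A minimal sketch of domain-name matching, and of the two failure cases noted above, is given below. The function names and the URLs are hypothetical and are used only for illustration; the comparison shown is a simple exact match on the host portion of each URL.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Extract the host name from a URL (empty string if none is present)."""
    return (urlsplit(url).hostname or "").lower()


def same_publisher_guess(url_a, url_b):
    """Guess common ownership by assuming pages on the same host share a publisher."""
    host_a = host_of(url_a)
    return host_a != "" and host_a == host_of(url_b)


# Hypothetical case 1: one hosting domain, two independent publishers.
# The guess is True even though the pages are not commonly owned.
shared_host = same_publisher_guess(
    "http://hosting.example.com/alice/index.html",
    "http://hosting.example.com/bob/index.html",
)

# Hypothetical case 2: one publisher using two domains.
# The guess is False even though the pages are commonly owned.
split_domains = same_publisher_guess(
    "http://www.example.com/news",
    "http://www.example.org/news",
)
```

The first case yields a false match and the second a missed match, which is why a host comparison alone cannot identify a publisher's pages or those of its affiliates.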
Therefore, it would be desirable to provide systems and methods for more efficiently identifying related content.