§1.1 Field of the Invention
Embodiments consistent with the present invention concern information retrieval (IR). In particular, embodiments consistent with the present invention concern improving IR of documents, such as Web pages for example, that belong to one of a plurality of sets of documents, such as Websites for example.
§1.2 Background Information
Search engines have been very useful in helping people find information of interest on the World Wide Web (“the Web”), as well as on other networks. An exemplary search engine is described in the article S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Search Engine,” Seventh International World Wide Web Conference, Brisbane, Australia and in U.S. Pat. No. 6,285,999 (both incorporated herein by reference). A search engine may receive queries for search results. In response, the search engine may retrieve relevant search results (e.g., from an index of Web pages). Such search results may include, for example, lists of Web page titles, snippets of text extracted from those Web pages, and hypertext links to those Web pages, and may be grouped into a predetermined number of (e.g., ten) search results.
FIG. 1 is a high-level block diagram of an environment 100 that may include a network (such as the Internet for example) 160 in which an information access facility (client device) 110 is used to render information accessed from one or more content providers (e.g., Web page servers) 180. A search facility (server) 130 may be used by the information access facility 110 to search for content of interest.
The information access facility 110 may include browsing operations 112 which may include navigation operations 114 and user interface operations 116. The browsing operations 112 may access the network 160 via input/output interface operations 118. For example, in the context of a personal computer, the browsing operations 112 may be performed by a browser (such as Firefox from Mozilla, Netscape from AOL Time Warner, Opera from Opera Software, Explorer from Microsoft, etc.) and the input/output interface operations may be performed by a modem or network interface card (or NIC) and networking software. Other examples of possible information access facilities 110 include untethered devices, such as personal digital assistants (“PDAs”) and mobile telephones for example, set-top boxes, kiosks, media players, etc.
Each of the content providers 180 may include stored resources (also referred to as content) 186, resource retrieval operations 184 that access and provide content in response to a request, and input/output interface operations 182. These operations of the content providers 180 may be effected by computers, such as personal computers or servers for example. Accordingly, the stored resources 186 may be embodied as data stored on some type of storage medium, such as one or more magnetic disks, optical disks, etc. In this particular environment 100, the term “document” may be interpreted to include addressable content, such as a Web page for example.
The search facility 130 may perform crawling, indexing/sorting, and query processing functions. These functions may be performed by the same entity or separate entities. Further, these functions may be performed at the same location or at different locations. In any event, at a crawling facility 150, crawling operations 152 get content from various sources accessible via the network 160, and store such content, or a form of such content, as indicated by 154. Then, at an automated indexing/sorting facility 140, automated indexing/sorting operations 142 may access the stored content 154 and may generate a content index (e.g., an inverted index, to be described below) and content ratings (e.g., PageRanks, to be described below) 140. Finally, query processing operations 134 accept queries and return query results based on the content index (and the content ratings) 140. The crawling, indexing/sorting and query processing functions may be performed by one or more computers.
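The indexing and query processing functions just described can be illustrated with a minimal sketch. It assumes whitespace tokenization, lowercase matching, and a query that requires every term to be present; the function names `build_inverted_index` and `lookup` are hypothetical and are not taken from the cited references.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the set of IDs of the documents that contain it.

    `documents` is a dict of document ID -> text (an assumed input shape).
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def lookup(index, query):
    """Return the IDs of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Use .get so that unknown terms do not add empty entries to the index.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)
```

A practical index would also record term positions and apply content ratings when ordering the results; this sketch covers only the term-to-document mapping.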
FIG. 2 is a process bubble diagram of an advanced search facility 200. The advanced search facility 200 illustrated in FIG. 2 performs three main functions: (i) crawling; (ii) indexing/sorting; and (iii) searching. The horizontal dashed lines divide FIG. 2 into three parts corresponding to these three main functions. More specifically, the first part 150′ corresponds to the crawling function, the second part 140′ corresponds to the indexing/sorting function, and the third part 134′ corresponds to the search (or query processing) function. (Note that an apostrophe “'” following a reference number is used to indicate that the referenced item is merely one example of the item referenced by the number without an apostrophe.) Each of these parts is introduced in more detail below. Before doing so, however, a few distinguishing features of this advanced search facility 200 are introduced. The advanced search facility uses the link structure of the Web, as well as other techniques, to improve search results.
Still referring to FIG. 2, the three main parts of the advanced search facility 200 are now described further. The crawling part 150′ may be distributed across a number of machines. A single URLserver (not shown) serves lists of uniform resource locators (“URLs”) 206 to a number of crawlers. Based on these lists of URLs 206, the crawling operations 202 crawl the network 160′ and get Web pages 208. Pre-indexing operations 210 may then generate page rankings 212, as well as a repository 214, from these Web pages 208. The page rankings 212 may include a number of (URL fingerprint (i.e., a unique value), PageRank value) pairs. The repository 214 may include (URL, content type, compressed page) triples.
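The pairs and triples produced by the pre-indexing operations might be represented as in the following sketch. The fingerprint function (the first 8 bytes of a SHA-1 digest, which is unique only with high probability) and the use of zlib compression are assumptions made for illustration, as are all of the names.

```python
import hashlib
import zlib
from typing import NamedTuple


class PageRanking(NamedTuple):
    """One (URL fingerprint, PageRank value) pair."""
    url_fingerprint: int
    pagerank: float


class RepositoryEntry(NamedTuple):
    """One (URL, content type, compressed page) triple."""
    url: str
    content_type: str
    compressed_page: bytes


def url_fingerprint(url: str) -> int:
    """Derive a 64-bit fingerprint from a URL (assumed scheme)."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def make_repository_entry(url: str, content_type: str, page: bytes) -> RepositoryEntry:
    """Compress the fetched page and store it with its URL and content type."""
    return RepositoryEntry(url, content_type, zlib.compress(page))
```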
Regarding the indexing/sorting part 140′, the indexing/sorting operations 220 may generate an inverted index 226. The indexing/sorting operations 220 may also generate PageRanks 228 from the page rankings 212. The PageRanks 228 may include (document ID, PageRank value) pairs.
Regarding the query processing part 134′, the searching operations 230 may be run by a Web server and may use a lexicon 232, together with the inverted index 226 and the PageRanks 228, to generate query results in response to a query. The query results may be based on a combination of (i) information derived from PageRanks 228 and (ii) information derived from how closely a particular document matches the terms contained in the query (also referred to as the information retrieval (or “IR”) component).
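How the two kinds of information are combined is not specified above. The following sketch assumes, purely for illustration, a weighted sum of a PageRank value in [0, 1] and a simple IR component equal to the fraction of query terms found in the document. The names `ir_score`, `combined_score`, `rank` and the `weight` parameter are hypothetical.

```python
def ir_score(query, text):
    """Fraction of query terms that appear in the document text (assumed IR component)."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    words = set(text.lower().split())
    return sum(term in words for term in terms) / len(terms)


def combined_score(query, text, pagerank, weight=0.5):
    """Weighted sum of the IR component and the PageRank value (assumed combination)."""
    return weight * ir_score(query, text) + (1.0 - weight) * pagerank


def rank(query, docs, pageranks, weight=0.5):
    """Return document IDs ordered from highest to lowest combined score."""
    return sorted(
        docs,
        key=lambda doc_id: combined_score(query, docs[doc_id], pageranks[doc_id], weight),
        reverse=True,
    )
```

Under this assumed combination, a document with a high PageRank value can outrank a document that matches more of the query terms, depending on the weight chosen.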
As useful as search engines such as the one just introduced have been, there is room for improvement. Consider, for example, the following two (2) scenarios.
First, consider the search query “Ramada Cincinnati”. The present inventors believe that the most authoritative and useful search result would be the Web page on the Ramada Website that describes its hotel in downtown Cincinnati. Consequently, it would be desirable to return (information about, and a link to) that Web page as the first search result. Unfortunately, while there is a lot of evidence indicating that the main Web page of the Ramada Website is authoritative for the word “Ramada”, there might be little evidence that the Web page for its particular hotel in downtown Cincinnati is authoritative for the word “Ramada”. Consequently, at least some search engines processing the search query “Ramada Cincinnati” would return the main Web page of the Ramada Website as the first search result, even though it might not be as useful as the Web page on the Ramada Website for its hotel in downtown Cincinnati. Worse, at least some search engines might not return any Web page on the Ramada Website among their top search results.
Second, consider the search query “three seasons palo alto”. In this example, the main Web page of the Website for the “Three Seasons” restaurant does not include the address of the restaurant. Thus, although there is a lot of evidence that the main Web page for the Website of the restaurant is authoritative for Three Seasons, there is no evidence on this main Web page that suggests it pertains to Palo Alto. Note that other Web pages on the Website do indicate that the restaurant is in Palo Alto.
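The second scenario can be made concrete with a small sketch. The URLs and page texts below are invented placeholders, not the actual content of any Website, and `missing_terms` is a hypothetical helper. The sketch shows only that a check limited to the words on a single page finds no Palo Alto evidence on the main page, even though another page on the same Website contains it.

```python
# Hypothetical two-page Website; the text is invented for illustration only.
SITE = {
    "http://three-seasons.example/": "three seasons restaurant menu reservations",
    "http://three-seasons.example/contact": "three seasons contact address palo alto california",
}


def missing_terms(query, text):
    """Return the query terms that do not appear among the words of one page."""
    words = set(text.lower().split())
    return [term for term in query.lower().split() if term not in words]


def pages_matching_all_terms(query, site):
    """Return the URLs of pages that, taken alone, contain every query term."""
    return [url for url, text in site.items() if not missing_terms(query, text)]
```

With the placeholder data, `missing_terms("three seasons palo alto", SITE["http://three-seasons.example/"])` returns `["palo", "alto"]`, so a page-only match would pass over the main page.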
As the foregoing examples illustrate, an automated search engine that uses just the information directly about a Web page (e.g., the words on the Web page and their structure, the words in anchors pointing to the Web page, and the PageRank of the Web page) might not be able to find the Web page that would be the most useful for a particular query. Thus, it would be useful to improve search engines so that they return better search results. In particular, it would be useful to improve search engines (e.g., by improving the information that they process) so that while a search engine ranks the relevance of a term (e.g., a word or phrase) of a query to one Web page, it may take into account the pertinence of the term to other Web pages on the same Website. More generally, it would be useful to improve applications that use the same or similar IR techniques.