The World Wide Web is a massive collection of heterogeneous documents and content, and thus finding documents or content that relate to a particular subject may be challenging. Conventional Internet search engines are capable of retrieving information from the World Wide Web based upon keyword searches. With a conventional search engine, a user enters search terms or keywords that relate to the particular subject, and the search engine returns the web pages or URLs (Uniform Resource Locators) most relevant to those search terms or keywords.
Conventional search engines typically operate in two stages: a preparation stage and a search stage. In the preparation stage, the search engines scan all the documents on the World Wide Web using a web crawler and download the documents and content. The downloaded documents and content are indexed by the keywords contained within them to build a keyword index. For each web page that is crawled, all the searchable keywords are extracted, along with additional indicators of the relevance of each keyword, such as frequency of occurrence, relative font size, position within the document, and the like. In addition, a graph representing the hyperlink structure of the documents is built, where the nodes of the graph are the URLs of the documents and the edges between the nodes are the hyperlinks between the URLs corresponding to the documents. The importance of each node (URL) is determined by conventional page-rank algorithms.
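By way of illustration only, the preparation stage described above may be sketched as follows. The sketch assumes that the crawled pages are already available as an in-memory mapping from URL to text; the function names `build_keyword_index` and `build_link_graph`, the example URLs, and the choice of frequency of occurrence as the only relevance indicator are hypothetical simplifications, not the method of any particular search engine.

```python
import re
from collections import defaultdict


def build_keyword_index(pages):
    """Map each keyword to {url: frequency of occurrence in that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        # Lower-case the text and split it into alphanumeric keywords.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word][url] = index[word].get(url, 0) + 1
    return index


def build_link_graph(links):
    """Build the hyperlink graph: nodes are URLs, edges are hyperlinks."""
    graph = defaultdict(set)
    for source_url, destination_url in links:
        graph[source_url].add(destination_url)
    return graph


# Hypothetical crawled content, used only to exercise the sketch.
pages = {
    "http://example.com/a": "Search engines index web pages",
    "http://example.com/b": "Web crawlers download web pages",
}
links = [("http://example.com/a", "http://example.com/b")]

index = build_keyword_index(pages)
graph = build_link_graph(links)
```

In this sketch the keyword "web" maps to both example URLs, with a frequency of 1 for the first page and 2 for the second. Other indicators named above, such as relative font size or position within the document, could be stored alongside the frequency in the same per-URL entry.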
In the search stage, given a search item such as a keyword or a set of keywords, the search engines find all the web pages that match one or more of the keywords, and then attempt to sort the matching results in order of relevance or importance to the user based upon the search terms. In this regard, the search engines locate the matching web pages by looking up the given search terms in the keyword index. The ranking of the found documents is determined using heuristics based on the importance of the keyword in each document, the number of matching terms, and the like. All the matching web pages are sorted (or ranked) in order of estimated importance to the user, and the matching URLs are typically returned to the user in order of decreasing importance. Since the number of matching URLs can often be in the thousands, a good ranking algorithm is needed to bring the most relevant results to the top of the list for the user.
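The lookup and ranking of the search stage may be sketched, under simplifying assumptions, as follows. The function name `search`, the short page identifiers, and the particular heuristic (number of matching terms first, then total frequency of occurrence) are hypothetical choices made for illustration; actual search engines combine many more signals.

```python
def search(index, query):
    """Return matching URLs, most important first.

    `index` maps keyword -> {url: frequency}. A URL matches if it contains
    at least one query term. Ranking heuristic (an assumption of this
    sketch): more matching terms first, then higher total frequency.
    """
    # Remove duplicate query terms while keeping their original order.
    terms = list(dict.fromkeys(query.lower().split()))
    scores = {}
    for term in terms:
        for url, frequency in index.get(term, {}).items():
            matched, total = scores.get(url, (0, 0))
            scores[url] = (matched + 1, total + frequency)
    # Sort by (matching terms, total frequency) in decreasing order.
    return sorted(scores, key=lambda url: scores[url], reverse=True)


# Hypothetical keyword index, used only to exercise the sketch.
example_index = {
    "web": {"page-a": 1, "page-b": 2},
    "crawlers": {"page-b": 1},
    "index": {"page-a": 1},
}

results = search(example_index, "web crawlers")
```

Here "page-b" matches both query terms and is therefore ranked ahead of "page-a", which matches only one.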
Conventional search engines estimate the importance (or relevance) of a particular matching web page typically based on two broad aspects: the content of the web page, and the hypertext (or citation) structure of the surrounding web. First, a conventional search engine analyzes the contents of a particular web page and examines criteria such as the frequency of occurrence of the search terms, the location of the search terms (e.g., the title is more relevant than the appendix), the font size of the search terms relative to the font size of the surrounding text, the document format (e.g., certain file formats such as word processing files are usually more important than other file formats such as simple web pages), the web location of the document (e.g., documents on major web portals are more important than those on an individual's web page), and the like. Each of these factors plays a role in determining the importance of a web page.
Second, a conventional search engine exploits the hypertext link structure of the World Wide Web by viewing it as a citation index. Pages that are referred to (linked to) by more pages are likely to be more important than pages that are linked to by fewer pages. Furthermore, pages that are referred to by important pages are themselves probably more important as well. This approach is described in greater detail, for example, in U.S. Pat. No. 6,526,440 to Bharat and in Lawrence Page et al., “The PageRank Citation Ranking: Bringing Order to the Web,” Technical Report, Stanford University, 1998.
FIG. 1 is a diagram illustrating the concept of using the hypertext link structure of the World Wide Web (WWW) to refine the score of a web page on the WWW. The term “score” of a web page is used herein to refer to the ranking score of the web page used for returning search results to a user in the order of descending ranking scores, and covers the concept of “page rank” in Internet searches or other similar concepts. The nodes 102, 104, 106, 108, 110 represent web pages or URLs, and the links 112, 114, 116, 118, 120 between these nodes 102, 104, 106, 108, 110 represent hyperlinks from one web page to another. A conventional way to compute the score of a web page is to divide the score of a page equally amongst its outgoing links and propagate each divided share to the corresponding destination document. For example, assume that URLs 102, 104 initially have scores of R=10 and R=9, respectively. The score R=10 of URL 102 is equally divided along the links 112, 116 to nodes 106, 108 (each is given a score of 5). The score R=9 of URL 104 is equally divided along the links 114, 118, 120 to nodes 106, 108, 110 (each is given a score of 3). The scores of the URLs 106, 108, 110 become R=8, R=8, R=3, respectively, each being the sum of the divided scores arriving at the node along its incoming links (5+3 for node 106, 5+3 for node 108, and 3 for node 110). This process may be repeated for the next set of nodes whose scores were modified as a result of this score propagation, until a steady-state solution is reached.
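A single step of the equal-division score propagation of FIG. 1 may be sketched as follows. The node numbers and initial scores are taken from the example above; the function name `propagate_scores` and the dictionary representation of the links are assumptions of this sketch, and the repetition to a steady state is not shown.

```python
def propagate_scores(scores, out_links):
    """Divide each page's score equally amongst its outgoing links.

    `scores` maps a source node to its current score; `out_links` maps a
    source node to the list of nodes it links to. Returns the score each
    destination node receives, summed over its incoming links.
    """
    received = {}
    for source, destinations in out_links.items():
        if not destinations:
            continue  # A page with no outgoing links propagates nothing.
        share = scores.get(source, 0) / len(destinations)
        for destination in destinations:
            received[destination] = received.get(destination, 0) + share
    return received


# Values from the FIG. 1 example: URL 102 has R=10 and URL 104 has R=9.
scores = {102: 10, 104: 9}
out_links = {
    102: [106, 108],       # links 112, 116
    104: [106, 108, 110],  # links 114, 118, 120
}

result = propagate_scores(scores, out_links)
# result == {106: 8.0, 108: 8.0, 110: 3.0}
```

The result reproduces the scores R=8, R=8, R=3 stated for URLs 106, 108, 110.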
However, conventional search engines are not capable of monitoring how many times particular web pages or URLs were actually visited (i.e., the popularity of web pages) for use in determining the importance of those web pages, although the actual number of visits to a web page would strongly indicate the importance of the web page. Conventional search engines merely estimate the importance of a particular matching web page based upon the content of the page and the hypertext (or citation) structure of the surrounding web; they do not take into consideration the frequency of visits to the web page. Furthermore, when scores are propagated along the hypertext structure of the web, the score of a page is typically divided equally amongst the destination pages, without taking into consideration the relative popularity of the outgoing links from the page.
Therefore, there is a need for a method and system for monitoring and analyzing the actual popularity of pages on a network, for example, web pages. There is also a need for monitoring and analyzing the popularity of links between pages in a hyperlink network. There is also a need for a method and system for using the page popularity and/or link popularity in ranking the documents searched by a search engine.