1. Field of the Invention
The present invention relates generally to information retrieval systems and more particularly to methods and systems for ranking nodes in a collection.
2. Description of Related Art
The Internet is a global network of individual computers (such as clients and servers) linked to each other by the Internet Protocol. The World Wide Web allows client programs to retrieve information (such as Web pages or files) from the Internet based on Uniform Resource Locators (URLs; a type of Uniform Resource Identifier, or URI). An example of a client program is a web browser that runs on a user's computer to help locate web pages or files. Each web page or file is associated with a unique URL that allows client programs to specify the host server on which the web page or file is stored. The main components of a URL include the scheme, the host or server name, the port, the path, and/or a query. For example, if a user enters http://example.com/index, the scheme or access type is “http”, the host or server name is “example.com”, and the path is “/index”. Sometimes, the user might enter a query in a browser toolbar to request something specific. Instead of entering the URL of a page, a user may also follow a hypertext link to a page or resource. However, before the client computer can make a connection to the server to retrieve the page, the host or server name portion of the URL must be converted to an IP address. The Domain Name System (DNS) is a global distributed Internet database that relies on resolvers and name servers, and is used to map host or server names to their associated IP addresses.
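By way of a non-limiting illustration, the URL components described above may be extracted with a short Python sketch using the standard library function `urllib.parse.urlsplit`. The helper name `describe_url` is hypothetical and is introduced here only for illustration. Note that this parser reports the path with its leading slash.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Return the main components of a URL as a dictionary (illustrative helper)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,   # access type, e.g. "http"
        "host": parts.hostname,   # host or server name, later resolved through DNS
        "port": parts.port,       # None when the URL gives no explicit port
        "path": parts.path,       # reported with its leading "/"
        "query": parts.query,     # empty string when no query is present
    }


components = describe_url("http://example.com/index")
```

Resolving the host name to an IP address is a separate step performed through DNS and is not shown, since it requires network access.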
In this global network, each web site or web page is assigned a unique URL, and each host is associated with an identifying number called an Internet Protocol Address or IP Address. The IP Address of each web site is stored in one or more DNS servers which, in turn, provide that address to other computers in response to queries for that site. Oftentimes, users do not know the IP Address or URL of the web page containing the information they are looking for, or whether such a page even exists. In this case, a user will typically enter keywords into a search engine or link to the web site from a referring web site. To perform a keyword search, a user will often go to the site of an Internet search engine, such as Google™ or Yahoo™, and type in one or more words or phrases that are relevant to the query. In response to the keyword search, a search engine will typically return several URLs, from which the requestor can select the most appropriate web page. However, the pages returned in response to a query are often quite numerous, in which case the user is often required to sort through many results before finding the page of interest. For example, if a user types in the word “car”, the result returned by the search might be a lengthy list of web sites ranging from car manufacturers to car dealerships, car repair shops, car enthusiast clubs, and the like. For this reason, web sites are often “ranked” to further sort the results of a query by relevance.
Various techniques for ranking web pages are known in the art. U.S. Pat. No. 6,285,999 to Page describes a query-independent model for ranking pages in the World Wide Web. The patent relates to the “PageRank” algorithm which relies on the static link structure of the Web and iterative techniques to form the basis for Google's search engine page rankings. For example, if $r_{k+1}(P_i)$ is the PageRank of page $P_i$ at iteration $k+1$, the PageRank algorithm may be denoted as:
$$r_{k+1}(P_i) = \sum_{P_j \in B_{P_i}} \frac{r_k(P_j)}{|P_j|}, \qquad \text{(Eqn. 1)}$$
where $B_{P_i}$ is the set of pages backlinking to $P_i$ and $|P_j|$ is the number of out-links from page $P_j$ (1).
In general, PageRank measures the relative “popularity” or “importance” of a page based on the number of pages, or “in-links”, that point to it. As an illustration, FIG. 1 shows a directed graph representing six pages (denoted as nodes 1-6) (1). Using the depicted nodes and links, a normalized hyperlink matrix H may be formed representing the status of the links from a given node i to a node j.
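$$H = \begin{array}{c|cccccc}
 & P_1 & P_2 & P_3 & P_4 & P_5 & P_6 \\ \hline
P_1 & 0 & 1/2 & 1/2 & 0 & 0 & 0 \\
P_2 & 0 & 0 & 0 & 0 & 0 & 0 \\
P_3 & 1/3 & 1/3 & 0 & 0 & 1/3 & 0 \\
P_4 & 0 & 0 & 0 & 0 & 1/2 & 1/2 \\
P_5 & 0 & 0 & 0 & 1/2 & 0 & 1/2 \\
P_6 & 0 & 0 & 0 & 1 & 0 & 0
\end{array} \qquad \text{(Eqn. 2)}$$
For the above matrix, equation 1 may be re-written as:
$$\pi^{(k+1)T} = \pi^{(k)T} H \qquad \text{(Eqn. 3)}$$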
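As a minimal sketch only, one iteration of Eqn. 3 on the six-page matrix of Eqn. 2 may be computed in Python as follows. The uniform starting vector and the names `step`, `pi0`, and `pi1` are assumptions made for illustration. Exact fractions are used so that the arithmetic can be checked by hand.

```python
from fractions import Fraction as F

# Row i, column j holds the normalized weight of the link from page i+1
# to page j+1, copied from Eqn. 2.
H = [
    [0, F(1, 2), F(1, 2), 0, 0, 0],
    [0, 0, 0, 0, 0, 0],                    # page 2 has no out-links
    [F(1, 3), F(1, 3), 0, 0, F(1, 3), 0],
    [0, 0, 0, 0, F(1, 2), F(1, 2)],
    [0, 0, 0, F(1, 2), 0, F(1, 2)],
    [0, 0, 0, 1, 0, 0],
]


def step(pi, matrix):
    """One iteration of Eqn. 3: the row vector pi multiplied by the matrix."""
    n = len(matrix)
    return [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]


pi0 = [F(1, 6)] * 6   # assumed uniform starting vector
pi1 = step(pi0, H)    # [1/18, 5/36, 1/12, 1/4, 5/36, 1/6]
```

After this single step the entries of `pi1` sum to 5/6 rather than 1, because the rank held by page 2, which has no out-links, is not passed on to any page.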
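In reality, the hyperlink matrix of the entire Web is immense and does not always satisfy ideal conditions (for example, it contains pages with no out-links). Accordingly, many adjustments have been made to the original PageRank algorithm, resulting in the Google Matrix G:
$$G = \alpha S + (1 - \alpha)\frac{1}{n} e e^T \qquad \text{(Eqn. 4)}$$
where S is the stochastic matrix obtained from H by replacing each all-zero row with the uniform row $\frac{1}{n} e^T$, $\alpha$ is a scalar between 0 and 1, e is the column vector of all ones, and n is the number of pages.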
Thus, after various adjustments, the PageRank method turns out to be:
$$\pi^{(k+1)T} = \pi^{(k)T} G \qquad \text{(Eqn. 5)}$$
which may be solved by applying the power method to G.
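The following Python sketch illustrates, under stated assumptions, how Eqns. 4 and 5 may be applied to the six-page example. It assumes that S is formed from H by replacing each all-zero row with a uniform row, and it uses α = 0.85, a commonly cited value that is an assumption here and is not taken from this document. The function names `google_matrix` and `power_method` are hypothetical.

```python
def google_matrix(H, alpha):
    """Build G = alpha*S + (1 - alpha)*(1/n)*e*e^T (Eqn. 4) as a dense list of rows."""
    n = len(H)
    G = []
    for row in H:
        # Assumed construction of S: an all-zero row becomes a uniform row.
        s_row = row if sum(row) > 0 else [1.0 / n] * n
        G.append([alpha * s + (1.0 - alpha) / n for s in s_row])
    return G


def power_method(G, tol=1e-12, max_iter=10_000):
    """Iterate Eqn. 5 from a uniform start until the vector stops changing."""
    n = len(G)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * G[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


H = [
    [0, 1 / 2, 1 / 2, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1 / 3, 1 / 3, 0, 0, 1 / 3, 0],
    [0, 0, 0, 0, 1 / 2, 1 / 2],
    [0, 0, 0, 1 / 2, 0, 1 / 2],
    [0, 0, 0, 1, 0, 0],
]
G = google_matrix(H, alpha=0.85)   # alpha is an assumed value
ranks = power_method(G)            # one score per page, summing to 1
```

Because every row of G sums to 1 and every entry is positive, the iteration preserves total rank and assigns each page a positive score. For a graph the size of the Web, a dense matrix of this kind would be impractical, so this sketch is suitable only for small examples.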
It typically takes a long time to compute PageRank for the Web using the power method. Because PageRank takes such a long time to compute, rankings can only be updated after certain intervals. Thus, the rankings are generally not as accurate at the end of the interval as they are at the beginning. While “out-of-date” rankings might not significantly impact pages whose content rarely changes, they are not reliable for pages with rapidly changing content (such as pages providing news and current events).
There are several other notable drawbacks to PageRank as well. For one, PageRank tends to favor older pages. This is because new pages initially will not have many links (unless they are part of an existing site). Moreover, due to the reliance upon the static nature of Web links, PageRank values can be easily manipulated (e.g., by creating link farms) to improve search result rankings and monetize advertising links. For example, any page with a low PageRank can be redirected to a page with a high PageRank, thereby causing the page with the low PageRank to assume the PageRank of the page it is pointing to. In addition, pages with no incoming links can be redirected to the Google home page and by the next PageRank update, the new page will be upgraded to a higher PageRank (this is called spoofing and is another flaw in the PageRank system). These weaknesses, and others, have severely impacted the reliability of PageRank, which seeks to determine which documents are actually highly valued by the Web community. Google is known to actively penalize link farms and other schemes designed to artificially inflate PageRank. How Google identifies link farms and other PageRank manipulation tools is among Google's trade secrets.
In “Exploiting the Block Structure of the Web for Computing PageRank” (2), and in U.S. Patent Application Publication No. 2005/0033742, Kamvar and colleagues introduce a ranking technique termed “BlockRank” for speeding up the processing times of PageRank based on aggregation principles and the structure of the Web. These documents and the technology disclosed in them seek to address problems encountered by PageRank, by providing a ranking technique aimed at reducing the number of iterations required, as well as the work per iteration. In general, the BlockRank model approximates the global PageRank by dividing the webgraph into k blocks and performing calculations on a compact representation of the webgraph. The compact representation is obtained by aggregating pages of a host to a single node using conventional aggregation principles. See also references (3) and (4), below.
According to Kamvar and colleagues, local PageRank values can be calculated for each individual host by ignoring “inter-host” links. Thus, the “local PageRank vector” $\vec{l}_J$ of a block J ($G_{JJ}$) may be defined as the result of the PageRank algorithm applied only on block J (ignoring interlinks to other hosts) such that:
$$\vec{l}_J = \text{pageRank}(G_{JJ}, \vec{s}_J, \vec{v}_J) \qquad \text{(Eqn. 6)}$$
where the start vector $\vec{s}_J$ is the $n_J \times 1$ uniform probability vector, and the personalization vector $\vec{v}_J$ is the $n_J \times 1$ vector whose elements are all zero except the element corresponding to the root node of block J, whose value is 1.
In addition to local page ranks, the relative importance of each block may also be computed. Thus, assuming there are k blocks in the Web, a block graph B is created where each vertex in the graph corresponds to a block in the web graph. The weight of an edge BIJ between two blocks is given by:
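$$B_{IJ} = \sum_{i \in I,\, j \in J} A_{ij} \cdot l_i \qquad \text{(Eqn. 7)}$$
and may be written in matrix notation such that a PageRank matrix L is the $n \times k$ matrix whose columns are the local PageRank vectors $\vec{l}_J$:
$$L = \begin{pmatrix}
\vec{l}_1 & \vec{0} & \cdots & \vec{0} \\
\vec{0} & \vec{l}_2 & \cdots & \vec{0} \\
\vdots & \vdots & \ddots & \vdots \\
\vec{0} & \vec{0} & \cdots & \vec{l}_k
\end{pmatrix} \qquad \text{(Eqn. 8)}$$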
A matrix S is then defined to be the $n \times k$ matrix that has the same structure as L with all nonzero entries replaced by 1. The $k \times k$ block matrix B is then:
$$B = L^T A S \qquad \text{(Eqn. 9)}$$
where B is a transition matrix representing the transition probability of block I to block J. The PageRank algorithm may then be applied to the reduced matrix, resulting in the BlockRank vector $\vec{b}$:
$$\vec{b} = \text{pageRank}(B, \vec{v}_k, \vec{v}_k) \qquad \text{(Eqn. 10)}$$
Further according to Kamvar and colleagues, a global PageRank may be approximated using the local PageRanks $\vec{l}_J$ of the pages in each block, and the BlockRank vector $\vec{b}$ whose elements $b_J$ are the BlockRank for each block J (indicating the relative importance of the blocks). Thus, the global PageRank of a page j is approximated by its local PageRank $l_j$ weighted by the BlockRank $b_J$ of the block in which it resides. The global PageRank $\vec{x}$ may be approximated in matrix notation as:
$$\vec{x}^{(0)} = L\vec{b} \qquad \text{(Eqn. 11)}$$
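The BlockRank steps of Eqns. 6 through 11 may be sketched in Python as shown below. This is an illustrative sketch under several assumptions and is not the implementation of Kamvar and colleagues: the five-page, two-block graph is hypothetical, the first page of each block is assumed to be its root node, the vector $\vec{v}_k$ is assumed to be uniform, α = 0.85 is an assumed value, and the helper `pagerank` is a simple power method written for this example.

```python
def pagerank(adj, v, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power method with personalization vector v, started from the uniform vector.

    adj[i][j] is the weight of edge i -> j. Rows are re-normalized to sum to 1,
    and a row with no out-edges is replaced by v (an assumption of this sketch).
    """
    n = len(adj)
    rows = []
    for row in adj:
        total = sum(row)
        rows.append([w / total for w in row] if total > 0 else list(v))
    x = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [alpha * sum(x[i] * rows[i][j] for i in range(n)) + (1 - alpha) * v[j]
               for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x


# Hypothetical web graph: pages 0-2 form one block (host), pages 3-4 another.
blocks = [[0, 1, 2], [3, 4]]
links = {0: [1, 2], 1: [0], 2: [0, 3], 3: [4], 4: [3, 0]}
n = 5

# A: row-normalized transition matrix of the whole graph.
A = [[1.0 / len(links[i]) if j in links[i] else 0.0 for j in range(n)]
     for i in range(n)]

# Eqn. 6: local PageRank of each block, ignoring inter-host links.
local = {}
for pages in blocks:
    sub = [[A[i][j] for j in pages] for i in pages]
    v = [1.0] + [0.0] * (len(pages) - 1)   # all weight on the assumed root node
    for page, score in zip(pages, pagerank(sub, v)):
        local[page] = score

# Eqns. 7 and 9: B_IJ is the sum of A_ij * l_i over i in I and j in J.
B = [[sum(A[i][j] * local[i] for i in I for j in J) for J in blocks]
     for I in blocks]

# Eqn. 10: BlockRank vector, with v_k assumed uniform.
k = len(blocks)
b = pagerank(B, [1.0 / k] * k)

# Eqn. 11: x0 = L b, i.e. each local score weighted by the BlockRank of its block.
x0 = [0.0] * n
for J, pages in enumerate(blocks):
    for page in pages:
        x0[page] = local[page] * b[J]
```

In this sketch `x0` sums to 1 because each local vector and the BlockRank vector each sum to 1. The result is only a starting approximation of the global PageRank, as the text above indicates.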
One advantage to the BlockRank model, as noted by Kamvar and colleagues, is that the local PageRank vectors converge more quickly, thus requiring fewer iterations. Moreover, the local PageRanks can be computed in a distributed or parallel manner and/or pre-computed. In some cases, the local PageRanks may be re-used in future applications. A major drawback to the BlockRank approximation is that some information is lost in the compression or aggregation step by ignoring inter-host links. However, the approximation can be improved by expanding and collapsing repeatedly until convergence is reached. Another drawback is that there do not appear to be any uniform or natural geographic divisions to the blocks in the model, and therefore it might be difficult to determine what populations the blocks are representative of. More importantly, the BlockRank model only addresses static links and does not take into account traffic flow. Because BlockRank only uses static links, it suffers many of the same problems suffered by PageRank.
A traffic-based technique, termed “TrafficRank”, is used by Alexa Internet, Inc. (6). Generally speaking, Alexa calculates traffic rankings for websites by analyzing Internet traffic of millions of Alexa Toolbar users (where the traffic ranks are typically based on months of aggregated traffic data). However, one drawback to this approach is that such TrafficRank results contain inherent biases and therefore do not necessarily reflect a representative sample of the global Internet population. For example, the Alexa Toolbar only works with the Internet Explorer browser (i.e., it is not supported by Mozilla, AOL, Netscape, etc.) and on Windows operating systems. In addition, in some cases, TrafficRank calculations can take longer than PageRank calculations (especially on a large scale).
In addition to the drawbacks previously discussed, existing page ranking techniques lack the ability to perform service tasks at the ISP level of operation, thus limiting the functionality and capability of such systems and methods. Because search engines currently operate globally at central locations within the Internet cloud, queries cannot be resolved to their origin of request. Search engines are thus limited in conducting reliable business services and tracking, such as market channel tracking, web page usage, DNS statistics, and so forth. These services and tracking are currently only possible from the edge of the Internet, which is where ISPs sit.
Thus, there remains a need to rank pages more quickly and efficiently. In addition, there also exists a need to rank pages in a manner more relevant to a particular user or group of users or the behavior of users or a group of users. Furthermore, there remains a need to provide participating service partners and/or ISPs with valuable session data and/or the ability to provide more relevant information in response to one or more queries.
Link-based ranking methodologies such as PageRank and BlockRank measure the popularity or relevance of a page or site based on the static link structure of the Web. However, such techniques do not take into account the amount of Web traffic with respect to that page and therefore do not provide a true measure of page popularity or relevance.
Tomlin, in “A New Paradigm for Ranking Pages on the World Wide Web” (2003), introduces an alternative method for measuring popularity of a page based on traffic flow (5). Using an entropy-based approach, the traffic flow is subject to conservation conditions of a circulation flow in the entire World Wide Web, an aggregation of the World Wide Web, or a sub-graph of the World Wide Web. According to the traffic flow approach, a traffic rank $p_{ij}$ may be considered the proportion of all Web traffic on a link entering page j from page i (assuming the sum of all $p_{ij}$ equals 1). An optimization problem shown below may then be used to find the $p_{ij}$'s for the traffic rank model:
$$\max\; -\sum_{i,j} p_{ij} \log p_{ij} \quad \text{subject to} \quad \sum_{i,j} p_{ij} = 1, \quad \sum_{i} p_{ij} - \sum_{i} p_{ji} = 0 \ \text{for every } j, \quad p_{ij} \ge 0. \qquad \text{(Eqn. 12)}$$
Here the second constraint is the conservation condition: the traffic entering each page j equals the traffic leaving it. Thus, in contrast to PageRank or BlockRank approaches, traffic flow models measure the “popularity” of a page, or node, based on the amount of traffic to and/or from that node, rather than the number of static links to the node.
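As a hedged illustration of the entropy-maximization problem above, the following Python sketch finds the link flows for a small hypothetical graph. It relies on the fact that the maximizer of this problem has the product form $p_{ij} \propto a_i / a_j$ on the links of the graph, and it rescales one node factor at a time until inflow equals outflow at every node. This coordinate-wise scaling is one possible approach and is not necessarily the solver used in reference (5). The function name `max_entropy_flow` and the three-page graph are assumptions, and the sketch further assumes a strongly connected graph with no self-links.

```python
import math


def max_entropy_flow(edges, n, sweeps=500):
    """Return {(i, j): p_ij} that is approximately the maximum-entropy circulation.

    Assumes every node has at least one in-link and one out-link and that
    there are no self-links.
    """
    a = [1.0] * n   # one positive scaling factor per node
    for _ in range(sweeps):
        for j in range(n):
            # With p_ij proportional to a_i / a_j, the inflow to j is
            # (sum of a_i over in-links) / a_j and the outflow from j is
            # a_j * (sum of 1 / a_t over out-links). Equating them gives a_j.
            inflow = sum(a[i] for (i, t) in edges if t == j)
            outflow = sum(1.0 / a[t] for (s, t) in edges if s == j)
            a[j] = math.sqrt(inflow / outflow)
    weights = {(i, j): a[i] / a[j] for (i, j) in edges}
    total = sum(weights.values())
    return {edge: w / total for edge, w in weights.items()}   # flows sum to 1


# Hypothetical three-page graph with links 0->1, 1->2, 2->0 and 0->2.
edges = [(0, 1), (1, 2), (2, 0), (0, 2)]
p = max_entropy_flow(edges, n=3)

# Traffic through each page, which such a model may use as its popularity score.
traffic = [sum(flow for (i, j), flow in p.items() if j == node) for node in range(3)]
```

In this example page 1 has a single in-link and a single out-link, so conservation forces the flows on links 0->1 and 1->2 to be equal.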