To find information in related databases, a computerized search is performed. For example, on the World Wide Web, it is often useful to search for web pages of interest to a user. Various techniques are used, including providing key words as the search argument. The key words are often related by Boolean expressions. Search arguments may be selectively applied to portions of documents such as the title or body, or to domain URL names, for example. The searches may take into account date ranges as well. A typical search engine will present the results of the search with a representation of the page found, including a title, a portion of text, an image or the address of the page. The results are typically arranged in list form at the user's display with some indication of the relative relevance of the results. For instance, the most relevant result is at the top of the list, followed by the other results in decreasing order of relevance. Other techniques for indicating relevance include a relevance number, a widget such as a number of stars, or the like. The user is often presented with a link as part of the result such that the user can operate a GUI element, such as a cursor-selected display item, in order to navigate to the page of the result item. Other well-known techniques include performing a nested search, wherein a first search is performed followed by a search within the records returned from the first search. Today many search engines exist that are expressly designed to search for web pages via the Internet within the World Wide Web. Various techniques are utilized to improve the user experience by providing relevant search results.
Traditionally, graph analysis based rank engines such as GOOGLE's PAGERANK (GOOGLE and PAGERANK are trademarks of GOOGLE Inc.) have presumed only a single type of link, the hyper-link.
GOOGLE is a World Wide Web search engine found at www.GOOGLE.com. The GOOGLE search engine ranks pages found in a search using GOOGLE's PAGERANK application. GOOGLE's PAGERANK is described on the World Wide Web at www.webworkshop.net/PAGERANK.html in an article “GOOGLE's PAGERANK Explained and how to make the most of it” by Phil Craven, incorporated herein by reference.
GOOGLE's PAGERANK is a numeric value that represents how important a page is on the web. GOOGLE figures that when one page links to another page, it is effectively casting a vote for the other page. The more votes that are cast for a page, the more important the page must be. Also, the importance of the page that is casting the vote determines how important the vote itself is. GOOGLE calculates a page's importance from the votes cast for it. The importance of each vote is taken into account when a page's PAGERANK is calculated.
According to the referenced Craven article: To calculate the PAGERANK for a page, all of its inbound links are taken into account. These are links from within the site and links from outside the site.
PR(A) = (1 − d) + d(PR(t1)/C(t1) + . . . + PR(tn)/C(tn))
That's the equation that calculates a page's PAGERANK. It's the original one that was published when PAGERANK was being developed, and it is probable that GOOGLE uses a variation of it but they aren't telling us what it is. It doesn't matter though, as this equation is good enough.
In the equation ‘t1-tn’ are pages linking to page A, ‘C’ is the number of outbound links that a page has and ‘d’ is a damping factor, usually set to 0.85.
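As an illustration only, the equation can be written as a short Python function. The function name, the representation of inbound links as (PAGERANK, outbound link count) pairs, and the sample inputs are assumptions made for this sketch and do not come from the Craven article.

```python
def pagerank(inbound, d=0.85):
    """Evaluate PR(A) = (1 - d) + d * (PR(t1)/C(t1) + ... + PR(tn)/C(tn)).

    inbound: list of (pr, outbound_count) pairs, one per page t linking to A.
    d: damping factor, usually set to 0.85.
    """
    return (1 - d) + d * sum(pr / c for pr, c in inbound)


# Hypothetical inputs: two linking pages, each with PAGERANK 1.0,
# one having 2 outbound links and the other having 4.
example = pagerank([(1.0, 2), (1.0, 4)])  # 0.15 + 0.85 * (0.5 + 0.25)
```

A page with no inbound links evaluates to 1 − d, that is, 0.15 when d is 0.85.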
We can think of it in a simpler way:—
a page's PAGERANK=0.15+0.85*(a “share” of the PAGERANK of every page that links to it)
“share”=the linking page's PAGERANK divided by the number of outbound links on the page.
A page “votes” an amount of PAGERANK onto each page that it links to. The amount of PAGERANK that it has to vote with is a little less than its own PAGERANK value (its own value * 0.85). This value is shared equally between all the pages that it links to.
From this, we could conclude that a link from a page with PR4 and 5 outbound links is worth more than a link from a page with PR8 and 100 outbound links. The PAGERANK of a page that links to yours is important but the number of links on that page is also important. The more links there are on a page, the less PAGERANK value your page will receive from it.
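The comparison can be checked with simple arithmetic. The sketch below treats the PR4 and PR8 figures as linear values, which is the assumption this conclusion rests on; the function name is hypothetical.

```python
def link_value(pr, outbound_links, d=0.85):
    """PAGERANK voted onto one linked page: the damped value shared equally."""
    return d * pr / outbound_links


# Linear-scale assumption: PR4 is taken as 4 and PR8 as 8.
pr4_link = link_value(4, 5)     # 0.85 * 4 / 5
pr8_link = link_value(8, 100)   # 0.85 * 8 / 100
```

Under this assumption the PR4 link contributes about 0.68 and the PR8 link about 0.068, so the PR4 link is worth roughly ten times as much.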
If the PAGERANK value differences between PR1, PR2, . . . PR10 were equal then that conclusion would hold up, but many people believe that the values between PR1 and PR10 (the maximum) are set on a logarithmic scale, and there is very good reason for believing it. Nobody outside GOOGLE knows for sure one way or the other, but the chances are high that the scale is logarithmic, or similar. If so, it means that it takes a lot more additional PAGERANK for a page to move up to the next PAGERANK level than it did to move up from the previous PAGERANK level. The result is that it reverses the previous conclusion, so that a link from a PR8 page that has lots of outbound links is worth more than a link from a PR4 page that has only a few outbound links.
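The reversal can be illustrated under a purely hypothetical scale. The sketch below assumes that each PAGERANK level corresponds to an underlying value of base raised to the level, with a base of 5 chosen only for illustration; the actual scale, if it is logarithmic at all, is not public.

```python
ASSUMED_BASE = 5  # hypothetical base; the real scale is known only to GOOGLE


def link_value_log(level, outbound_links, d=0.85, base=ASSUMED_BASE):
    """Value voted by one link if level n stands for an underlying value base**n."""
    underlying = base ** level
    return d * underlying / outbound_links


pr4_link = link_value_log(4, 5)     # 0.85 * 625 / 5
pr8_link = link_value_log(8, 100)   # 0.85 * 390625 / 100
```

With this assumed base, the PR8 page with 100 outbound links votes far more per link than the PR4 page with 5 outbound links, which is the opposite of the linear-scale result.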
Whichever scale GOOGLE uses, we can be sure of one thing. A link from another site increases our site's PAGERANK.
Note that when a page votes its PAGERANK value to other pages, its own PAGERANK is not reduced by the value that it is voting. The page doing the voting doesn't give away its PAGERANK and end up with nothing. It isn't a transfer of PAGERANK. It is simply a vote according to the page's PAGERANK value. It's like a shareholders meeting where each shareholder votes according to the number of shares held, but the shares themselves aren't given away. Even so, pages do lose some PAGERANK indirectly, as we'll see later.
For a page's calculation, its existing PAGERANK (if it has any) is abandoned completely and a fresh calculation is done where the page relies solely on the PAGERANK “voted” for it by its current inbound links, which may have changed since the last time the page's PAGERANK was calculated.
The equation shows clearly how a page's PAGERANK is arrived at. But what isn't immediately obvious is that it can't work if the calculation is done just once. Suppose we have 2 pages, A and B, which link to each other, and neither has any other links of any kind. This is what happens:—
Step 1: Calculate page A's PAGERANK from the value of its inbound links
Page A now has a new PAGERANK value. The calculation used the value of the inbound link from page B. But page B has an inbound link (from page A) and its new PAGERANK value hasn't been worked out yet, so page A's new PAGERANK value is based on inaccurate data and can't be accurate.
Step 2: Calculate page B's PAGERANK from the value of its inbound links
Page B now has a new PAGERANK value, but it can't be accurate because the calculation used the new PAGERANK value of the inbound link from page A, which is inaccurate.
It's a Catch 22 situation. We can't work out A's PAGERANK until we know B's PAGERANK, and we can't work out B's PAGERANK until we know A's PAGERANK.
Now that both pages have newly calculated PAGERANK values, can't we just run the calculations again to arrive at accurate values? No. We can run the calculations again using the new values and the results will be more accurate, but we will always be using inaccurate values for the calculations, so the results will always be inaccurate.
The problem is overcome by repeating the calculations many times. Each time produces slightly more accurate values. In fact, total accuracy can never be achieved because the calculations are always based on inaccurate values. 40 to 50 iterations are sufficient to reach a point where any further iterations wouldn't produce enough of a change to the values to matter. This is precisely what GOOGLE does at each update, and it's the reason why the updates take so long.
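The repeated calculation for the two-page example can be sketched as follows. This is a minimal illustration, not GOOGLE's implementation: the function name, the dictionary representation of links, and the starting value of 0 are assumptions made for the sketch.

```python
def iterate_pagerank(links, d=0.85, iterations=50, start=0.0):
    """Repeat the PAGERANK calculation for every page a fixed number of times.

    links: maps each page to the list of pages it links to.
    """
    pr = {page: start for page in links}
    for _ in range(iterations):
        for page in links:
            # Sum the share voted by each page that links to this one, using
            # the most recently calculated values, as in Steps 1 and 2.
            share = sum(
                pr[src] / len(out) for src, out in links.items() if page in out
            )
            pr[page] = (1 - d) + d * share
    return pr


# Pages A and B link only to each other.
ranks = iterate_pagerank({"A": ["B"], "B": ["A"]})
```

Each pass moves both values closer to 1.0, the point at which recalculating either page no longer changes its value; after 50 passes the remaining difference is far too small to matter.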
One thing to bear in mind is that the results we get from the calculations are proportions. The figures must then be set against a scale (known only to GOOGLE) to arrive at each page's actual PAGERANK. Even so, we can use the calculations to channel the PAGERANK within a site around its pages so that certain pages receive a higher proportion of it than others.
The GOOGLE algorithm is further discussed in “The Anatomy of a Large-Scale Hypertextual Web Search Engine” by Brin and Page on the World Wide Web at: “citeseer.ist.psu.edu/cache/papers/cs/13017/http:zSzzSzwww-db.stanford.eduzSzpubzSzpaperszSzGOOGLE.pdf/brin98anatomy.pdf” and incorporated herein by reference.
U.S. Patent Application Publication No. 2002/0129014A1 “Systems and methods of retrieving relevant information” filed Jan. 10, 2001, incorporated herein by reference, provides systems and methods of retrieving pages according to the quality of the individual pages. The rank of a page for a keyword is a combination of intrinsic and extrinsic ranks. Intrinsic rank is the measure of the relevancy of a page to a given keyword as claimed by the author of the page, while extrinsic rank is a measure of the relevancy of a page to a given keyword as indicated by other pages. The former is obtained from analysis of keyword matching in various parts of the page, while the latter is obtained from context-sensitive connectivity analysis of the links connecting the entire Web. The application also provides methods to iteratively and efficiently solve the self-consistent equation satisfied by the page weights. The ranking mechanism for a multi-word query is also described. Finally, the application provides a method to obtain more relevant page weights by dividing the entire set of hypertext pages into a distinct number of groups.
U.S. Pat. No. 6,701,305 “Methods, apparatus and computer program products for information retrieval and document classification utilizing a multidimensional subspace” filed Oct. 20, 2001 and incorporated herein by reference describes methods, apparatus and computer program products for retrieving information from a text data collection and for classifying a document into none, one or more of a plurality of predefined classes. In each aspect, a representation of at least a portion of the original matrix is projected into a lower dimensional subspace and those portions of the subspace representation that relate to the term(s) of the query are weighted following the projection into the lower dimensional subspace. In order to retrieve the documents that are most relevant with respect to a query, the documents are then scored with documents having better scores being of generally greater relevance. Alternatively, in order to classify a document, the relationship of the document to the classes of documents is scored with the document then being classified in those classes, if any, that have the best scores.
The prior art fails to consider page link attributes when ranking documents.
The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners. It is based on the Resource Description Framework (RDF), which integrates a variety of applications using XML for syntax and URIs for naming. Information about RDF can be found in the “Resource Description Framework (RDF) Model and Syntax Specification” at “www.w3.org/TR/1999/REC-rdf-syntax-19990222”; the “Resource Description Framework (RDF) Schema Specification” at “www.w3.org/TR/1999/PR-rdf-schema-19990303”; and the “RDF/XML Syntax Specification (Revised)” at “www.w3.org/TR/rdf-syntax-grammar”, all of which are incorporated herein by reference.
“The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.” (Tim Berners-Lee, James Hendler, Ora Lassila, “The Semantic Web,” Scientific American, May 2001.) More information about the Semantic Web can be found on the World Wide Web in the W3C Technology and Society Domain document “Semantic Web” at www.w3.org/2001/sw, incorporated herein by reference.
An improved search method is needed that more particularly identifies the importance of search targets, thereby providing improved search results.