Search engines are vital for helping a user find specific information in the vast expanse of the World Wide Web (WWW or Web). Because the Web continues to grow at a phenomenal rate, it would be virtually impossible to locate anything on the Web without knowing a specific address if not for search engines. Generally, a search engine refers to a system that maintains an index structure of a collection of documents to efficiently generate a list of documents that contain specified keywords and ranks the document list according to a relevance measurement. Global search engines, which are popular and widespread, are used to search the entire Web, while local search engines are used to search web sites and intranets.
Many types of popular and effective global search engines use link analysis to quickly and efficiently search the entire Web. These search engines analyze links to rank web sites (or pages) according to, among other things, the quality and quantity of other sites that link to them. In general, a link (in a hypertext context such as the Web) is a reference to another page or site. When a user clicks on a link within a site, the user is taken to the other site. In theory, the more sites that link to a certain site, the higher the ranking the search engine will give that web site, because more links indicate a higher level of popularity among users.
Link analysis techniques (such as HITS and PageRank) are widely used to analyze the importance of a page. In both the HITS and PageRank techniques, the Web is represented as a directed graph $G = \{V, E\}$, where $V$ stands for the web pages $w_i$, and $E$ stands for the hyperlinks $l_{i,j}$ between two pages, with $l_{i,j}$ denoting a link from $w_i$ to $w_j$. For the HITS technique, each web page $w_i$ has both a hub score $h_i$ and an authority score $a_i$. The hub score of $w_i$ is the sum of the authority scores of all pages that are pointed to by $w_i$; the authority score of $w_i$ is the sum of the hub scores of all pages that point to $w_i$, as shown in the following equations.
$$a_i = \sum_{j:\, l_{j,i} \in E} h_j, \qquad h_i = \sum_{j:\, l_{i,j} \in E} a_j$$
The final authority and hub scores of every web page are obtained through an iterative update process.
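The iterative update described above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the function name `hits`, the fixed iteration count, and the Euclidean normalization applied after each pass are assumptions made here so that the scores stay bounded; they are not specified in the text.

```python
import math


def hits(edges, iterations=50):
    """Iteratively update authority and hub scores on a directed graph.

    edges: iterable of (i, j) pairs, each meaning page i links to page j.
    Returns two dicts: authority scores and hub scores per page.
    """
    edges = list(edges)
    nodes = sorted({page for edge in edges for page in edge})
    auth = {page: 1.0 for page in nodes}
    hub = {page: 1.0 for page in nodes}

    for _ in range(iterations):
        # a_i: sum of the hub scores of the pages that point to i.
        new_auth = {page: 0.0 for page in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]

        # h_i: sum of the authority scores of the pages that i points to.
        new_hub = {page: 0.0 for page in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]

        # Assumption: rescale to unit Euclidean length so values do not grow
        # without bound; only the relative ordering of scores matters.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {page: v / a_norm for page, v in new_auth.items()}
        hub = {page: v / h_norm for page, v in new_hub.items()}

    return auth, hub


# Hypothetical three-link graph: A and B both point to C, and C points to D.
authority, hubs = hits([("A", "C"), ("B", "C"), ("C", "D")])
```

In this hypothetical graph, page C receives the highest authority score because two pages point to it, while pages A and B receive the highest hub scores because they point to C.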
PageRank is a core algorithm of the popular Google search engine (http://www.google.com). PageRank measures the importance of web pages. Specifically, PageRank uses the whole linkage graph of the Web to compute a universal, query-independent rank value for each page. A user's browsing behavior is modeled as a random surfing model. This model assumes that a user either follows a link from the current page or jumps to a random page in the graph. The PageRank of a page $w_i$ is then computed by the following equation:
$$PR(w_i) = \frac{\varepsilon}{n} + (1 - \varepsilon) \times \sum_{l_{j,i} \in E} \frac{PR(w_j)}{\mathrm{outdegree}(w_j)}$$
where $\varepsilon$ is a damping factor, which is usually set between 0.1 and 0.2, $n$ is the number of nodes in $G$, and $\mathrm{outdegree}(w_j)$ is the number of edges leaving page $w_j$ (i.e., the number of hyperlinks on page $w_j$). The PageRank can be computed by an iterative algorithm and corresponds to the principal eigenvector of a matrix derived from the adjacency matrix of the available portion of the Web.
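The iterative computation of this equation can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `pagerank`, the default of 0.15 for the damping factor, and the fixed iteration count are choices made here, not taken from the text. The sketch applies the equation as written and therefore does not treat pages without outgoing links specially, so the ranks sum to one only when every page has at least one outgoing link.

```python
def pagerank(edges, epsilon=0.15, iterations=100):
    """Compute PR(w_i) = epsilon/n + (1 - epsilon) * sum(PR(w_j) / outdegree(w_j)).

    edges: iterable of (j, i) pairs, each meaning page j links to page i.
    Returns a dict mapping each page to its rank value.
    """
    edges = list(edges)
    nodes = sorted({page for edge in edges for page in edge})
    n = len(nodes)

    # outdegree(w_j): number of edges leaving page j.
    outdegree = {page: 0 for page in nodes}
    for src, _ in edges:
        outdegree[src] += 1

    # Start from a uniform distribution over all pages.
    rank = {page: 1.0 / n for page in nodes}

    for _ in range(iterations):
        # Random-jump term: every page receives epsilon / n.
        new_rank = {page: epsilon / n for page in nodes}
        # Link-following term: each page j shares its rank equally
        # among the pages it links to.
        for src, dst in edges:
            new_rank[dst] += (1 - epsilon) * rank[src] / outdegree[src]
        rank = new_rank

    return rank


# Hypothetical graph in which every page has at least one outgoing link.
ranks = pagerank([("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")])
```

In this hypothetical graph, page C ranks above page B because C is linked from both A and B, whereas B receives only half of A's rank.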
Although these global search engines work relatively well for searching the Web, they are unavailable for local searches, such as searches of a web site or an intranet. A web site can be thought of as a closed space on the Web where data and information are available to a user. For example, web sites include enterprise portals (allowing document access and product information), service providers (including access to news and magazines), educational institutions providing online courses and document access, and user groups, to name a few. To obtain specific and up-to-date information, a user will often go directly to a specific web site and conduct a site search. However, in addition to being unavailable for local searches, global search engines are also impractical for local searching because the link structure of a web site or intranet is different from that of the Web. In the closed sub-space of a web site or intranet, local search engines must be used.
Existing local (or small web) search engines generally use the same link analysis technology as that used in global search engines. However, their performance is problematic. Some current site-specific search engines fail to deliver all the relevant content and instead return too much irrelevant content to meet the user's information needs. Furthermore, little benefit is obtained from the use of link-based methods.
One problem with using link analysis for local searches is that the link structure of a small web is different from that of the global Web. As explained in detail below, for the global Web, existing link analysis uses explicit links to a certain site to determine the ranking of the site, on the assumption that each link is a recommendation of the site it points to. While this recommendation assumption is generally correct for the Web, it is commonly invalid for a web site or intranet. In general, this is because there are relatively few explicit links, and the links are created by a small number of authors whose purpose is to organize the contents into a hierarchical structure. Thus, in general, the authority of pages is not captured correctly by link analysis.
Since direct application of link analysis in local searching is impractical, some systems focus on usage information. For example, DirectHit (http://www.directhit.com) harnesses millions of human decisions by millions of daily Internet searchers to provide more relevant and better organized search results. DirectHit's site ranking system, which is based on the concepts of “click popularity” and “stickiness,” is currently used by Lycos, Hotbot, MSN, Infospace, About.com and several other search engines. The underlying assumption is that the more a web page is visited, the higher it is ranked for particular queries. These usage-based search engines, however, have limitations. In particular, one problem is that the technique requires large amounts of user logs and only works for some popular queries. Another problem is that it is easy to fall into a quick positive feedback loop, in which access to a popular resource leads to its higher rank. This in turn leads to an even higher number of accesses to it.
There are also some techniques that operate by combining usage data with link analysis. One such technique utilizes usage data to modify the adjacency matrix in the HITS technique. Namely, the adjacency matrix M is replaced with a link matrix M′, which assigns the weight between nodes (pages) based on users' usage data collected from web-server logs.
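One way such a usage-weighted link matrix might be built is sketched below. The text does not state how the weights are derived, so everything specific here is an assumption made for illustration: the function name `usage_link_weights`, the representation of the log as lists of pages in the order they were requested, and the choice of the traversal count as the weight.

```python
from collections import Counter


def usage_link_weights(links, click_paths):
    """Weight each hyperlink by how often logged visitors traversed it.

    links: iterable of (src, dst) hyperlinks between pages.
    click_paths: iterable of page sequences taken from the server log
        (hypothetical format: pages in the order they were requested).
    Returns a dict mapping each hyperlink to its traversal count.
    """
    links = set(links)
    counts = Counter()
    for path in click_paths:
        # Only consecutive requests are compared, so a weight is credited
        # only to a link between two adjacent pages in the log.
        for src, dst in zip(path, path[1:]):
            if (src, dst) in links:
                counts[(src, dst)] += 1
    # Links that were never traversed keep a weight of zero.
    return {link: counts[link] for link in links}


# Hypothetical site structure and log excerpts.
site_links = [("home", "products"), ("home", "about"), ("products", "specs")]
logged_paths = [["home", "products", "specs"], ["home", "products"], ["about", "home"]]
weights = usage_link_weights(site_links, logged_paths)
```

The resulting weights could then take the place of the 0/1 entries of the adjacency matrix when the hub and authority scores are updated.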
One problem with this method, however, is that it does not separate the user logs into sessions based on the users' tasks. This makes the technique vulnerable to noisy data that will inevitably be introduced into the link matrix. Another problem is that Web users often follow different paths to reach the same goal. If only adjacent pages are treated as related, the underlying relationship will not be discovered.