The present invention relates generally to network and Internet search systems and more particularly to search systems that provide enhanced search functionality for ranking and enhancements based on user personalization.
With the advent of the Internet and the multitude of web pages and media content available to a user over the World Wide Web (web), a need has arisen to provide users with streamlined approaches to filter and obtain desired information from the web. Search systems and processes have been developed to meet the needs of users to obtain desired information. Examples of such technologies can be accessed through Yahoo!, Google and other sites. Typically, a user inputs a query and a search process returns one or more links (in the case of searching the web), documents, and/or references (in the case of a different search corpus) related to the query. The links returned may be closely related, or they may be completely unrelated, to what the user was actually looking for. The relevance of results to the query may be in part a function of the actual query entered as well as the robustness of the search system (underlying collection system) used. Relevance might be subjectively determined by a user or objectively determined by what a user might have been looking for.
Queries that users enter are typically made up of one or more words. For example, “hawaii” is a query, so is “new york city”, and so is “new york city law enforcement”. As such, queries as a whole are not integral to the human brain. In other words, human beings do not naturally think in terms of queries. They are an artificial construct imposed, in part, by the need to query search engines or look up library catalogs. Human beings do not naturally think in terms of just single words either. What human beings think in terms of are natural concepts. For example, “hawaii” and “new york city” are vastly different queries in terms of length as measured by number of words but for a human being they share one important characteristic: they are each made up of one concept. In contrast, a person regards the query “new york city law enforcement” as fundamentally different because it is made up of two distinct concepts: “new york city” and “law enforcement”.
Human beings also think in terms of logical relationships between concepts. For example, “law enforcement” and “police” are related concepts since the police are an important agency of law enforcement; a user who types in one of these concepts may be interested in sites related to the other concept even if those sites do not contain the particular word or phrase the user happened to type. As a result of such thinking patterns, human beings by nature build queries by entering one or more natural concepts, not simply a variably long sequence of single words, and the query generally does not include all of the related concepts that the user might be aware of. Also, the user intent is not necessarily reflected in individual words of the query. For instance, “law enforcement” is one concept, while the separate words “law” and “enforcement” do not individually convey the same user intent as the words combined.
Current technologies at any of the major search providers, e.g., MSN, Google or any other major search engine site, do not understand queries the same way that human beings create them. For instance, existing search engines generally search for the exact words or phrases the user entered, not for the underlying natural concepts or related concepts the user actually had in mind. This is perhaps the most important factor preventing search providers from identifying a user's intent and providing optimal search results and content.
A search might proceed as follows: a searcher presents a query (e.g., “new york police”) to a search engine and the search engine returns a set of hits (e.g., results, pages, documents, items, etc.) that contain terms of the query (or otherwise “match” the query). The matching process involves (a) extracting as complete a set of matching hits as possible and (b) ranking the hits, i.e., presenting only the most relevant hits of the extracted set (as the whole set can be very large and therefore unsuitable for presentation).
Where the search results comprise a small number of items, all of the items can be presented to the user in any order, with each order considered as relevant as any other. However, where the search results initially comprise a large number of pages, ranking, filtering and other prioritization might be called for in order that the top (highest) ranked pages be more relevant to the user intent than those that have low rank. In a specific implementation of such search results processing, pages are ranked and presented to the user in rank order from highest ranked to lowest ranked, with a cutoff after a certain number of hits or below a certain rank value.
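The rank-ordered presentation with a cutoff described above might be sketched as follows. This is a minimal illustration only: the function name `top_hits`, the `(page, rank_value)` tuple layout, and the parameters `max_hits` and `min_rank` are hypothetical and are not taken from any particular search system.

```python
def top_hits(hits, max_hits=10, min_rank=0.0):
    """Return hits in descending rank order, applying both cutoffs.

    hits is a list of (page, rank_value) tuples (hypothetical layout).
    """
    # Sort from highest ranked to lowest ranked.
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    # Drop hits below the rank-value cutoff, then keep at most max_hits.
    return [hit for hit in ranked if hit[1] >= min_rank][:max_hits]


example_hits = [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.05)]
presented = top_hits(example_hits, max_hits=2, min_rank=0.1)
```

Here `presented` holds the two highest-ranked pages whose rank value is at least `min_rank`.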
Some methods exist for the ranking process, but often this is a computation-intensive process. Some approaches assign ranking values to each hit and sort the hits by ranking value. Even within this subset of approaches, different methods of assigning ranking values have been proposed. One approach, wherein each hit comprises a piece of content such as a Web page, is to develop “authority” values for pages, wherein a page's authority value reflects a calculated authoritativeness of the page.
With the authority values in hand, a search engine can optimize search results by ranking hits comprising the search results to better match top pages to likely user intent, e.g., relevancy. In general, a search begins with a search input such as a query string, a URL, search fields, etc., possibly also including context and/or preferences. In response to a user's search input, a search server returns search results comprising items located within the search corpus deemed to be suitable search results given the user intent for the search inferred from the search input.
Authority values for a page might be determined based on the authorities of other pages that point to that page. Pointing is often done using hyperlinks. Thus, if a highly authoritative page includes a hyperlink to a second page, that second page will increase in authority as a result. Computation of authority values using information contained in hyperlinks that connect Web pages to other pages is described in U.S. Pat. No. 6,285,999.
With authority value ranking, the ranking is determined by the pages and their links. In network terminology, these are the nodes and edges, respectively. Where a collection of items can be represented by a graph, as a collection of hyperlinked pages can, an authority vector might represent the set of authority values for the vertices of the graph.
One such type of authority vector is the page ranking vector (“PRV” herein), which is defined over a directed graph, W, of web pages such that a vector component PRV(p) represents the authority induced on a web page p by hyperlink information.
A typical PRV computation process (“PRV process”) is an iterative process wherein the authority of each page might be uniformly transferred along its out links, such that the authority of a page might be equal to the sum of the authority transferred to it by the pages that point to it. In other words, the PRV process uses a distribution of authority weight balanced with respect to link transitions. Mathematically, this is a stationary point of a stochastic transition matrix. Let $E=E(W)$ be an edge indicator, or adjacency matrix, for a graph $W$, wherein $E_{ij}=1$ if there is a link $i \to j$ from page $i$ to page $j$ and $E_{ij}=0$ if there is not a link. Where $n$ pages are being considered, $\dim(E)=n \times n$ and $n=|W|$. The stochastic transition matrix $P$ is defined as shown in Equation 1, where $\deg(i)$ is the “out degree” of a node $i$ (in the case of Web pages, this is the number of hyperlinks in the page at node $i$). Given an authority vector $p=(p_1, p_2, \ldots)$, a transformed vector $p'=(p'_1, p'_2, \ldots)$ can be defined as the result of the vector-matrix multiplication shown in Equation 2.

$$P_{ij} = E_{ij} / \deg(i) \qquad \text{(Equ. 1)}$$

$$p'_j = \sum_{i \to j} p_i / \deg(i) = \sum_{i} p_i P_{ij} \qquad \text{(Equ. 2)}$$
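Equations 1 and 2 might be illustrated with the following minimal Python sketch. The function names `transition_matrix` and `propagate` and the three-page example graph are assumptions made for illustration; dangling pages are deliberately rejected here because Equation 1 is undefined for them.

```python
def transition_matrix(E):
    """Row-normalize an adjacency matrix: P[i][j] = E[i][j] / deg(i) (Equ. 1)."""
    P = []
    for row in E:
        deg = sum(row)  # out degree of this page
        if deg == 0:
            raise ValueError("dangling page: deg(i) = 0, Equ. 1 is undefined")
        P.append([e / deg for e in row])
    return P


def propagate(p, P):
    """Apply Equ. 2: p'_j = sum over i of p_i * P[i][j]."""
    n = len(p)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical three-page graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
E = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
P = transition_matrix(E)
p_next = propagate([1 / 3, 1 / 3, 1 / 3], P)
```

Because every row of `P` sums to one, the transformation preserves the total authority: `p_next` still sums to one, with page 2 collecting half of it after one step.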
In the PRV process, a PRV authority vector is a probability distribution over $W$ that is a fixed point of $P$. This means an authority vector is balanced: it is invariant under the transformation shown by Equation 2. Such an authority vector $p$ is a solution of the eigensystem shown in Equation 3.

$$p = P^T \cdot p \qquad \text{(Equ. 3)}$$

Under the conditions of strict connectivity and aperiodicity of the graph $W$, the Perron-Frobenius theorem guarantees that the simple power iteration process shown in Equation 4 converges to an eigenvector $p$ of Equation 3 corresponding to a simple principal eigenvalue of a matrix $P$. Since the matrix is stochastic (i.e., its rows sum to one), eigenvector $p$ corresponds to a unit eigenvalue found by the simple power iterative method illustrated in Equation 4.

$$p^{(k+1)} = P^T p^{(k)} \qquad \text{(Equ. 4)}$$
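The power iteration of Equation 4 might be sketched as below. The function name `power_iteration`, the tolerance, the iteration cap, and the example matrix are assumptions chosen for illustration; the example graph is connected and aperiodic, so the iteration converges.

```python
def power_iteration(P, tol=1e-12, max_iter=1000):
    """Iterate p(k+1) = P^T p(k) (Equ. 4) from a uniform start.

    P is a row-stochastic matrix given as a list of rows.
    """
    n = len(P)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        # (P^T p)_j = sum over i of p_i * P[i][j]
        p_next = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]
        # Stop when successive iterates agree in L1 norm.
        if sum(abs(a - b) for a, b in zip(p_next, p)) < tol:
            return p_next
        p = p_next
    return p


# Transition matrix of the hypothetical graph 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
prv = power_iteration(P)
```

For this small graph the fixed point can be checked by hand against Equation 3: the balanced authorities are 0.4, 0.2 and 0.4.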
Dangling pages (defined as pages with $\deg(i)=0$) present a clear problem for the definition in Equation 2, as a dangling page will result in a zero denominator in that equation. Matrix $P$ is sometimes modified as shown in Equation 5, where $d_i=1$ if page $i$ is a dangling page and $d_i=0$ otherwise, and where $v$ is some probability distribution.

$$P' = P + d \cdot v^T \qquad \text{(Equ. 5)}$$
Vector v is interpreted as teleportation: instead of propagation along the out links (there are none), authority is instantaneously transported to all pages in proportion defined by v.
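The dangling page adjustment of Equation 5 might be sketched as follows, under the assumption that the row of $P$ for a dangling page is taken to be all zeros so that adding $d \cdot v^T$ simply replaces that row with $v$. The function name `patched_transition_matrix` and the example graph are hypothetical.

```python
def patched_transition_matrix(E, v):
    """Build P' = P + d * v^T (Equ. 5) from an adjacency matrix E.

    Rows of non-dangling pages are E[i][j] / deg(i); the row of a
    dangling page (deg(i) = 0) becomes the teleportation vector v.
    """
    P_prime = []
    for row in E:
        deg = sum(row)
        if deg == 0:
            P_prime.append(list(v))  # d_i = 1: authority teleports per v
        else:
            P_prime.append([e / deg for e in row])  # d_i = 0: ordinary row
    return P_prime


# Hypothetical graph 0 -> 1, 1 -> 2, with page 2 dangling.
E = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]
v = [1 / 3, 1 / 3, 1 / 3]
P_prime = patched_transition_matrix(E, v)
```

After the adjustment every row of `P_prime` sums to one, so the matrix is stochastic again.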
While the condition of aperiodicity is guaranteed for a web graph $W$, the condition of strict connectivity is routinely violated. To achieve strict connectivity, the dangling page adjustment can be generalized by adding some degree of teleportation to all the pages as illustrated by Equation 6. Coefficient $c$ is usually picked around 0.85-0.9. If teleportation vector $v=(1/n, \ldots, 1/n)$ is uniform, strict connectivity is guaranteed.

$$P'' = cP' + (1-c)E, \quad E_{ij} = v_j, \quad E = (1)_{n \times 1} \cdot v^T, \quad 0 < c < 1 \qquad \text{(Equ. 6)}$$
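Equation 6 might be illustrated with the short sketch below. The function name `teleporting_matrix` and the example values are assumptions; the input is taken to be a matrix that has already been patched for dangling pages as in Equation 5.

```python
def teleporting_matrix(P_prime, v, c=0.85):
    """Build P'' = c * P' + (1 - c) * E with E[i][j] = v[j] (Equ. 6)."""
    n = len(P_prime)
    return [[c * P_prime[i][j] + (1.0 - c) * v[j] for j in range(n)]
            for i in range(n)]


# P' for the hypothetical graph 0 -> 1, 1 -> 2, page 2 dangling (row = v).
v = [1 / 3, 1 / 3, 1 / 3]
P_prime = [[0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0],
           v]
P_double_prime = teleporting_matrix(P_prime, v)
```

With a uniform `v` every entry of `P_double_prime` is strictly positive, so every page can be reached from every other page in one step, which is why strict connectivity follows. The cost is that the matrix is now dense.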
PRV processes frequently assume a “random surfer” model of a surfer browsing along the Web who browses to a page and then, with probability c, uniformly randomly follows one of the out links on that page or with probability (1−c) teleports according to distribution v to a different page.
If $N(i,t)$ is the number of times a random surfer visits page $i$ over time $t$, then according to the Ergodic theorem, the equation $\lim_{t \to \infty} N(i,t)/t = p_i$ is satisfied. This establishes a connection of the random surfer model with Equation 3 defining a PRV as an eigenvector of a modified transition matrix $P''$ and with the intuitive requirement of balanced authority.
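The random surfer model can be simulated directly to observe the visit frequencies approaching the PRV. The sketch below is illustrative only: the function name `simulate_surfer`, the step count, the fixed seed, and the example graph are assumptions, and a dangling page is assumed to always teleport, consistent with Equation 5.

```python
import random


def simulate_surfer(out_links, v, c=0.85, steps=200_000, seed=0):
    """Estimate N(i, t) / t for a random surfer over `steps` page visits.

    out_links[i] lists the pages that page i links to.
    """
    rng = random.Random(seed)
    n = len(out_links)
    pages = list(range(n))
    visits = [0] * n
    page = rng.choices(pages, weights=v)[0]
    for _ in range(steps):
        visits[page] += 1
        if out_links[page] and rng.random() < c:
            # With probability c, follow a uniformly chosen out link.
            page = rng.choice(out_links[page])
        else:
            # Otherwise (or on a dangling page) teleport according to v.
            page = rng.choices(pages, weights=v)[0]
    return [count / steps for count in visits]


# Hypothetical graph 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0 with uniform teleportation.
out_links = [[1, 2], [2], [0]]
freq = simulate_surfer(out_links, [1 / 3, 1 / 3, 1 / 3])
```

For this graph with c = 0.85, solving the balance equations by hand gives authorities of roughly 0.388, 0.215 and 0.397, and the simulated frequencies should land close to those values, with some sampling noise.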
The generalization of an original transition matrix $P$ to $P''$ defined by Equation 6 is useful beyond the purely technical reason of achieving strict connectivity. For example, if instead of a uniform teleportation $v$, a distribution that reflects certain preferences is used (such as topical preferences), this leads to a more specific ranking of search results. While usage of non-uniform teleportation $v$ is known, computing ranking for such teleportation was not easy. Teleportation vectors might be concentrated in a single page $h$ as illustrated in Equation 7.

$$v = \delta(h) = \{\delta_{ih}\} \qquad \text{(Equ. 7)}$$
In the vector-matrix multiplication of Equation 2, an original transition matrix $P$ is sparse, but the modified matrix $P''$ is no longer sparse. This can be easily fixed by using the original matrix $P$ alone and keeping track of a residual term $\|p\| - \|P^T \cdot p\|$ in $L_1$ norm.
Equation 3 expresses an eigensystem for the basic matrix, while Equation 8 expresses an eigensystem for matrix $P''$.

$$p = cP^T \cdot p + (1-c)E^T \cdot p \qquad \text{(Equ. 8)}$$
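One common way to realize the residual-term idea is sketched below; it is an illustrative reading rather than the only possible one. Authority is propagated along the existing links only, scaled by $c$, and the authority mass lost in that step (through teleportation and dangling pages) is measured in $L_1$ norm and redistributed according to $v$. The function names `sparse_step` and `sparse_prv` and the example graph are assumptions, and each page's link list is assumed to contain no duplicates.

```python
def sparse_step(p, out_links, v, c=0.85):
    """One multiplication by P''^T that touches only the existing links."""
    n = len(p)
    y = [0.0] * n
    for i, links in enumerate(out_links):
        if links:  # dangling pages contribute nothing to this sparse part
            share = c * p[i] / len(links)
            for j in links:
                y[j] += share
    # Residual: authority mass not carried along links (entries are >= 0,
    # so plain sums equal L1 norms).
    residual = sum(p) - sum(y)
    return [y[j] + residual * v[j] for j in range(n)]


def sparse_prv(out_links, v, c=0.85, tol=1e-12, max_iter=1000):
    """Power iteration built on sparse_step, started from v."""
    p = list(v)
    for _ in range(max_iter):
        p_next = sparse_step(p, out_links, v, c)
        if sum(abs(a - b) for a, b in zip(p_next, p)) < tol:
            return p_next
        p = p_next
    return p


# Hypothetical graph 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 2 -> 3; page 3 dangling.
out_links = [[1, 2], [2], [0, 3], []]
v = [0.25, 0.25, 0.25, 0.25]
p = sparse_prv(out_links, v)
```

The work per iteration is proportional to the number of links rather than to the square of the number of pages, while the result matches what multiplying by the dense matrix would give.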
Different methods to accelerate the simple power iteration process shown in Equation 4 have been suggested, including extrapolation methods based on a striking result concerning second eigenvalue and block-structure methods. Typically each of these is an iterative method that engages in a kind of iterative approximation to p starting from some initial guess.
Ideally, iterative processes should converge. Different ways to estimate convergence of the iterative process exist (e.g., $L_1$ norm or Kendall's $\tau$). Since $E^T \cdot p = v$ when $\|p\| = \|p\|_1 = 1$, $p \geq 0$, the eigensystem illustrated by Equation 8 can be cast as a linear system described by Equation 9.

$$p = cP^T \cdot p + (1-c)v \qquad \text{(Equ. 9)}$$
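Because Equation 9 is a linear system, $(I - cP^T) \cdot p = (1-c)v$, a small instance can also be solved directly rather than iteratively. The sketch below uses plain Gaussian elimination purely for illustration; the function names and example matrix are assumptions, $P$ is assumed to be row-stochastic with no dangling pages, and a direct solve of this kind is not practical at web scale.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def prv_from_linear_system(P, v, c=0.85):
    """Solve (I - c P^T) p = (1 - c) v, i.e. Equ. 9 rearranged."""
    n = len(P)
    A = [[(1.0 if i == j else 0.0) - c * P[j][i] for j in range(n)]
         for i in range(n)]
    b = [(1.0 - c) * v_i for v_i in v]
    return solve_linear_system(A, b)


# Transition matrix of the hypothetical graph 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
v = [1 / 3, 1 / 3, 1 / 3]
p = prv_from_linear_system(P, v)
```

The solution sums to one without any explicit normalization, since the rows of `P` and the entries of `v` each sum to one.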
A block-structure methodology can be extended to personalization by assigning some preferences to blocks. More practically, tractable topic-sensitive personalization is suggested in Haveliwala, T. H. Topic-sensitive PageRank, Proc. of the Eleventh International World Wide Web Conference (2002). Some information retrieval methodology is required to establish the link between a query and each of the topics. As a result, this approach is effectively limited to a few hundred precomputed topical PRVs and does not scale well.
Jeh, G., Widom, J., Scaling Personalized Web Search, Technical Report, Computer Science Department, Stanford University (2002) (“Jeh and Widom”) proposed personalization based on page-specific PRVs. Correspondingly, user bookmarks with suitably configured weights naturally induce personalization. Jeh and Widom showed how a small portion of basis PRVs corresponding to hub pages H (important selected pages) facilitates computing of a general PRV at query time. Basis hub PRVs can be compressed (encoded). A so-called Hub skeleton, a relatively small data structure, is instrumental in their decoding. The developed theory is based on technical apparatus related to inverse P-distance and its modifications.
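What makes page-specific basis PRVs combinable is that the solution of Equation 9 depends linearly on the teleportation vector $v$. The sketch below checks only that linearity on a toy graph; it does not implement hub skeletons, encoding, or inverse P-distance. The function names, the bookmark weights, and the example graph are hypothetical, and the graph is assumed to have no dangling pages (a dangling page adjustment that itself depends on $v$ would break the linearity).

```python
def personalized_prv(out_links, v, c=0.85, tol=1e-13, max_iter=2000):
    """Iterate p <- c P^T p + (1 - c) v (Equ. 9) until the L1 change is tiny."""
    p = list(v)
    for _ in range(max_iter):
        p_next = [(1.0 - c) * v_j for v_j in v]
        for i, links in enumerate(out_links):
            share = c * p[i] / len(links)  # assumes every page has out links
            for j in links:
                p_next[j] += share
        if sum(abs(a - b) for a, b in zip(p_next, p)) < tol:
            return p_next
        p = p_next
    return p


def delta(h, n):
    """Teleportation concentrated on a single page h (Equ. 7)."""
    return [1.0 if i == h else 0.0 for i in range(n)]


# Hypothetical graph 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
out_links = [[1, 2], [2], [0]]
n = len(out_links)

# One basis PRV per page, each computed with v = delta(h).
basis = [personalized_prv(out_links, delta(h, n)) for h in range(n)]

# Hypothetical bookmark weights over pages 0 and 2.
weights = [0.7, 0.0, 0.3]
combined = [sum(weights[h] * basis[h][j] for h in range(n)) for j in range(n)]

# The same personalization computed directly with v = weights.
direct = personalized_prv(out_links, weights)
```

The weighted sum of basis vectors in `combined` agrees with `direct` up to the iteration tolerance, so a personalized PRV can be assembled at query time from vectors computed in advance.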
The random surfer model is not the only model for studying ordering of search results. Kleinberg [Kleinberg, J., Authoritative sources in a hyperlinked environment, Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (1998)] introduced a framework similar to page rank analysis that utilized a small query-specific subgraph of W. The corresponding development resulted in the HITS algorithm [see, for example, David Gibson, Jon Kleinberg, Prabhakar Raghavan, Inferring Web Communities from Link Topology, Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, 1998] and its variations [see, for example, S. Chakrabarti, B. E. Dom, R. K. David Gibson, P. Raghavan, S. Rajagopalan, and A. Tomkins. Spectral filtering for resource discovery. In Conference on Research and Development in IR (SIGIR'98), Melbourne, Australia, 1998].
While a number of these approaches might be useful in some ordering tasks or for some number of users, they are limited in some ways, are not scalable in some situations, require excessive computing power, are not specific enough, or have other shortcomings. Thus, there is a need for improved search systems that can improve upon the search experience in providing search results to a querier.