The World Wide Web (WWW) is comprised of an expansive network of interconnected computers upon which businesses, governments, groups, and individuals throughout the world maintain inter-linked computer files known as web pages. Users navigate these web pages by means of computer software programs commonly known as Internet browsers. Due to the vast number of WWW sites, many web pages have a redundancy of information or share a strong likeness in either function or title. The vastness of the unstructured WWW causes users to rely primarily on Internet search engines to retrieve information or to locate businesses. These search engines use various means to determine the relevance of a user-defined search to the information retrieved.
The authors of web pages provide information known as metadata within the body of the document that defines the web pages. This document is typically written in a markup language such as hypertext markup language (HTML). A computer software product known as a web crawler systematically accesses web pages by sequentially following hypertext links (hyperlinks) from web page to web page. The crawler indexes the web pages for use by the search engines using information about a web page as provided by its address or Uniform Resource Locator (URL), metadata, and other criteria found within the web page. The crawler is run periodically to update previously stored data and to append information about newly created web pages. The information compiled by the crawler is stored in a metadata repository or database. The search engines search this repository to identify matches for the user-defined search rather than attempting to find matches in real time.
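The crawl-then-index flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular crawler: the in-memory `TOY_WEB` dictionary stands in for real HTTP fetching and HTML parsing, and the URLs, page texts, and the names `crawl` and `search` are all hypothetical.

```python
from collections import deque

# Hypothetical toy web: URL -> (page text, outgoing hyperlinks).
# A real crawler would fetch pages over HTTP and parse the HTML instead.
TOY_WEB = {
    "http://example.com/": (
        "home page about databases",
        ["http://example.com/a", "http://example.com/b"],
    ),
    "http://example.com/a": ("db2 installation guide", ["http://example.com/b"]),
    "http://example.com/b": ("db2 release notes", ["http://example.com/"]),
}


def crawl(seed, web):
    """Follow hyperlinks breadth-first from seed and build a term -> URLs index."""
    index = {}
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        text, links = web[url]
        # Index every whitespace-separated term found on the page.
        for term in text.split():
            index.setdefault(term, set()).add(url)
        # Enqueue each linked page that exists and has not been visited yet.
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def search(index, term):
    """Answer a query from the stored index rather than by visiting pages."""
    return sorted(index.get(term, set()))


INDEX = crawl("http://example.com/", TOY_WEB)
print(search(INDEX, "db2"))
```

The query is answered entirely from `INDEX`, which mirrors the point that search engines consult a precomputed repository instead of searching pages in real time.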
A typical search engine has an interface with a search window where the user enters an alphanumeric search expression or keywords. The search engine sifts through available web sites for the search terms, and returns the search results in the form of web pages in, for example, HTML. Each search result comprises a list of individual entries that have been identified by the search engine as satisfying the search expression. Each entry or “hit” comprises a hyperlink that points to a Uniform Resource Locator (URL) or web page.
An exemplary search engine is the Google® search engine. An important aspect of the Google® search engine is the ability to rank web pages according to the authority of the web pages with respect to a search. The ranking technique used by the Google® search engine is the PageRank algorithm. Reference is made to L. Page et al., “The PageRank citation ranking: Bringing order to the web”, Technical report, Stanford Digital Library Technologies Project, 1998. Paper SIDL-WP-1999-0120. The PageRank algorithm calculates a stationary distribution of a Markov chain induced by hyperlink connectivity on the WWW. The same technique applies to intranets or other subsets of the WWW. Although the PageRank algorithm has proven to be useful, it would be desirable to present additional improvements.
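The idea of a stationary distribution over the hyperlink graph can be illustrated with a small power-iteration sketch. It is a simplified illustration, not the production algorithm: the damping factor of 0.85, the even redistribution of rank from pages without outgoing links, the fixed iteration count, and the four-page graph `LINKS` are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate the stationary distribution of a random surfer by power iteration.

    links maps each page to the list of pages it links to; every link target
    must itself be a key of links.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links[page]
            if targets:
                # Otherwise the surfer follows one outgoing link at random.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
RANKS = pagerank(LINKS)
for page in sorted(RANKS, key=RANKS.get, reverse=True):
    print(page, round(RANKS[page], 4))
```

In this graph, page C is linked to by three of the four pages and ends up with the largest share of rank, while page D, which no page links to, ends up with the smallest. The scores depend only on the link structure, not on any query.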
Search engines typically face the problem of having too many results that contain the query terms. For example, the query term “db2” appears in over 180,000 different URLs on one company intranet. The problem of indexing a large corpus of text such as an intranet or the World Wide Web becomes one of ranking the many results by their importance and relevance to the query, so that the user need not peruse all of the results to satisfy an informational need.
Many different features can be used to determine the relevance or authority of a document to a given query. In the case of the World Wide Web, the most successful techniques (as exemplified by Google) are a combination of indexing the content, indexing the anchortext, and using PageRank to provide a static ordering of authority. Many techniques have been suggested for producing good results to queries, including considering the indegree in the weblink graph, TF*IDF and lexical affinity scoring techniques, and heavier weighting for terms that appear in titles or larger fonts. Some of these ranking techniques (e.g., ranking by frequency of terms in anchortext) are query-dependent, and can only be computed in response to a query. Others (e.g., PageRank) are static, and do not depend on the query that has been submitted.
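As an illustration of a query-dependent feature, the sketch below scores documents with one common TF*IDF variant: term frequency normalized by document length, multiplied by the natural logarithm of the inverse document frequency. Many other weightings exist, so this particular formula is an assumption, as are the three sample documents and the function name `tf_idf_scores`.

```python
import math
from collections import Counter


def tf_idf_scores(query, documents):
    """Score each document against the query with a simple TF*IDF variant."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: number of documents containing the term.
            df = sum(1 for other in tokenized.values() if term in other)
            if df == 0 or not tokens:
                continue
            tf = counts[term] / len(tokens)
            idf = math.log(n_docs / df)
            score += tf * idf
        scores[doc_id] = score
    return scores


# Hypothetical mini-corpus.
DOCS = {
    "d1": "db2 install guide for db2 administrators",
    "d2": "travel expense guide",
    "d3": "db2 release notes",
}
SCORES = tf_idf_scores("db2 guide", DOCS)
for doc_id in sorted(SCORES, key=SCORES.get, reverse=True):
    print(doc_id, round(SCORES[doc_id], 4))
```

Because the score depends on the query terms, it cannot be precomputed once per document in the way a static feature such as PageRank or indegree can.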
There is a conflict between the desire to have a good searchable intranet and the inherent diversification of the way that information is presented using web technology. In many ways, this conflict mirrors the tensions that exist on the Internet. People want their Internet pages to be seen, and Internet implementers want their information to be discoverable. At the same time, myriad other factors such as social forces, technology limitations, and a lack of understanding of search by web developers can lead to decisions that conflict with good search results.
Intranet search is different from Internet search for several reasons: the queries asked on the intranet are different, the notion of a “good answer” is different, and the social processes that create the intranet are different from those that create the Internet. Queries on an intranet tend to be jargon-heavy and use various acronyms and abbreviations that reflect the structure of the organization or company that uses that intranet. In addition, the correct answer to a query is often specific to a site, geographic location, or an organizational division, but the user often does not make this intent explicit in the query. Context-sensitive search is a common problem for many intranets and the Internet.
A great deal of work has been done over the years to assess the effectiveness of different search techniques, but their effectiveness tends to be a function of the underlying corpus being searched and the characterization of the queries and users that are accessing the data. Each intranet is an island unto itself, reflecting the character of the organization that it represents. For this reason, what works well for the Internet may not work well for an intranet, and what works for one intranet may not work well for another. Part of this is derived from the nature of the organization. In a university intranet, desirable searching features may comprise free speech and diversity of opinion. In a corporation, desirable searching features may comprise hierarchical distribution of authority and focus upon the mission. Consequently, ranking functions need to reflect the particular value system of the organization whose data is being indexed. This suggests that customization is an important feature of an intranet search engine.
Within an organization, employees tend to fulfill a role that is consistent with their job description. Thus an employee of the marketing department may have a different need than an employee of a research division, and a lawyer may have a different need than a programmer. This suggests that ranking methods for search engines should provide personalization of ranking functions.
A simple approach to mixing different features is to apply numerical weights to the features and use a mixing function to combine the weighted feature scores into a single score for ranking documents. However, the scales and distributions of scores from different features can be incomparable, and it is difficult to arrive at an optimal mixing function. One previously suggested approach to this problem uses Bayesian probabilistic models for retrieval, treating the different scores given to documents as probabilities and merging them according to a probabilistic model. Reference is made to W. Croft, “Combining approaches to information retrieval”, in Advances in Information Retrieval, Kluwer Academic Publishers, 2000; D. Hiemstra, “Using Language Models for Information Retrieval”, PhD thesis, University of Twente, Twente, The Netherlands, 2001; W. Kraaij et al., “The importance of prior probabilities for entry page search”, in Proc. 25th SIGIR, pages 27-34, 2002; T. Westerveld et al., “Retrieving web pages using content, links, URLs and anchors”, in Proc. 10th TREC, pages 663-672, 2001. Although this approach has proven to be useful, it would be desirable to present additional improvements.
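The difficulty with incomparable scales can be seen in a small sketch of a weighted linear mixing function. The feature values, the weights, and the use of min-max normalization are hypothetical choices made only to illustrate the problem; normalization here is neither the Bayesian approach cited above nor a recommended remedy.

```python
def mix(feature_scores, weights):
    """Combine per-feature scores into one score per document by a weighted sum."""
    docs = next(iter(feature_scores.values())).keys()
    return {
        doc: sum(weights[f] * feature_scores[f][doc] for f in feature_scores)
        for doc in docs
    }


def ranking(scores):
    """Order documents by descending score, breaking ties by document name."""
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


def min_max(scores):
    """Rescale scores to the range [0, 1] (an illustrative choice, not a fix)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {doc: (s - lo) / span if span else 0.0 for doc, s in scores.items()}


# Hypothetical features on very different scales.
FEATURES = {
    "indegree": {"p1": 1200.0, "p2": 40.0, "p3": 900.0},
    "tfidf": {"p1": 0.02, "p2": 0.35, "p3": 0.05},
}
WEIGHTS = {"indegree": 0.4, "tfidf": 0.6}

RAW = ranking(mix(FEATURES, WEIGHTS))
NORMALIZED = ranking(mix({f: min_max(s) for f, s in FEATURES.items()}, WEIGHTS))
print("raw:", RAW)
print("normalized:", NORMALIZED)
```

With raw scores, the indegree feature dominates and the mixed ranking simply reproduces the indegree ordering, even though the TF*IDF feature carries the larger weight. After rescaling, the ordering changes, which shows how sensitive a weighted sum is to the scale and distribution of its inputs.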
What is therefore needed is a system, a service, a computer program product, and an associated method for ranking scales and distributions of scores from different ranking systems based on different ranking features. The solution should be customizable to meet the needs and characteristics of a specific network, intranet, or client. The need for such a solution has heretofore remained unsatisfied.