The World Wide Web (WWW) is a connected collection of computers that offers users spread world-wide the possibility of extracting information in response to queries. The process of information extraction includes accessing information which is stored, in the format of Web Pages, in a class of computer systems called Data Sources; in order to locate such Web Pages, users rely on another class of computer systems, called Search Engines, whose specific ability is to extract those Web Pages which are most likely to contain information relevant to the query. Given the huge number of Web Pages and the breadth of information available on the World Wide Web, the spread and success of Search Engines have grown steadily over the last decade.
A typical search engine offers an interface where users enter a query expression consisting of keywords, sometimes interconnected by logical operators. The search engine uses pre-calculated index structures in order to produce the results that best match the query expression, and presents those results in the form of a ranked list of elements. Every element is usually further characterized by a hyperlink pointing to a Web Page and by additional descriptions of the content of the Web Page. Elements are ranked by the search engine with an element score and presented on the computer interface in ranking order; the computer interface shows a given number of elements, and navigational commands on the interface allow the user to extract more elements, to change the query, or to follow one of the links associated with one of the elements.
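The role of a pre-calculated index structure can be illustrated with a minimal sketch. The inverted index below, the conjunctive (AND) interpretation of the keywords, and the use of term counts as the element score are all assumptions made for illustration; the names `build_index` and `search` are hypothetical, and real search engines use far richer structures and scores.

```python
from collections import defaultdict


def build_index(pages):
    """Pre-calculate an inverted index: keyword -> {url: occurrence count}.

    `pages` maps a URL to the plain text of the Web Page (assumed input).
    """
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index


def search(index, keywords, limit=10):
    """Return up to `limit` (url, score) elements matching ALL keywords.

    The element score is the total number of keyword occurrences,
    a deliberately naive choice used only for illustration.
    """
    postings = [index.get(word.lower(), {}) for word in keywords]
    if not postings:
        return []
    matches = set(postings[0])
    for posting in postings[1:]:
        matches &= set(posting)
    scored = [(sum(p[url] for p in postings), url) for url in matches]
    # Highest score first; ties are broken by URL for a stable ranking.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(url, score) for score, url in scored[:limit]]
```

The `limit` parameter plays the part of the given number of elements shown on the interface; requesting more elements corresponds to calling `search` with a larger limit.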
An exemplary search engine is the Google® search engine. Such a system uses a ranking technique, called the PageRank algorithm, which gives each Web Page a score; as a measure of the relevance of a Web Page, the score uses a metric related to the relevance of the other Web Pages containing hyperlinks pointing to it. Reference is made to L. Page, et al., “The PageRank citation ranking: Bringing order to the web”, technical report, Stanford Digital Library Technologies Project, 1998. Paper SIDL-WP-1999-1020. The PageRank algorithm has proven to be very effective for general-purpose search, i.e. for queries which have no associated, predefined domain of interest.
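The idea that a page's score depends on the scores of the pages linking to it can be sketched as a simple power iteration. This is a minimal illustration, not the production algorithm: the damping factor of 0.85, the fixed number of iterations, and the uniform treatment of pages without outgoing links are assumptions, and the function name `pagerank` is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    `links` maps each page to the list of pages it links to.
    Returns a dict page -> score; the scores sum to 1.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base score regardless of its in-links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page passes its score on, split evenly among its out-links.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # A page without out-links spreads its score over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank
```

For example, with `links = {"a": ["c"], "b": ["c"], "c": ["a"]}`, page `c` is pointed to by two pages and obtains the highest score, `a` inherits most of the score of `c`, and `b`, which has no incoming links, keeps only the base score.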
In addition to general-purpose search engines, a number of search engines exist which are specifically dedicated to given domains. Examples of broad domains are travel, books, and cars; examples of narrower domains are research centers within a given field or country, or hospitals in a given city. Examples of domain-specific search engines for travel are the Expedia®, EasyFly® and TravelAdvisor® search engines. Domain-specific search engines outperform general-purpose search engines for domain-specific queries, because they use specific knowledge about their domains of interest. In the case of travel, they can use information about travelling time, fares, connections, and so on in order to inform the user about the “best” travel combination, where “best” is further characterized according to the indications of the user, who may be interested in aspects such as total cost, total travel duration, desired departure and arrival times, and so on.
The background material for the method is classified into three categories, which are analyzed next. The first one is concerned with extracting the best document from a document collection where several rankings are possible; the second one concerns merging search results where each result is ranked; and the third one concerns multi-domain queries.
A vast amount of work has been performed to address the issue of extracting the best documents from a document collection upon which several rankings are available. Examples include Fagin's Algorithm (FA), as described in Ronald Fagin, “Combining Fuzzy Information from Multiple Systems”, Journal of Computer and System Sciences, 1999, Volume 58(1): 83-99; the Threshold Algorithm (TA), as described in Ronald Fagin, et al., “Optimal Aggregation Algorithms for Middleware”, IBM Research Report RJ 10205, 2000, pp. 1-40; the Quick-Combine Algorithm (QA), as described in Ulrich Guntzer et al., “Optimizing Multi-Feature Queries for Image Databases”, Proceedings of the Very Large Data Bases (VLDB) Conference, Cairo, Egypt, August 2000, pp. 419-428; and the HRJN algorithm, as described in Ihab F. Ilyas et al., “Supporting top-k join queries in relational databases”, VLDB Journal, 2004, Volume 13(3): 207-221.
FA considers a collection of elements (such as textual documents) and assumes that several distinct rankings can be used for extracting the elements from the collection. Accordingly, elements can be accessed either by a sequential access, according to one of the various rankings, or by a random access, using information which is different for every element and therefore constitutes an element identifier; the FA algorithm assumes a computer system that can support both sequential accesses and random accesses. The aforementioned reference illustrates the FA algorithm in the case where the sequential access costs are all identical and the random access costs are also all identical. Each element is associated with an overall element score, defined as a monotone aggregation function of the scores of the element in the available rankings. The purpose of the FA algorithm is to extract the “top K” elements of the collection, i.e. the K elements with maximal overall element score, while minimizing the cost of extraction, instead of reading all elements from all the ranked lists. FA starts by making sequential accesses on the rankings and stops when K elements common to all rankings have been found; it then performs additional random accesses to retrieve the missing scores of every element seen so far. Because the aggregation function is monotone, the set of elements that are accessed, either by sequential or random accesses, is guaranteed to include the “top K” elements, which can therefore be presented as the output of the FA algorithm.
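The phases of FA described above can be sketched as follows. This is a simplified illustration under stated assumptions: every element appears exactly once in every ranked list, random access is simulated with in-memory dictionaries, and the function name `fagin_top_k` and its return value are hypothetical.

```python
def fagin_top_k(ranked_lists, k, aggregate=sum):
    """Return (ids of the top-k elements, sequential depth reached).

    ranked_lists: one list per ranking, holding (element_id, score) pairs
    sorted by descending score. `aggregate` must be monotone and accept
    an iterable of scores (for example `sum` or `min`).
    """
    m = len(ranked_lists)
    # Random access is simulated with one dictionary per ranking.
    lookup = [dict(lst) for lst in ranked_lists]
    seen_in = {}  # element_id -> rankings that returned it sequentially
    depth = 0
    max_depth = max(len(lst) for lst in ranked_lists)

    # Phase 1: parallel sequential accesses until k elements
    # have been seen in every ranking.
    while depth < max_depth:
        for i, lst in enumerate(ranked_lists):
            if depth < len(lst):
                seen_in.setdefault(lst[depth][0], set()).add(i)
        depth += 1
        if sum(1 for s in seen_in.values() if len(s) == m) >= k:
            break

    # Phase 2: random accesses retrieve the missing scores
    # of every element seen so far.
    overall = {e: aggregate(lookup[i][e] for i in range(m)) for e in seen_in}

    # Phase 3: by monotonicity, the best k seen elements are the top k.
    best = sorted(overall, key=lambda e: (-overall[e], e))[:k]
    return best, depth
```

With the two rankings `[("a", 0.9), ("b", 0.8), ("c", 0.5), ("d", 0.1)]` and `[("c", 0.95), ("a", 0.7), ("d", 0.6), ("b", 0.2)]`, K = 2 and the sum as aggregation function, the sequential phase stops once `a` and `c` have been seen in both rankings.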
In TA, sequential accesses are made to each ranked list to retrieve elements and their element scores, and, for each retrieved element, a random access is made to retrieve the element score of that element in the other ranked lists so as to determine the element's overall score, which is computed via a given monotone aggregation function combining the individual scores of the element in the available rankings. Element retrieval is stopped as soon as there are K elements with an overall score that is not below a threshold, computed by applying the aggregation function to the element scores of the last seen elements in each ranked list. QA uses an idea similar to that of TA but attempts to improve the global cost by reading more elements from the less expensive rankings.
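The stopping condition of TA can be made concrete with a short sketch. It assumes that all ranked lists have the same length and contain the same elements, that random access can be simulated with dictionaries, and that the aggregation function is monotone; the name `threshold_top_k` is hypothetical.

```python
def threshold_top_k(ranked_lists, k, aggregate=sum):
    """Return (ids of the top-k elements, sequential depth reached).

    ranked_lists: equally long lists of (element_id, score) pairs,
    each sorted by descending score.
    """
    lookup = [dict(lst) for lst in ranked_lists]  # simulated random access
    overall = {}  # element_id -> overall score
    top = []
    depth = 0
    for depth in range(1, len(ranked_lists[0]) + 1):
        last_seen = []
        for lst in ranked_lists:
            element, score = lst[depth - 1]  # sequential access
            last_seen.append(score)
            if element not in overall:
                # Random accesses fetch the element's score in every ranking.
                overall[element] = aggregate(lk[element] for lk in lookup)
        # No unseen element can have an overall score above this value.
        threshold = aggregate(last_seen)
        top = sorted(overall, key=lambda e: (-overall[e], e))[:k]
        if len(top) == k and overall[top[-1]] >= threshold:
            break
    return top, depth
```

The threshold decreases as the sequential accesses go deeper, while the K-th best overall score can only increase, so the loop stops as soon as the two meet.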
HRJN is an operator that addresses the rank-join problem, i.e., the problem of computing joins in top-k queries. It is an extension of TA to the rank-join problem, in which the goal is to compute the top K combinations of elements that match on a given subset of their properties (join attributes).
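The rank-join problem can be illustrated with a simplified operator in the spirit of HRJN. The sketch below is not the published operator: it assumes two non-empty inputs of (join key, item, score) triples sorted by descending score, strictly alternating accesses, a monotone combination function, and an in-memory hash join; the name `hrjn_top_k` is hypothetical.

```python
import heapq


def hrjn_top_k(left, right, k, combine=lambda a, b: a + b):
    """Return (top-k joined combinations, number of sequential accesses).

    Each result is (left_item, right_item, combined_score).
    """
    seen_left, seen_right = {}, {}  # join_key -> [(item, score), ...]
    candidates = []  # heap of (-combined_score, left_item, right_item)
    results = []
    i = j = 0  # number of tuples read from each input
    while len(results) < k and (i < len(left) or j < len(right)):
        if j >= len(right) or (i < len(left) and i <= j):
            key, item, score = left[i]
            i += 1
            seen_left.setdefault(key, []).append((item, score))
            for other, other_score in seen_right.get(key, []):
                entry = (-combine(score, other_score), item, other)
                heapq.heappush(candidates, entry)
        else:
            key, item, score = right[j]
            j += 1
            seen_right.setdefault(key, []).append((item, score))
            for other, other_score in seen_left.get(key, []):
                entry = (-combine(other_score, score), other, item)
                heapq.heappush(candidates, entry)
        # Upper bound on the score of any combination not yet formed:
        # it must involve an unseen tuple from at least one input.
        bounds = []
        if i < len(left):
            bounds.append(combine(left[max(i - 1, 0)][2], right[0][2]))
        if j < len(right):
            bounds.append(combine(left[0][2], right[max(j - 1, 0)][2]))
        threshold = max(bounds) if bounds else float("-inf")
        # Combinations at or above the bound can safely be output.
        while candidates and len(results) < k and -candidates[0][0] >= threshold:
            neg_score, l_item, r_item = heapq.heappop(candidates)
            results.append((l_item, r_item, -neg_score))
    return results, i + j
```

As a usage example, joining flights and hotels on the destination city returns the best flight-hotel combinations without necessarily reading both inputs to the end.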
The need for modular scoring systems for merging search results in the context of document bases, of Intranets, and of the World Wide Web is addressed by U.S. Pat. No. 7,257,577 B2, August 2007. The modular scoring system merges search results into a ranked list of results using many different features of documents. The block diagram of the high-level architecture of the modular scoring system, illustrated in FIG. 2, includes scoring modules based upon the indexing of textual properties of the documents (such as content, title, and anchor text) as well as processors which use generic document properties, such as their page rank; indegree; discovery date; URL words, depth, and length; and geography. For example, in one of the proposed approaches, a rank aggregation processor uses a graph method that takes as input, for every document, its position in the ranking; the algorithm operates upon collections of edges from documents to positions, where every edge <D, P> defines the cost of ranking the document D in position P. The method uses a minimum-cost perfect matching to assign a unique position to each document, thereby building a global ranking.
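The graph method can be illustrated on a toy scale. The cited patent's cost function is not reproduced here: the sketch assumes that the cost of the edge <D, P> is the sum, over all input rankings, of the distance between P and the position of D in that ranking, and it finds the minimum-cost perfect matching by exhaustive enumeration, which is only practical for a handful of documents. The name `aggregate_by_matching` is hypothetical.

```python
from itertools import permutations


def aggregate_by_matching(rankings):
    """Build a global ranking from several rankings of the same documents.

    rankings: lists of document ids, each ordered from best to worst.
    """
    docs = sorted(rankings[0])
    position = [{d: p for p, d in enumerate(r)} for r in rankings]

    def cost(doc, pos):
        # Assumed edge cost <D, P>: total displacement of D from P.
        return sum(abs(p[doc] - pos) for p in position)

    # A perfect matching assigns each document to exactly one position,
    # i.e. it is a permutation; keep the one with minimum total cost.
    best = min(
        permutations(docs),
        key=lambda order: sum(cost(d, p) for p, d in enumerate(order)),
    )
    return list(best)
```

For realistic input sizes, the exhaustive search would be replaced by a polynomial-time assignment algorithm such as the Hungarian method.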
The problem of combining multiple ranked lists into a single ranked list is considered in the following references. In the Patent Application US 2006/0190425, August 2006, a framework for incrementally joining ranked lists, while minimizing memory constraints and disk or memory swapping costs, is presented. The framework focuses on a specific aspect of the architecture of a computer system and does not take into account the articulation of the methods and systems available on the Web. In the U.S. Pat. No. 6,728,704 B2, April 2004, a method and apparatus for merging result lists from multiple search engines is presented. The method operates on sub lists which are produced by a given query independently performed upon many search engines, and merges such sub lists into a single list by first computing the average score of the list elements, then extracting those elements from the list with highest score and simultaneously reducing the length of the list by one. The result is therefore a merged list of the sub lists extracted from every Search Engine, without modifications to individual entries of the list.
While general-purpose search engines and domain-specific search engines address the needs of many users for locating pages in the World Wide Web, they do not perform well when a user presents a multi-domain query, i.e., a query which addresses multiple domains at the same time. Such queries require information extracted from search engines and data sources relative to two or more domains, such as travel and musical events, or care centers, doctor specializations, and insurance coverage.
For these queries, which are classified as multi-domain queries, specific query management methods are designed. Reference is made to Daniele Braga, et al., “Optimization of Multi-Domain Queries on the Web”, Proceedings of the Very Large Data Bases (VLDB) Conference, Auckland, New Zealand, August 2008, pp. 562-573, where the notion of Multi-Domain Query is first introduced, and a model for the management of such queries is presented. A multi-domain query is received through a user interface and presented to Search Engines, which extract ranked lists of elements. The model presents a collection of operations for manipulating such ranked lists; the operations collectively constitute a computer program that produces an answer to the multi-domain query. The aforementioned reference also presents a collection of approximate (heuristic) methods for selecting a Query Plan for a given multi-domain query, where a query plan is a well-defined chain of requests upon selected Search Engines for answering the query. The query plan selection is based upon the association of each operation with a cost of execution. An important operation is the join of the ranked lists produced by two Search Engines; reference is made to Daniele Braga, et al., “Joining the Results of Heterogeneous Search Engines”, Information Systems, Vol. 33, Issues 7-8, November-December 2008, pp. 658-680, where several sub-methods for performing such a join are described; such sub-methods are used by the aforementioned model. The model is effective for giving a first approximation of the solution of the multi-domain query answering problem, but it does not provide an optimal solution, i.e., one which minimizes the cost of access to Search Engines.
The present invention describes a new algorithm that extends the FA algorithm to the context of joins between search engines, thus also providing a solution to the rank-join problem. The characteristics of FA make it possible for the present invention to determine, at query formulation time, the optimal execution strategy for a rank-join query, even when information on the distributions of the scores of the elements returned by the search engines is not available. This is particularly relevant in the context of search engines over the Internet dealt with by the present invention, where such distributions are generally unknown or, if known, would typically be subject to change. Another important aspect of the present invention is that the optimal execution strategy it determines is independent of the aggregation function that is used to combine the element scores into a global combination score. As a consequence, no extra access to the search engines is required when the aggregation function is modified, for example by changing the weights in a weighted sum. By contrast, determining an optimal execution strategy with rank-join algorithms based on TA, including HRJN, necessarily requires knowledge or assumptions on both the score distributions and the aggregation function.