The present invention relates generally to the field of computer-based information retrieval, and in particular to the field of search engines that facilitate access to data published on the Internet and intranets, specifically meta-search engines that exploit other information sources in order to provide a better answer to user queries.
Web search engines have facilitated access to data published on the Internet and intranets. However, individual search engines are subject to certain limitations, and this has resulted in the design of meta-searchers that exploit other information sources (including those on the Web) in order to provide a better answer to user queries. Meta-searchers do not have their own document collections; instead, they forward user queries to external information sources in order to retrieve relevant data. They “wrap” the functionality of information sources and employ the corresponding wrappers for interaction with the remote sources.
User relevance feedback mechanisms exist in ordinary search engines, but these mechanisms assume full access to document content. A meta-searcher, however, typically receives only brief document descriptions, not the full content of documents.
In fact, certain conventional information sources (such as Library of Congress, ACM Digital Library, AltaVista, etc.) offer a “Find similar items” feature, whereby the user, after sending a conventional query, can request all items “similar” to a given one selected by the user from the query answer list. Generally, the selected document is used by the source to query its internal document collection. Finding similar documents in the collection is straightforward, as the source contains full descriptions of documents and implements conventional methods for evaluating the relevance (“distance”) between the given document and any other.
Unfortunately, this is not true for the meta-searcher, which, as explained, only receives a short summary of each document. Two immediate solutions are possible: full document retrieval and using similarity features of sources. In the first solution, the meta-searcher can retrieve all documents listed in the sources' answers, then analyze and re-rank them “from scratch”. However, downloading the documents takes time, and therefore the complete re-ranking cannot be performed on-line. In the second solution, the meta-searcher can profit from the “Find similar items” features of information sources by forwarding selected documents, but this can work successfully only if most sources provide such a service. Because only a few existing Web sources are adapted to search for similar documents, the majority of existing meta-searchers, such as SavvySearch (www.search.com), simply report the sources' ranks. Some others, such as MetaCrawler (www.metacrawler.com), do allow for a certain similar document search, but the “more like this” action will in fact be that of the source providing the document, if that source is capable of providing such a service.
Further, in meta-search engines, answers to a user query are retrieved from different information sources, but the sources' heterogeneity prevents the direct reuse of ranking or scoring information given by these sources. Thus, one of the important issues in meta-searching is the ranking of documents received from different sources. The number of documents relevant to a user query at one information source is often large, and the sources rank documents using different ranking methods. These methods are often protected and hidden from users, and thus also from the interrogating meta-searcher. As a result, meta-searchers have difficulty unifying sources' ranks and providing a final and unique ranking of answers delivered to the user.
Another problem is to give the user greater satisfaction with the query answers. When formulating precise queries with an individual search engine, an experienced user can benefit from the features of the source's query language, including attribute search, Boolean constraints, proximity operators, etc. However, even for a well-prepared query, hundreds of documents may fit the query so that, to get satisfactory results, the user query often undergoes multiple refinements.
The situation becomes more complicated in a meta-searcher, where certain important aspects of the meta-searching are purposely hidden from users: which information sources are contacted for querying, how initial queries are translated into native queries, how many items are extracted from each source, how to filter out items that do not fit the user query. All this makes the relationship between a query and the answers less obvious and thus makes the query refinement more cumbersome for the user. Since all this knowledge is encoded in different steps of the query processing, it becomes a challenge for the meta-searcher to help the user in query reformulation and in getting the most relevant answers.
A user can provide feedback on a search query result, for example by selecting and unselecting documents listed in the query answer. When a user selects relevant documents from the answer list, the meta-searcher could use a simple solution by sending the selected document as a query to an information source for querying and retrieval, but this works only when the source uses the vector-space model (VSM). In the VSM, the relevance of a similar document is determined by comparing certain document parameters and weighting the result so as to determine a vector distance between the selected document and another one. In short, in the VSM, each document is represented as a vector of keywords. Each element in a vector has a weight on a continuous scale. A typical approach is to take a document and process it until a list of unique words remains. This list, which contains all words in the document, is filtered through an algorithm that removes words that are too common to be searched, e.g., words like “the”, “of” and “a” are routinely filtered out. The remaining list of words is then depicted as a vector space, where each word represents a dimension. The length of the vector can be determined in a number of ways, ranging from basic algorithms which make the vector longer if a number of words occur more often, to complex ones that take into account term frequency and inverse document frequency measures.
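The vector-space processing described above can be illustrated with a minimal Python sketch. It is not taken from any particular search engine: the stop-word list, the smoothed weighting formula (term frequency times `log(N / df) + 1`) and the use of cosine similarity as the vector comparison are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter

# Illustrative stop-word list (an assumption); real systems use longer lists.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is"}


def tokenize(text):
    """Lower-case the text, strip punctuation, and drop stop words."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse vector (term -> weight) per document.

    Weight = term frequency * (log(N / document frequency) + 1).
    The '+ 1' keeps terms that occur in every document from vanishing.
    """
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        term_freq = Counter(tokens)
        vectors.append({
            t: count * (math.log(n_docs / doc_freq[t]) + 1.0)
            for t, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "The cat sat on the mat",
    "A cat sat on a mat",
    "Stock markets fell sharply",
]
vecs = tfidf_vectors(docs)
sim_close = cosine(vecs[0], vecs[1])  # same keywords once stop words are removed
sim_far = cosine(vecs[0], vecs[2])    # no keywords in common
```

In this sketch a source holding full document content can compare any two documents directly, which is exactly what a meta-searcher holding only short summaries cannot do reliably.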
Modern search engines generally use the VSM in web systems based on information retrieval technology, but a different model is used in web systems that query data in databases. This different model is called the enhanced Boolean model (EBM); in it, all documents in the collection, whether they satisfy the Boolean query or not, are ranked by a relevance score. In a Boolean model, documents are represented as a set. In a Boolean model set, a document is indexed by assigning a number of keywords. When a user submits a query, a similarity function tries to match the query with all documents in the index. In a strict Boolean model, the similarity function only returns documents that exactly match the query given by the user. That is why most search engines use the enhanced Boolean model, which is less restrictive, as it returns a list of documents that match according to a similarity percentage. No distance is determined, as there merely exists a list of ranked documents; a higher ranking of a document therefore does not necessarily mean that its content is similar to that of the selected document.
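The difference between the strict and the enhanced Boolean model can be sketched as follows. The text does not specify how the similarity percentage is computed, so the score used here (the fraction of query keywords present in a document's keyword set) is purely an assumption for illustration, as are the function names and the toy index.

```python
def strict_boolean_and(query_terms, index):
    """Strict model: return only documents containing every query keyword."""
    query = set(query_terms)
    return [doc_id for doc_id, keywords in index.items() if query <= keywords]


def enhanced_boolean_rank(query_terms, index):
    """Enhanced model: rank ALL documents by an assumed similarity percentage.

    The score is the fraction of query keywords found in the document
    (an illustrative choice, not the scoring of any real source).
    """
    query = set(query_terms)
    if not query:
        return []
    scored = [
        (doc_id, len(query & keywords) / len(query))
        for doc_id, keywords in index.items()
    ]
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Each document is indexed as a set of assigned keywords.
index = {
    "d1": {"meta", "search", "engine"},
    "d2": {"search", "engine"},
    "d3": {"database"},
}
strict_hits = strict_boolean_and(["meta", "search"], index)
ranked = enhanced_boolean_rank(["meta", "search"], index)
```

Note that the ranked list scores each document against the query only; it says nothing about how close two documents are to each other, which is the limitation pointed out above.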
Further, if the enhanced Boolean model were to be used, it might be possible to adopt the schema of learning classical Boolean queries. In machine learning, monotone Boolean queries can be efficiently learned in polynomial time. However, the assumptions imposed by the theoretical learning mechanism turn out to be too strong for real querying, where the user cannot be forced to give feedback on each answer document or be prohibited from altering the relevance marks on documents in successive refinements. Formally, this means the learning should use user relevance feedback that can be both incomplete and contradictory. As may be readily understood, all existing standard learning schemas become inefficient in such a setting.
Another possibility would be to rank documents based on a user profile, either defined statically as a set of keywords or extracted from another environment the user is working in, for example, a multiple user-shared recommendation system like Knowledge Pump, which is described in “Making Recommender Systems Work for Organizations” by Natalie S. Glance, Damián Arregui and Manfred Dardenne, Proceedings of PAAM, 1999. However, this solution is limited compared to the other above-mentioned possibilities when learning from successive query refinements.
It would therefore be advantageous to at least partly overcome the above-mentioned problems and to provide a method for improving the answer relevance in meta-search engines by using query analysis. It would further be advantageous to improve the answer relevance by also using user feedback analysis.
In accordance with one aspect of the invention, there is provided a method and apparatus therefor for improving search results from a meta-search engine that queries information sources containing document collections. Initially an original query is received with user selected keywords and user selected operators. The user selected operators define relationships between the user selected keywords. A set of information sources is identified to be interrogated using the original query by performing one of: (a) receiving a set of user selected information sources, (b) automatically identifying a set of information sources, and (c) performing a combination of (a) and (b). The set of information sources identifies two or more information sources. At least one of the user selected operators of the original query that is not supported by one of the information sources in the set of information sources is translated to an alternate operator that is supported by the one of the information sources in the set of information sources. A selected one of the translated queries and the original query is submitted to each information source in the set of information sources. Answers are received from each information source for the query submitted thereto. Each set of answers received from each information source that satisfy one of the translated queries is filtered by removing the answers that do not satisfy the original query. For each filtered set of answers, a subsumption ratio of the number of filtered answers that satisfy the original query to the number of answers that satisfy the translated query is computed. Each computed subsumption ratio is used to perform one of: (d) reformulating a translated query; (e) modifying information sources in the set of information sources automatically identified at (b); and (f) performing a combination of (d) and (e).
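The translate, filter and subsumption-ratio steps summarized above can be sketched in Python. This is a simplified illustration under stated assumptions, not the claimed implementation: queries are modeled as hypothetical `(operator, keyword, keyword)` tuples, the fallback table mapping an unsupported `NEAR` operator to `AND` is invented for the example, answers are plain summary strings, and `NEAR` is interpreted as "within three words".

```python
# Hypothetical fallback table: unsupported operator -> weaker supported one.
FALLBACK_OPS = {"NEAR": "AND"}


def translate(query, supported_ops):
    """Replace an operator the source does not support with an alternate one."""
    op, left, right = query
    if op in supported_ops:
        return query
    return (FALLBACK_OPS[op], left, right)


def satisfies(query, text, window=3):
    """Check a two-keyword query against an answer's summary text."""
    op, left, right = query
    words = text.lower().split()
    pos_left = [i for i, w in enumerate(words) if w == left]
    pos_right = [i for i, w in enumerate(words) if w == right]
    if op == "AND":
        return bool(pos_left) and bool(pos_right)
    if op == "OR":
        return bool(pos_left) or bool(pos_right)
    if op == "NEAR":  # assumed meaning: keywords at most `window` words apart
        return any(abs(i - j) <= window for i in pos_left for j in pos_right)
    raise ValueError(f"unknown operator: {op}")


def filter_and_ratio(original, answers):
    """Keep answers satisfying the original query; return them with the ratio.

    `answers` are assumed to be the results of the translated query, so the
    ratio is (answers satisfying original) / (answers satisfying translated).
    """
    kept = [a for a in answers if satisfies(original, a)]
    ratio = len(kept) / len(answers) if answers else 0.0
    return kept, ratio


original = ("NEAR", "meta", "search")
translated = translate(original, supported_ops={"AND", "OR"})

# Summaries as they might come back from a source for the translated query.
answers = [
    "meta search engines wrap sources",
    "search tools with rich meta data about every indexed page",
    "a meta analysis of how users search",
]
kept, ratio = filter_and_ratio(original, answers)
```

A low ratio indicates that the translated query is much broader than the original for that source, which is the kind of signal that could motivate reformulating the translated query or revising the set of sources.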