Text retrieval engines (TREs), or search engines, are used in a variety of web, intranet and desktop applications. In a typical information retrieval (IR) application, each document in a document collection is described by a set of representative keywords or phrases called “index terms.” The TRE searches the documents in the collection in response to a user query that comprises one or more of the index terms. The TRE typically returns a list of documents that best match the user query.
Most advanced information retrieval applications create an index of the documents in the collection that is to be searched. An example of such a system is the Guru search engine, which is described by Maarek and Smadja in “Full Text Indexing Based on Lexical Relations, an Application: Software Libraries,” Proceedings of the Twelfth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1989, pages 198-206, which is incorporated herein by reference.
The index typically contains, for each document, a set of index terms that appear in the document with a score assigned to each index term. A typical scoring model used in many information retrieval systems is the TF-IDF formula, described by Salton and McGill in “An Introduction to Modern Information Retrieval,” McGraw-Hill, 1983, chapter 3, pages 52-63, which is incorporated herein by reference. The score of term T for document D depends on the term frequency of T in D (denoted TF), the length of document D, and the inverse of the number of documents containing term T in the collection (inverse document frequency, denoted IDF).
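The dependence of a term score on TF, document length, and IDF can be illustrated with a minimal sketch. The source does not give an exact formula, so the weighting below (term count divided by document length, multiplied by the logarithm of the collection size over the document frequency) is an assumption, as are the function name `build_index` and the whitespace tokenization.

```python
import math
from collections import Counter


def build_index(docs):
    """Build a per-document index of term scores.

    docs: dict mapping a document ID to its text.
    Returns a dict mapping each document ID to {term: score}.
    The weighting is one common TF-IDF variant, assumed for illustration.
    """
    # Naive whitespace tokenization (an assumption; real engines do more).
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    index = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        length = len(tokens)
        # TF is normalized by document length; IDF is log(N / df).
        index[doc_id] = {
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return index
```

Under this particular weighting, a term that appears in every document of the collection receives a score of zero, since it does not help distinguish one document from another.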
Document scores are typically used to rank the search results provided by the TRE by their relevance to the query terms. For example, U.S. Patent Application Publication 2004/0002973 A1, whose disclosure is incorporated herein by reference, describes a method for automatically ranking database records by relevance to a given query. A similarity function is derived from data in the database and/or queries in a workload. The similarity function is then applied to a given query and used to rank the records.
In many information retrieval applications, documents are associated with one or more categories. The user query may request that the search be limited to one category or a combination of such categories. This search mode is referred to as “category-based search.” For example, U.S. Patent Application Publication 2003/0195877 A1, whose disclosure is incorporated herein by reference, describes a search engine that displays the results of a multiple-category search according to levels of relevance of the categories to a user search query.
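A category-based search of the kind described above can be sketched as a filter applied before ranking. This is a hedged illustration only and is not taken from the cited publication: the function name `category_search`, the `doc_categories` mapping, and the choice to score a document by summing its index-term scores are all assumptions.

```python
def category_search(index, doc_categories, query_terms, required_categories):
    """Rank documents that belong to every requested category.

    index: dict mapping doc ID -> {term: score} (hypothetical layout).
    doc_categories: dict mapping doc ID -> iterable of category names.
    Returns a list of (doc_id, score) pairs, best match first.
    """
    required = set(required_categories)
    results = []
    for doc_id, term_scores in index.items():
        # Skip documents that lack any of the requested categories.
        if not required <= set(doc_categories.get(doc_id, ())):
            continue
        # Assumed scoring: sum of the scores of the matching query terms.
        score = sum(term_scores.get(term, 0.0) for term in query_terms)
        if score > 0:
            results.append((doc_id, score))
    # Highest score first; ties are broken by document ID for determinism.
    results.sort(key=lambda pair: (-pair[1], pair[0]))
    return results
```

In this sketch a combination of categories is treated as a conjunction, so a document must carry all requested categories to be returned. Other combinations, such as a union of categories, would require a different filter condition.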
Several publications propose methods for performing category-based searches. For example, U.S. Pat. No. 5,826,260, whose disclosure is incorporated herein by reference, describes an information retrieval system that analyzes a user query and presents a “hit list” of documents to the user. The presented hit list displays an overall rank of a document and the contribution of each query element to the overall rank. The user can then reorder the hit list by prioritizing the contribution of individual query elements to override the overall rank, and by assigning additional weights to those contributions.
Another approach for category-based searching is described by Glover et al. in “Improving Category Specific Web Search by Learning Query Modifications,” IEEE Symposium on Applications and the Internet (SAINT 2001), San Diego, Calif., January 2001, pages 23-31, which is incorporated herein by reference. The authors describe a system that recognizes web pages of a specific category. The system learns modifications to queries that bias results toward documents in that category. Extra words or phrases are added to a user query in order to increase the likelihood that results of the desired category are ranked near the top.
In some applications, a document collection is divided into several sub-collections, and a search is defined over several such sub-collections. For example, U.S. Pat. No. 6,795,820, whose disclosure is incorporated herein by reference, describes a meta-search method conducted across multiple document collections. A multi-phase approach is employed, in which local and global statistics are dynamically exchanged between local search engines and the meta-search engine in response to a user query. The meta-search engine merges results from the individual search engines to produce a single list of ranked results for the user.
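The general idea of exchanging local and global statistics before merging can be sketched as follows. This is an assumed two-phase scheme written for illustration, not the method of the cited patent: the class `LocalEngine`, the function `meta_search`, and the TF-IDF weighting are hypothetical. In the first phase each local engine reports its collection size and per-term document frequencies; in the second phase the meta-search engine returns a global IDF, so that scores from different sub-collections are comparable when merged.

```python
import math


class LocalEngine:
    """Hypothetical search engine over one sub-collection."""

    def __init__(self, docs):
        # docs: dict mapping doc ID -> list of tokens.
        self.docs = docs

    def local_stats(self, terms):
        """Phase 1: report collection size and local document frequencies."""
        doc_freq = {
            t: sum(1 for tokens in self.docs.values() if t in tokens)
            for t in terms
        }
        return len(self.docs), doc_freq

    def search(self, terms, idf):
        """Phase 2: score local documents using the global IDF values."""
        hits = []
        for doc_id, tokens in self.docs.items():
            if not tokens:
                continue
            score = sum(tokens.count(t) / len(tokens) * idf[t] for t in terms)
            if score > 0:
                hits.append((score, doc_id))
        return hits


def meta_search(engines, terms):
    """Aggregate statistics, query each engine, and merge the results."""
    total_docs = 0
    global_df = {t: 0 for t in terms}
    for engine in engines:
        n_docs, doc_freq = engine.local_stats(terms)
        total_docs += n_docs
        for t in terms:
            global_df[t] += doc_freq[t]

    # Global IDF; terms absent from every sub-collection contribute nothing.
    idf = {
        t: math.log(total_docs / global_df[t]) if global_df[t] else 0.0
        for t in terms
    }

    merged = []
    for engine in engines:
        merged.extend(engine.search(terms, idf))
    # Single ranked list: highest score first, ties broken by doc ID.
    merged.sort(key=lambda hit: (-hit[0], hit[1]))
    return [doc_id for _, doc_id in merged]
```

Without the shared IDF, each local engine would weight a term by its rarity in its own sub-collection only, and the merged ranking could favor documents from sub-collections in which the term happens to be locally rare.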