1. Field of the Invention
This invention relates generally to the field of indices and queries applied to a collection of documents in a database. More specifically, this invention relates to efficient expansion and processing of the queries, to reducing the size of indices used to perform the query expansion, and to progressive query processing.
2. Description of the Related Art
Conventional retrieval systems, by which documents may be retrieved through the application of queries, are based on a common set of principles and methodologies of categorizing documents. Documents are normally indexed manually by subject experts or librarians using pre-specified and controlled vocabularies. Alternatively, documents can be indexed based on the words included in the documents. Users can search documents using terms from the accepted vocabularies, together with appropriate boolean operators between them. In this type of system, an exact match strategy is used. Although this approach has many advantages, such as simplicity and high precision, it suffers from the problem of word mismatch.
The problem of word mismatch in information retrieval occurs because people often use different words to describe concepts in their queries than authors use to describe the same concepts in their documents. FIG. 1 shows that words used in HyperText Markup Language (HTML) documents related to the words “car” and “dealer” may vary from one document to another. Languages other than HTML, such as Extensible Markup Language (XML) and Standard Generalized Markup Language (SGML), may be used. If a user uses a query with the words “automobile” and “dealer,” he or she cannot retrieve all the relevant documents due to word mismatch problems.
Query expansion has been suggested as a technique for dealing with this problem. Such an approach would expand queries using semantically similar words (e.g. synonyms or other semantically related words) and syntactically related words (e.g. words co-occurring in the same document above a certain frequency are syntactically co-occurring words) to those words in the query to increase the chances of matching words in relevant documents. When query expansion is used, the “car dealer” query is expanded as follows to include terms with similar meanings:
Line 1. [(“car” OR “automobile” OR “auto” OR “sedan”) OR
Line 2. (“Ford” OR “Buick”)] AND
Line 3. (“Dealer” OR “Showroom” OR “SalesOffice”)
There are two types of query expansion involved in this example. The query expansions in Line 1 and Line 3 are adding additional words related to car and dealer by lexical semantics, i.e. words which are semantically similar. Automobile, auto, and sedan are words having a similar meaning to the word car. Similarly, Showroom and SalesOffice have meanings similar to the word dealer. The other type of query expansion, shown in Line 2, is by, for example, syntactical co-occurrence relationships. A large number of the words used on the World Wide Web (“the Web”) are actually proper names, which cannot be found in lexical dictionaries. Examples of proper names include Ford, Buick, NBA, and National Football League. As noted above, syntactical co-occurrence relationships are derived from analysis on the frequency of two words co-occurring in the same document. This is based on the assumption that there is a higher chance that two words are related if they appear frequently together in the same document. For example, the co-occurring words with Ford could be dealer, body shop, Mustang, Escort, etc.
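The two kinds of expansion described above can be illustrated with a minimal Python sketch. The lookup tables SEMANTIC and COOCCURRING and the function names are hypothetical and exist only for this illustration; their entries simply mirror the "car dealer" example, and nothing here is drawn from any cited system.

```python
# Hypothetical lookup tables; the entries mirror the "car dealer" example.
SEMANTIC = {
    "car": ["automobile", "auto", "sedan"],
    "dealer": ["showroom", "salesoffice"],
}
COOCCURRING = {
    "car": ["ford", "buick"],
}


def expand_query(words):
    """Return one OR-group (a list of terms) per query word; groups are ANDed."""
    groups = []
    for word in words:
        key = word.lower()
        group = [key]
        # Add semantically similar words first, then co-occurring words.
        for term in SEMANTIC.get(key, []) + COOCCURRING.get(key, []):
            if term not in group:
                group.append(term)
        groups.append(group)
    return groups


def matches(groups, document_words):
    """A document matches if it contains at least one term from every group."""
    present = {w.lower() for w in document_words}
    return all(any(term in present for term in group) for group in groups)


def to_boolean_string(groups):
    """Render the expanded query in the OR/AND notation used in the text."""
    return " AND ".join(
        "(" + " OR ".join('"%s"' % term for term in group) + ")"
        for group in groups
    )
```

With these assumed tables, expand_query(["car", "dealer"]) yields two OR-groups, and a document containing only "Buick" and "showroom" satisfies both groups even though it contains neither original query word.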
To support query expansion, indices of words related by lexical semantics and syntactical relationships, such as co-occurrence, need to be maintained. The indices for related words by lexical semantics can be constructed as a hierarchical structure (see e.g. W. Li et al., “Facilitating Multimedia Database Exploration through Visual Interfaces and Perpetual Query Reformulations,” Proceedings of the 23rd International Conference on Very Large Data Bases, pages 538-547, Athens, Greece, August 1997), a semantic network (see e.g. G. A. Miller, “Nouns in WordNet: A Lexical Inheritance System,” International Journal of Lexicography 3 (4), 1990, pages 245-264), or hierarchical clusters of associated words (see e.g. G. Salton et al., “The SMART and SIRE Experimental Retrieval Systems,” pages 118-155, McGraw-Hill, New York, 1983). Since syntactical relationships, such as syntactical co-occurrence relationships, are binary, the size of syntactical relationship indices can be extremely large. Some techniques have been proposed for stemming. See e.g., G. Grefenstette, “Use of syntactic context to produce term association lists for text retrieval,” Proceedings of the Fifteenth Annual International ACM SIGIR Conference, Denmark, 1992; J. Xu et al., “Query Expansion Using Local and Global Document Analysis,” Proceedings of the 19th Annual International ACM SIGIR Conference, Zurich, Switzerland, 1996; and C. Jacquemin, “Guessing Morphology from Terms and Corpora,” Proceedings of the 20th Annual International ACM SIGIR Conference, Philadelphia, Pa., USA, 1997. Such techniques include analysis of occurrence frequency, and employing morphological rules (e.g. converting all words to root form) or lexical dictionaries. However, the size of indices for words associated by syntactical co-occurrence relationships is too large to search efficiently.
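To make the size concern concrete: because co-occurrence relates pairs of words, a vocabulary of N distinct words can give rise to as many as N(N-1)/2 pairs. The following Python sketch counts document-level co-occurrence and keeps pairs above a threshold. The function names and the min_documents parameter are assumptions made for illustration, not part of any cited technique.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each unordered word pair, how many documents contain both."""
    counts = Counter()
    for words in documents:
        # Deduplicate and sort so each pair has one canonical form.
        vocabulary = sorted({w.lower() for w in words})
        for pair in combinations(vocabulary, 2):
            counts[pair] += 1
    return counts


def related_pairs(counts, min_documents):
    """Keep only pairs that co-occur in at least min_documents documents."""
    return {pair for pair, n in counts.items() if n >= min_documents}
```

For the three toy documents ["Ford", "dealer", "Mustang"], ["Ford", "dealer", "Escort"], and ["Buick", "dealer"], five distinct words already produce six index entries, of which only the ("dealer", "ford") pair survives a threshold of two documents.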
A substantial amount of work on the problem of word mismatch has been done in the area of information retrieval (IR). See e.g. G. Salton et al., “Introduction to Modern Information Retrieval,” McGraw-Hill Book Company, 1983; G. Salton, “Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer,” Addison-Wesley Publishing Company, Inc., 1989; and K. Sparck Jones et al., “Readings in Information Retrieval,” Morgan Kaufmann, San Francisco, Calif., USA, 1997. However, much of the work has been directed to the study of retrieval measures such as recall and precision. Although some work has suggested ways to efficiently support query expansion (see e.g. C. Buckley et al., “Automatic Query Expansion Using SMART,” Proceedings of the 3rd Text Retrieval Conference, Gaithersburg, Md., 1993) and indexing mechanisms, two problems have persisted without an acceptable solution. First, index size is extremely large since many words in a document collection (e.g. the Web) are distinct proper names and each word has a number of semantically similar and syntactically related words. Second, query processing is expensive since queries are expanded by adding additional words.
These problems get worse when dealing with document information collected from the Web since the number of documents is very large and the words used are extremely diverse, inconsistent, and sometimes incorrect (e.g., typographical errors). A study has shown that most user queries on the Web typically involve two words. See B. Croft et al., “Providing Government Information on the Internet: Experiences with THOMAS,” Proceedings of Digital Libraries (DL '95), 1995. However, with query expansion, query lengths increase substantially. As a result, most existing search engines on the Web do not provide query expansion functionality.
An overview of existing work in the area of query expansion will now be presented. Query expansion has received significant attention in the field of IR. However, the focus in the past has been to evaluate the improvements in retrieval measures, i.e., precision and recall, as a result of query expansion. Another research focus has been in the direction of building dictionaries so as to identify a set of similar terms for a given query word. However, the existing work has done little to address the problem of efficient processing of queries when they undergo query expansion or to reduce the size of the indices used to perform query expansion and processing. Furthermore, the issue of ranking documents on the basis of exact and similarity matches remains a difficult problem.
SMART is one of the best known advanced information retrieval systems. See R. T. Dattola, “Experiments with a fast algorithm for automatic classification,” Gerard Salton, editor, The SMART Retrieval System - Experiments in Automatic Document Processing, chapter 12, Prentice-Hall, Inc., Englewood Cliffs, N.J., 1971; and G. Salton et al., “The SMART and SIRE Experimental Retrieval Systems,” supra. In SMART, each document is represented by a vector of terms. Each position of the vector represents the weight (i.e. importance) of corresponding terms in the document. For a document collection of M documents with N distinct terms, the collection is represented as an M × N matrix. A query is also represented as a vector of terms. The document retrieval is based on similarity computation of the cosine measure of the query vector and each document vector. Other well known systems include INQUERY. See J. Callan et al., “Trec and tipster experiments with inquery,” Information Processing and Management, 31:327-332, 1995.
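The cosine measure mentioned above is the dot product of two vectors divided by the product of their lengths. The following Python sketch ranks the rows of a small M × N term-weight matrix against a query vector; it illustrates the general vector-space idea only and is not SMART's actual implementation. The function names and the toy weights are assumptions.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank(query, matrix):
    """Return document row numbers ordered by decreasing cosine similarity."""
    scores = [(cosine(query, row), i) for i, row in enumerate(matrix)]
    # Ties are broken by the lower row number.
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]
```

For example, with columns standing for the terms car, dealer, and Ford, the matrix [[1, 1, 0], [0, 0, 1], [1, 0, 0]] and the query [1, 1, 0] place row 0 first, row 2 second, and row 1 last. Row 1 scores zero because it shares no term with the query, which is the word mismatch problem in vector form.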
Latent Semantic Indexing (LSI) is a technique which relies on statistically derived conceptual indices instead of individual term retrieval in lexical matching. See R. Harshman et al., “Indexing by latent semantic analysis,” Journal of the American Society for Information Science, 41:391-407, 1990; and M. W. Berry et al., “Computational Methods for Intelligent Information Access,” Proceedings of the 1995 ACM Conference on Supercomputing, 1995. LSI assumes that there is some hidden or latent structure in word usage, which needs to be externalized by analyzing the word occurrence in a document. Hence, documents are viewed as vectors in a very high dimensional term space and the individual elements in the vector represent the frequency of occurrence of a particular term in a given document. More sophisticated measures based on both global and local weightings can also be used. A truncated singular value decomposition (SVD) is used to estimate the structure in word usage across documents. See G. Golub et al., “Matrix Computations,” Johns-Hopkins, Baltimore, Second Edition, 1989. Retrieval is then performed using the database of singular values and vectors obtained from the truncated SVD. Preliminary evaluation of LSI indicates that this approach to information retrieval is more robust than retrieval based on individual terms.
Automated query expansion has long been suggested as a technique for dealing with the word mismatch issue. See e.g., E. Voorhees, “Query Expansion Using Lexical-Semantic Relations,” Proceedings of the 17th Annual International ACM SIGIR Conference, Dublin, Ireland, 1994. One approach uses a thesaurus to expand the query to increase the chances of matching words in relevant documents. A study has shown that simply using a general thesaurus provides limited improvement. Id. Many advanced techniques have also been proposed. See e.g., O. Kwon et al., “Query Expansion Using Domain Adapted, Weighted Thesaurus in an Extended Boolean Model,” Proceedings of the 3rd International Conference on Information and Knowledge Management, 1994; Y. Qiu et al., “Concept Based Query Expansion,” Proceedings of the 16th Annual International ACM SIGIR Conference, Pittsburgh, Pa., USA, 1993; E. Voorhees, “Query Expansion Using Lexical-Semantic Relations,” supra; and M. W. Berry et al., “Computational Methods for Intelligent Information Access,” supra. Based on the experimental results, automatic query expansion, on average, improves the effectiveness of retrieval by 7% to 25%. See C. Buckley et al., “Automatic Query Expansion Using SMART,” supra.
Alternatively, improvements can be made by including syntactically relevant words. This approach is to cluster words based on co-occurrence in documents and to use these clusters to expand queries. Since co-occurrence is a binary relationship, the size of such an index is usually extremely large. One group has proposed a technique for using corpus-based word variant co-occurrence statistics to modify or create a stemmer and has demonstrated its advantage over the approach of using only morphological rules. See W. B. Croft et al., “Corpus-Specific Stemming Using Word Form Co-occurrence,” Proceedings of the Fourth Annual Symposium, 1994. The above techniques that expand a query term to a set of semantically related terms are called global analysis. In query expansion, terms from relevance feedback can also be added to the subsequent query to improve the effectiveness of retrieval. See G. Salton et al., “Improving retrieval performance by relevance feedback,” Journal of the American Society for Information Science, 41(4):288-297, June 1990. This is called local analysis. A formal study has shown that using global analysis techniques, such as word context and phrase structure, on the local set of documents produces results that are both more effective and more predictable than simple local feedback. See J. Xu et al., “Query Expansion Using Local and Global Document Analysis,” supra. Each of the references discussed herein is hereby incorporated by reference.
However, as noted above, the past work has failed to address the problem of efficient processing of queries when they undergo query expansion or of reducing the size of the indices used to perform query expansion and processing.
The present invention provides a solution to the problem of word mismatch and resulting inefficient query processing via a method and apparatus for efficient query expansion using reduced size indices and for progressive query processing. More specifically, queries are expanded conceptually, rather than physically, using semantically similar and syntactically related words to those specified by the user in the query to reduce the chances of missing relevant documents. To support query expansion, indices on words related by lexical semantics and syntactical co-occurrence need to be maintained. Two issues become paramount in supporting such query expansion: the size of index tables and the query processing overhead. In accordance with the present invention, the notion of a multi-granularity information and processing structure is used to support efficient query expansion, which involves an indexing phase, a query processing phase, and a ranking phase. In the indexing phase, semantically similar words are grouped into a concept, which results in a substantial index size reduction due to the coarser granularity of semantic concepts. During query processing, the words in a query are mapped into their corresponding semantic concepts and syntactic extensions, using a dictionary and actual data contents, resulting in a logical expansion of the original query. Additionally, the processing overhead of physically expanding the query can be avoided. The initial query words can then be used to rank the documents in the answer set on the basis of exact, semantic and syntactic matches and can also be used to perform progressive query processing.
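The three phases named above can be pictured with a small Python sketch. This passage does not specify data structures, so everything below is a hypothetical illustration rather than the claimed method: the CONCEPT_OF dictionary, the concept names, and the ranking rule (number of original query words matched exactly) are all assumptions, and syntactic extensions are omitted for brevity.

```python
from collections import defaultdict

# Hypothetical word-to-concept dictionary (assumed for illustration only).
CONCEPT_OF = {
    "car": "CAR", "automobile": "CAR", "auto": "CAR", "sedan": "CAR",
    "dealer": "DEALER", "showroom": "DEALER", "salesoffice": "DEALER",
}


def concept(word):
    """Map a word to its concept; unknown words stand for themselves."""
    w = word.lower()
    return CONCEPT_OF.get(w, w)


def build_index(documents):
    """Indexing phase: map each concept to the ids of documents using it."""
    index = defaultdict(set)
    for doc_id, words in documents.items():
        for w in words:
            index[concept(w)].add(doc_id)
    return index


def search(index, documents, query_words):
    """Query phase: AND over concepts. Ranking phase: exact matches first."""
    postings = [index.get(concept(w), set()) for w in query_words]
    candidates = set.intersection(*postings) if postings else set()

    def exact_hits(doc_id):
        present = {w.lower() for w in documents[doc_id]}
        return sum(1 for w in query_words if w.lower() in present)

    return sorted(candidates, key=lambda d: (-exact_hits(d), d))
```

In this sketch a query for "car" and "dealer" looks up two concept entries instead of ORing together seven individual words, and a toy collection with six distinct words needs only three index keys. Documents that match only through a concept are still returned, but after those containing the original query words.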