The present invention relates generally to computer-based information retrieval, and more particularly to a system and method for searching databases of electronic text.
The commercial potential for information retrieval systems that can query unstructured text or multimedia collections with high speed and precision is enormous. In order to fulfill their potential, collaborative knowledge based systems like the World Wide Web (WWW) must go several steps beyond digital libraries in terms of information retrieval technology. To do so, unstructured and heterogeneous bodies of information must be transformed into intelligent databases, capable of supporting decision making and timely information exchange. The dynamic and often decentralized nature of a knowledge sharing environment requires constant checking and comparison of the information content of multiple databases. Incoming information may be up-to-date, out-of-date, complementary, contradictory or redundant with respect to existing database entries. Further, in a dynamic document environment, it is often necessary to update indices and change or eliminate dead links. Moreover, it may be desirable to determine conceptual trends in a document set at a particular time. Additionally, it can be useful to compare the current document set to some earlier document set in a variety of ways.
As is generally known, information retrieval is the process of comparing document content with information need. Currently, most commercially available information retrieval engines are based on two simple but robust metrics: exact matching or the vector space model. In response to an input query, exact-match systems partition the set of documents in the collection into those documents that match the query and those that do not. The logic used in exact-match systems typically involves Boolean operators, and accordingly is very rigid: the presence or absence of a single term in a document is sufficient for retrieval or rejection of that document. In its simplest form, the exact-match model does not incorporate term weights. The exact-match model generally assumes that all documents containing the exact term(s) found in the query are equally useful. Information retrieval researchers have proposed various revisions and extensions to the basic exact-match model. In particular, the “fuzzy-set” retrieval model (Lopresti and Zhou, 1996, No. 21 in Appendix A) introduces term weights so that documents can be ranked in decreasing order relative to the frequency of occurrence of those weighted terms.
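The rigidity of exact matching can be illustrated with a minimal sketch. The function names, the whitespace tokenizer, and the three toy documents are assumptions made for illustration only; they are not taken from any of the cited systems.

```python
def tokenize(text):
    """Lowercase a string and split it into a set of whitespace-delimited terms."""
    return set(text.lower().split())


def boolean_and(query, documents):
    """Partition document ids into (matches, non_matches) under an AND query."""
    query_terms = tokenize(query)
    matches, non_matches = [], []
    for doc_id, text in documents.items():
        # A document matches only if every query term is present in it.
        if query_terms <= tokenize(text):
            matches.append(doc_id)
        else:
            non_matches.append(doc_id)
    return matches, non_matches


# Hypothetical toy collection.
docs = {
    "d1": "car engine repair manual",
    "d2": "automobile engine repair guide",
    "d3": "car rental prices",
}
matches, non_matches = boolean_and("car engine", docs)
```

In this sketch the absence of the single term "car" is enough to reject document d2, and every matching document is treated as equally useful because no term weights are involved.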
The vector space model (Salton, 1983, No. 30 in Appendix A) views documents and queries as vectors in a high-dimensional vector space, where each dimension corresponds to a possible document feature. The vector elements may be binary, as in the exact-match model, but they are usually taken to be term weights which assign “importance” values to the terms within the query or document. The term weights are usually normalized. The similarity between a given query and a document to which it is compared is taken to be a measure of the distance between the query and document vectors. The cosine similarity measure is used most frequently for this purpose. It is the normalized inner product of the two weight vectors:
$$\cos(q, D_i) = \frac{w_q \cdot w_{d_i}}{\lVert w_q \rVert \, \lVert w_{d_i} \rVert} = \frac{\sum_{j=1}^{p} w_{q_j} w_{d_{ij}}}{\sqrt{\sum_{j=1}^{p} w_{q_j}^2 \, \sum_{j=1}^{p} w_{d_{ij}}^2}}$$
where $q$ is the input query, $D_i$ is a column of the term-document matrix, $w_{q_j}$ is the weight assigned to term $j$ in the query, $w_{d_{ij}}$ is the weight assigned to term $j$ in document $i$, and $p$ is the number of terms. This similarity function gives a value of 0 when the document and query have no terms in common and a value of 1 when their vectors are identical. The vector space model ranks the documents based on their “closeness” to a query. The disadvantages of the vector space model are the assumed independence of the terms and the lack of a theoretical justification for the use of the cosine metric to measure similarity. Notice, in particular, that for normalized weights the cosine measure is 1 only if $w_{q_j} = w_{d_{ij}}$ for every term $j$. This is very unlikely to happen in any search, however, because of the different meanings that the weights $w$ often assume in the contexts of a query and a document index. In fact, the weights in the document vector are an expression of some statistical measure, like the absolute frequency of occurrence of each term within a document, whereas the weights in the query vector reflect the relative importance of the terms in the query, as perceived by the user.
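The cosine ranking described above can be sketched in a few lines. The function names `cosine` and `rank`, the four-term vocabulary, and the weights are illustrative assumptions rather than part of any cited system.

```python
import math


def cosine(w_q, w_d):
    """Normalized inner product of a query and a document weight vector."""
    dot = sum(a * b for a, b in zip(w_q, w_d))
    norm = math.sqrt(sum(a * a for a in w_q) * sum(b * b for b in w_d))
    # An all-zero vector has no direction; treat its similarity as 0.
    return dot / norm if norm > 0 else 0.0


def rank(w_q, doc_weights):
    """Return (doc_id, score) pairs sorted by decreasing cosine similarity."""
    scores = [(doc_id, cosine(w_q, w_d)) for doc_id, w_d in doc_weights.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Assumed toy vocabulary, in order: car, engine, repair, automobile.
query = [1, 1, 0, 0]
doc_weights = {
    "d1": [2, 1, 1, 0],
    "d2": [0, 1, 1, 2],
    "d3": [1, 0, 0, 0],
}
ranking = rank(query, doc_weights)
```

Here the document weights are raw term counts while the query weights are user-assigned importances, which is one reason a score of exactly 1 is rarely reached in practice.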
For any given search query, the document that is in fact the best match for the actual information needs of the user may employ synonyms for key concepts, instead of the specific keywords entered by the user. This problem of “synonymy” may result in a low similarity measure between the search query and the best match article using the cosine metric. Further, terms in the search query may have meanings in the context of the search query which are not related to their meanings within individual ones of the documents being searched. This problem of “polysemy” may result in relatively high similarity measures for articles that are in fact not relevant to the information needs of the user providing the search query, when the cosine metric is employed.
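Both effects can be reproduced with a small sketch that applies the cosine metric to raw term counts. The query, the two documents, and the helper name `cosine_text` are hypothetical and chosen only to make the two failure modes visible.

```python
import math
from collections import Counter


def cosine_text(text_a, text_b):
    """Cosine similarity between two texts using raw term counts as weights."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm > 0 else 0.0


# Assume the user wants information about how fast the animal runs.
query = "jaguar speed"
relevant_doc = "big cat velocity when hunting"   # synonyms only, no shared terms
irrelevant_doc = "jaguar car top speed"          # same words, different sense

synonymy_score = cosine_text(query, relevant_doc)
polysemy_score = cosine_text(query, irrelevant_doc)
```

Under these assumptions the relevant document scores 0 because it shares no literal term with the query, while the irrelevant document scores well above it because the shared terms carry a different meaning.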
Some of the most innovative search engines on the World Wide Web exploit data mining techniques to derive implicit information from link and traffic patterns. For instance, Google and CLEVER analyze the “link matrix” (hyperlink structure) of the Web. In these models, the ranking weight of a result depends on the frequency and authority of the links pointing to a page. Other information retrieval models track users' preferences through collaborative filtering, such as technology provided by Firefly Network, Inc., LikeMinds, Inc., Net Perceptions, Inc., and Alexa Internet, or employ a database of prior relevance judgements, such as technology provided by Ask Jeeves, Inc. The Direct Hit search engine offers a solution based on popularity tracking, and looks superficially like collaborative filtering (Werbach, 1999, No. 34 in Appendix A). Whereas collaborative filtering identifies clusters of associations within groups, Direct Hit passively aggregates implicit user relevance judgements around a topic. The InQuery system (Broglio et al, 1994, No. 8 in Appendix A; Rajashekar and Croft, 1995, No. 29 in Appendix A) uses Bayesian networks to describe how text and queries should be modified to identify relevant documents. InQuery focuses on automatic analysis and enhancement of queries, rather than on in-depth analysis of the documents in the database.
While many of the above techniques improve search results based on previous users' preferences, none attempts to interpret word meaning or overcome the fundamental problems of synonymy, polysemy and search by concept. These are addressed by expert systems consisting of electronic thesauri and lexical knowledge bases. The design of a lexical knowledge base in existing systems requires the involvement of large teams of experts. It entails manual concept classification, choice of categories, and careful organization of categories into hierarchies (Bateman et al, 1990, No. 3 in Appendix A; Bouad et al, 1995, No. 7 in Appendix A; Guarino, 1997, No. 14 in Appendix A; Lenat and Guha, 1990, No. 20 in Appendix A; Mahesh, 1996, No. 23 in Appendix A; Miller, 1990, No. 25 in Appendix A; Mahesh et al, 1999, No. 24 in Appendix A; Vogel, 1997 and 1998, Nos. 31 and 32 in Appendix A). In addition, lexical knowledge bases require careful tuning and customization to different domains. Because they try to fit a preconceived logical structure to a collection of documents, lexical knowledge bases typically fail to deal effectively with heterogeneous collections such as the Web. By contrast, the approach known as Latent Semantic Indexing (LSI) uses a data driven solution to the problem of lexical categorization in order to deduce and extract common themes from the data at hand.
LSI and Multivariate Analysis
Latent Semantic Analysis (LSA) is a promising departure from traditional models. The method attempts to provide intelligent agents with a process of semantic acquisition. Researchers at Bellcore (Deerwester et al, 1990, No. 10 in Appendix A, U.S. Pat. No. 4,839,853; Berry et al, 1995, No. 5 in Appendix A; Dumais, 1991, No. 11 in Appendix A; Dumais et al, 1998, No. 12 in Appendix A) have disclosed a computationally intensive algorithm known as Latent Semantic Indexing (LSI). This is an unsupervised classification technique based on Singular Value Decomposition (SVD). Cognitive scientists have shown that the performance of LSI on multiple-choice vocabulary and domain knowledge tests emulates expert essay evaluations (Foltz et al, 1998, No. 13 in Appendix A; Landauer and Dumais, 1997, No. 16 in Appendix A; Landauer et al., 1997, 1998a and 1998b, Nos. 17, 18 and 19 in Appendix A; Wolfe et al, 1998, No. 36 in Appendix A). LSI tries to overcome the problems of query and document matching by using statistically derived conceptual indices instead of individual terms for retrieval. LSI assumes that there is some underlying or latent structure in term usage. This structure is partially obscured through variability in the individual term attributes which are extracted from a document or used in the query. A truncated singular value decomposition (SVD) is used to estimate the structure in word usage across documents. Following Berry et al (1995), No. 5 in Appendix A, let $D$ be an $m \times n$ term-document or information matrix with $m > n$, where each element $d_{ij}$ is some statistical indicator (binary, term frequency or Inverse Document Frequency (IDF) weights; more complex statistical measures of term distribution could be supported) of the occurrence of term $i$ in a particular document $j$, and let $q$ be the input query. LSI approximates $D$ as
$$D' = U_k \Lambda_k V_k^T$$
where $\Lambda_k = \mathrm{diag}(\lambda_1, \ldots, \lambda_k)$, and $\{\lambda_i, i = 1, \ldots, k\}$ are the first $k$ ordered singular values of $D$, and the columns of $U_k$ and $V_k$ are the first $k$ orthonormal eigenvectors associated with $DD^T$ and $D^TD$ respectively. The weighted left orthogonal matrix provides a transform operator for both documents (columns of $D'$) and $q$:
$$V_k^T = (\Lambda^{-1} U^T)_k \, D' \qquad (1)$$
$$\alpha = (\Lambda^{-1} U^T)_k \, q$$
The cosine metric is then employed to measure the similarity between the transformed query $\alpha$ and the transformed document vectors (rows of $V_k$) in the reduced $k$-dimensional space.
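The truncated decomposition and the transform operator described above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the Bellcore implementation: the helper names `lsi_fit`, `lsi_project` and `cosine`, the five-term vocabulary, the 5×4 term-count matrix, and the choice k = 2 are all hypothetical.

```python
import numpy as np


def lsi_fit(D, k):
    """Return U_k, the first k singular values, and V_k^T of the matrix D."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def lsi_project(U_k, s_k, x):
    """Apply the operator (Lambda^-1 U^T)_k to a term vector or term-document matrix."""
    return np.diag(1.0 / s_k) @ U_k.T @ x


def cosine(a, b):
    """Cosine of the angle between two vectors (0 if either has zero length)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Assumed toy term-document matrix: rows are terms, columns are documents.
D = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 2],  # flower
    [0, 0, 1, 1],  # garden
], dtype=float)
q = np.array([1, 0, 0, 0, 0], dtype=float)  # query containing only "car"

U_k, s_k, Vt_k = lsi_fit(D, k=2)
doc_coords = lsi_project(U_k, s_k, D)  # columns are documents in latent space
alpha = lsi_project(U_k, s_k, q)       # transformed query

n_docs = D.shape[1]
raw_scores = [cosine(q, D[:, i]) for i in range(n_docs)]
lsi_scores = [cosine(alpha, doc_coords[:, i]) for i in range(n_docs)]
```

In this toy collection the query "car" shares no term with the second document ("automobile engine"), so its raw cosine score is 0, whereas in the two-dimensional latent space both engine-related documents align with the query because "car" and "automobile" co-occur with "engine". The projected documents in `doc_coords` coincide with `Vt_k`, which is what equation (1) states.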
Computing SVD indices for large document collections may be problematic. Berry et al (1995), No. 5 in Appendix A, report 18 hours of CPU time on a SUN SPARC 10 workstation for the computation of the 200 largest singular values of a 90,000-term by 70,000-document matrix. Whenever terms or documents are added, two alternatives exist: folding-in new documents or recomputing the SVD. The process of folding-in documents exploits the previous decomposition, but does not maintain the orthogonality of the transform space, leading to a progressive deterioration in performance. Dumais (1991), No. 11 in Appendix A, and O'Brien (1994), No. 26 in Appendix A, have proposed SVD updating techniques. These are still computationally intensive, and certainly unsuitable for real-time indexing of databases that change frequently. No fast updating alternative has been proposed for the case when documents are removed.
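The loss of orthogonality under folding-in can be made explicit with a short worked derivation. It assumes that folding-in applies the transform operator of equation (1) to the term vector $d$ of the new document and appends the result as a new row of $V_k$; the symbols $\hat{v}$ and $\tilde{V}_k$ are introduced here for illustration only.

$$\hat{v} = (\Lambda^{-1} U^T)_k \, d, \qquad \tilde{V}_k = \begin{pmatrix} V_k \\ \hat{v}^T \end{pmatrix}$$

$$\tilde{V}_k^T \tilde{V}_k = V_k^T V_k + \hat{v}\,\hat{v}^T = I_k + \hat{v}\,\hat{v}^T \neq I_k \quad \text{unless } \hat{v} = 0$$

Each folded-in document therefore adds a further rank-one perturbation to the identity, so the columns of the augmented matrix drift progressively away from orthonormality until the SVD is recomputed.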
Bartell et al. (1996), No. 2 in Appendix A, have shown that LSI is an optimal special case of multidimensional scaling. The aim of all indexing schemes which are based on multivariate analysis or unsupervised classification methods is to automate the process of clustering and linking of documents by topic. An expensive precursor was the method of repertory hypergrids, which requires expert rating of knowledge chunks against a number of discriminant traits (Boose, 1985, No. 6 in Appendix A; Waltz and Pollack, 1985, No. 33 in Appendix A; Bernstein et al., 1991, No. 4 in Appendix A; Madigan et al, 1995, No. 22 in Appendix A). Unfortunately, experience with automated techniques has shown that the user cannot readily associate transform axes with semantic meaning. In particular, open statistical issues in LSI include: (i) determining how many eigenvectors one should retain in the truncated expansion for the indices; (ii) determining subspaces in which latent semantic information can be linked with query keywords; (iii) efficiently comparing queries to documents (i.e., finding near neighbors in high-dimensional spaces); and (iv) incorporating relevance feedback from the user and other constraints.
For these reasons, it would be desirable to have an information retrieval system which addresses the various shortcomings of existing systems, including problems associated with the synonymy, polysemy, and term weighting limitations of those existing systems which employ the cosine metric for query to document comparisons.