The present invention is related to methods and apparatus for performing similarity searches in documents and, more particularly, to performing such searches based on affinity lists.
In recent years, the importance of performing interactive searches on large collections of text has increased considerably because of the rapid increase in the size of the world wide web. Many search engines, such as Yahoo! (http://www.yahoo.com), Lycos (http://www.lycos.com) and AltaVista (http://www.altavista.com), are available in which the closest sets of matches are found to sets of keywords specified by the user. For such applications, the documents searched are represented in the form of an inverted index, see, Salton G., McGill M. J., “Introduction to Modern Information Retrieval,” McGraw Hill, New York, 1983. Other access methods such as signature files exist (see, e.g., Faloutsos C., “Access Methods for Text,” ACM Computing Surveys 17, Mar. 1, 1995; Frakes W. B., Baeza-Yates R. (editors), “Information Retrieval: Data Structures and Algorithms,” Prentice Hall PTR, Upper Saddle River, N.J., 1992), though the inverted representation seems to have become the method of choice in the information retrieval domain. The inverted representation consists of lists of document identifiers, one for each word in the lexicon. For each word w, its list contains the identifiers of all documents that contain that word. In addition, meta-information on word-frequency, position, or document length may be stored along with each identifier. For each user query, it suffices to examine the document identifiers in the inverted lists corresponding to the words in the query (or target).
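By way of illustration only, the inverted representation described above may be sketched as follows. The function names, the whitespace tokenization and the three sample documents are assumptions made for this sketch and are not taken from any of the cited references.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the sorted list of identifiers of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: lowercase, whitespace-separated tokens stand in for the lexicon.
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def candidate_documents(index, query):
    """Examine only the inverted lists of the query words and merge their identifiers."""
    hits = set()
    for word in query.lower().split():
        hits.update(index.get(word, []))
    return sorted(hits)


# Hypothetical sample collection.
documents = {
    1: "the cat sat on the mat",
    2: "a feline species of the savanna",
    3: "the jaguar is a large cat",
}
index = build_inverted_index(documents)
```

In this sketch a query on "cat" touches a single inverted list and yields documents 1 and 3. Document 2 is not reached, since it does not contain the query word.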
Considerable correlations between words exist because of synonymy and different descriptions of the same underlying latent concepts. Thus, two documents containing very different vocabulary could be similar in subject material. Similarly, two documents sharing considerable vocabulary could be topically very different. When the inverted representation is applied to search engines (which is a special application of similarity search, in which the target document contains very few words), this problem is observed in the form of retrieval incompleteness and inaccuracy. For example, while querying on cats, one may miss documents containing a description of the feline species, which do not explicitly contain the word “cat.” Methods exist for query extension via ad hoc feedback, relevance feedback, or automatic expansion, see, e.g., Hearst M. A., “Improving Full-Text Precision on Short Queries using Simple Constraints,” Proceedings of the Symposium on Document Analysis and Information Retrieval, April 1996; Mitra M., Singhal A., Buckley C., “Improving Automatic Query Expansion,” Proceedings of the ACM SIGIR Conference 1998, pages 206-214; Rocchio J. J., “Relevance feedback in Information Retrieval,” Journal of the American Society for Information Science, 34(4), 262-280; Salton G., Buckley C., “Improving Retrieval Performance by Relevance Feedback,” Journal of the American Society for Information Science, 41(4):288-297, 1990; and Xu J., Croft W. B., “Query Expansion Using Local and Global Document Analysis,” Proceedings of the ACM SIGIR Conference, 1996. Another well-known problem is that of polysemy, in which the same word could refer to multiple concepts in the description. For example, the word “jaguar” could refer to an automobile, or it could refer to the cat. Clearly, the ambiguity of the term can be resolved only by viewing it in the context of other terms in the document.
A well-known method for improving the quality of a similarity search in text is called Latent Semantic Indexing (LSI) in which the data is transformed into a new concept space, see, e.g., Dumais S., Furnas G., Landauer T., Deerwester S., “Using Latent Semantic Indexing to Improve Information Retrieval,” Proceedings of the ACM SIGCHI 1988, pages 281-285. This concept space depends upon the document collection in question, since different collections would have different sets of concepts. Latent semantic indexing is a technique which tries to capture this hidden structure using techniques from linear algebra. The idea in LSI is to project the data into a small subspace of the original data such that the noise effects of synonymy and polysemy are removed. A more detailed description may be found in, e.g., Kleinberg J., Tomkins A., “Applications of Linear Algebra in Information Retrieval and Hypertext Analysis,” Proceedings of the ACM SIGMOD Conference, 1999.
LSI transforms the data from a sparse, indexable representation (the inverted index) of very high overall dimensionality into a representation in real space which is no longer sparse. Even though the new representation is of much lower overall dimensionality (typically about 200 dimensions are needed to represent the concept space), it is beyond the capacity of spatial indexing structures such as R-Trees to handle effectively. R-Trees are discussed in, e.g., Guttman, A., “R-Trees: A Dynamic Index Structure for Spatial Searching,” Proceedings of the ACM SIGMOD Conference, 47-57, 1984.
Most of the existing art in improving search engine performance has focused on the modification of user queries in order to improve the quality of the results. These techniques generally fall into two broad categories: (i) global document analysis, and (ii) local document analysis, see, e.g., Xu J., Croft W. B., “Query Expansion Using Local and Global Document Analysis,” Proceedings of the ACM SIGIR Conference, 1996. In global analysis, the relationships of the words in the document collection are used in order to expand the queries, see, e.g., Crouch C. J., Yang B., “Experiments in Automatic Statistical Thesaurus Construction,” Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval, pages 77-88, 1992; Jing Y., “Automatic Construction of Association Thesaurus for Information Retrieval,” Integr. Study Artificial Intelligence and Cognitive Science Appl. Epistomel. (Belgium), Vol. 15, No. 1-2, pp. 7-34, 1998; Qiu Y., Frei H. P., “Concept Based Query Expansion,” Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pages 160-169, 1993; and Voorhees E., “Query Expansion Using Lexical Semantic Relations,” Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval, pages 61-69. The more popular method for query expansion is local analysis, in which the top ranked documents returned by the original query are assumed relevant, see, e.g., Buckley C., Mitra M., Walz J., Cardie C., “Using Clustering and SuperConcepts Within Smart: TREC 6,” Proceedings of the TREC Conference, 1998; Mitra M., Singhal A., Buckley C., “Improving Automatic Query Expansion,” Proceedings of the ACM SIGIR Conference 1998, pages 206-214; Rocchio J. J., “Relevance feedback in Information Retrieval,” Journal of the American Society for Information Science, 34(4), 262-280; Salton G., Buckley C., “Improving Retrieval Performance by Relevance Feedback,” Journal of the American Society for Information Science, 41(4):288-297, 1990. Terms from these returned documents are then used in order to expand the user queries. In addition, Boolean filters may be used in order to decide the relevance of the documents in the returned results, see, e.g., Hearst M. A., “Improving Full-Text Precision on Short Queries Using Simple Constraints,” Proceedings of the Symposium on Document Analysis and Information Retrieval, April 1996. These techniques are relevant to short user queries only, because for the case of very large documents the subject can be inferred only from the overall distribution of words in the initial document, and the initial query itself may have a considerable number of words which are unrelated to the document content. Furthermore, we will see that the inverted representation is not suitable from the performance perspective for large document targets. This also means that the process of feedback (which often requires repeated queries for modified targets) may be infeasible.
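By way of illustration only, query expansion by local analysis may be sketched as follows. The ranking by term overlap, the selection of expansion terms by document frequency, and all function and parameter names are assumptions made for this sketch; the cited references describe considerably more refined weighting schemes.

```python
from collections import Counter


def rank_by_overlap(documents, query_terms):
    """Rank document ids by the number of distinct query terms they contain."""
    scores = {}
    for doc_id, text in documents.items():
        overlap = len(set(text.lower().split()) & set(query_terms))
        if overlap > 0:
            scores[doc_id] = overlap
    # Ties are broken by document id so the ranking is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def expand_query(documents, query_terms, top_docs=2, new_terms=2):
    """Assume the top ranked documents are relevant and borrow their commonest terms."""
    top = rank_by_overlap(documents, query_terms)[:top_docs]
    counts = Counter()
    for doc_id in top:
        for word in set(documents[doc_id].lower().split()):
            if word not in query_terms:
                counts[word] += 1
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return list(query_terms) + ranked[:new_terms]


# Hypothetical sample collection.
documents = {
    1: "jaguar car engine",
    2: "jaguar car dealer",
    3: "jaguar cat jungle",
}
expanded = expand_query(documents, ["jaguar"], top_docs=2, new_terms=1)
```

Each expansion requires an initial retrieval pass followed by a second pass with the modified query, which is the repeated querying referred to above.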
Accordingly, a need exists for methods and apparatus for performing similarity searches in a search engine so that the quality of the results of the search engine is less sensitive to the choice of search terms.
The present invention provides methods and apparatus for performing similarity searches in a search engine so that the quality of the results of the engine is less sensitive to the choice of search terms. This is accomplished by improved methodologies for performing affinity based similarity searches. Traditional search engines have the disadvantage of not being able to effectively find documents containing terms related to the search terms. This invention overcomes this issue by iteratively searching for an optimal query so as to create the best possible query results. It is to be appreciated that the term “affinity” as used herein is a measure of how likely it is that a term y will occur in a document when the term x is also present in the document. Thus, affinity is based on the relative presence of terms in the lexicon of terms of documents used in the search. The affinity between two documents is the average pairwise affinity between any pair of terms in the two documents. These concepts will be explained in greater detail below.
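By way of illustration only, one possible reading of these definitions may be sketched as follows. The estimate of term affinity as the fraction of documents containing x that also contain y is an assumption made for this sketch, as are the function names and the sample collection; the precise formulation is the one given in the detailed description.

```python
def term_affinity(collection, x, y):
    """Assumed estimate: fraction of documents containing x that also contain y."""
    with_x = [terms for terms in collection if x in terms]
    if not with_x:
        return 0.0
    return sum(1 for terms in with_x if y in terms) / len(with_x)


def document_affinity(collection, doc_a, doc_b):
    """Average pairwise term affinity over all pairs (x in doc_a, y in doc_b)."""
    pairs = [(x, y) for x in doc_a for y in doc_b]
    if not pairs:
        return 0.0
    return sum(term_affinity(collection, x, y) for x, y in pairs) / len(pairs)


# Hypothetical collection, each document reduced to its set of terms.
collection = [
    {"jaguar", "car"},
    {"jaguar", "cat"},
    {"car", "engine"},
    {"cat", "feline"},
]
```

Under this assumed estimate the measure is not symmetric: in the sample collection every document containing "feline" also contains "cat", but only half of the documents containing "cat" contain "feline".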
In the present invention, methodologies are provided which use affinity lists in order to perform query retrieval more effectively and efficiently. The invention comprises a two phase method. In the first phase, we find a threshold number k of candidate documents which are retrieved by the method. In the second phase, we calculate the affinity value to each of these k documents and report them in ranked order of affinities. The first phase of finding the k most valuable candidates is accomplished using an iterative technique on the affinity lists, which will be explained in detail below. Once these candidates have been found, the affinity to each document in the set is obtained, and the resulting documents are rank ordered by affinity to the target document.
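By way of illustration only, the two phase structure may be sketched as follows. The first phase shown here is a simple placeholder that grows the set of query terms from hypothetical affinity lists until at least k documents match; it is not the iterative technique of the invention, which is set out in the detailed description. The affinity estimate, the function names, the parameters list_size and max_rounds, and the sample collection are all assumptions made for this sketch.

```python
def term_affinity(docs, x, y):
    """Assumed estimate: fraction of documents containing x that also contain y."""
    with_x = [terms for terms in docs.values() if x in terms]
    if not with_x:
        return 0.0
    return sum(1 for terms in with_x if y in terms) / len(with_x)


def document_affinity(docs, target, other):
    """Average pairwise term affinity between two sets of terms."""
    pairs = [(x, y) for x in target for y in other]
    if not pairs:
        return 0.0
    return sum(term_affinity(docs, x, y) for x, y in pairs) / len(pairs)


def affinity_list(docs, term, size):
    """Hypothetical affinity list: the `size` other terms of highest affinity to `term`."""
    lexicon = set().union(*docs.values()) - {term}
    scored = [(term_affinity(docs, term, y), y) for y in lexicon]
    scored = [(a, y) for a, y in scored if a > 0]
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [y for _, y in scored[:size]]


def find_candidates(docs, target, k, list_size=2, max_rounds=5):
    """Phase one (placeholder): expand terms via affinity lists until k documents match."""
    terms = set(target)
    matches = []
    for _ in range(max_rounds):
        matches = [d for d, words in docs.items() if words & terms]
        if len(matches) >= k:
            break
        expanded = set(terms)
        for t in terms:
            expanded.update(affinity_list(docs, t, list_size))
        if expanded == terms:  # nothing new to add, stop iterating
            break
        terms = expanded
    # Keep at most k candidates, preferring documents that share more terms.
    matches.sort(key=lambda d: (-len(docs[d] & terms), d))
    return matches[:k]


def similarity_search(docs, target, k):
    """Phase two: compute the affinity to each candidate and report in ranked order."""
    candidates = find_candidates(docs, target, k)
    scored = [(document_affinity(docs, target, docs[d]), d) for d in candidates]
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [d for _, d in scored]


# Hypothetical collection, each document reduced to its set of terms.
docs = {
    1: {"jaguar", "car"},
    2: {"car", "engine"},
    3: {"jaguar", "cat"},
    4: {"cat", "feline"},
    5: {"stock", "market"},
}
results = similarity_search(docs, {"feline"}, 3)
```

In this sketch a target containing only "feline" reaches document 3, which does not contain that word, because "cat" is drawn in from the affinity list of "feline" during the first phase.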
One important advantage of using this technique over the existing art is that the invention takes into account words which are not included in the query but are indeed relevant. For example, while querying on “jaguars,” one may also add a word or words, such as “cars,” which increase the specificity and effectiveness of the query.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.