The present invention is related to methods and apparatus for performing similarity searches in documents and, more particularly, to performing such searches based on conceptual indexing.
In recent years, the large amounts of data available on the world wide web has made-effective similarity search and retrieval an important issue. A similarity search has uses in many web applications such as search engines or in providing close matches for user queries. A related area is that of document-to-document queries, in which the target is an entire document, as opposed to a small number of words. Such a system has considerable use in recommender systems for a library or web application, in which it is desirable to find the closest matches to a document which is currently being browsed.
The similarity search has proven to pose an interesting problem in the text domain because of the unusually large size of dimensionality associated with the documents as compared to the size of the documents. For example, a typical document on the world wide web contains an average of about 60 to 80 words, out of a lexicon of about 100,000 words. Considerable correlations between words exist because of synonymity and different descriptions of the same underlying latent concepts. Thus, two documents containing very different vocabulary could be similar in subject material. While applying the typical similarity search method to search engines (which are a special case, in which the target document contains very few words), this problem is also observed. For example, while querying on cats, one may miss documents containing,a description on the feline species, which does not explicitly contain the word xe2x80x9ccat.xe2x80x9d Methods exist for query expansion by including synonyms, though these methods are strictly applicable to the short and specific queries of search engines, and are too inefficient and inaccurate to generalize to document-to-document similarity queries, in which the target is itself a document containing a multitude of concepts and subjects.
Another well known problem is that of polysemy, in which the same word could refer to multiple concepts in the description. For example, the word xe2x80x9cvirusxe2x80x9d could refer to a computer virus, or to a biological virus. Clearly, the ambiguity of the term can be resolved only by using it in the context of other terms in the document.
A well known method in the prior art for improving the quality of similarity search in text is called Latent Semantic Indexing (LSI), in which the data is transformed into a new space in which each dimension corresponds to a concept in the underlying data. This concept space depends upon the document collection in question, since different collections would have different sets of concepts. LSI is a technique which tries to capture this hidden structure using techniques from linear algebra. The idea in LSI is to project the data into a small subspace of the original data such that the noise effects, redundancies, and ambiguities are largely removed. The LSI approach is discussed in Kleinberg J., Tomkins A., xe2x80x9cApplications of linear algebra in information retrieval and hypertext analysis,xe2x80x9d Proceedings of the ACM SIGMOD Conference, 1999, the disclosure of which is incorporated by reference herein.
Note that LSI transforms the data from a sparse indexable representation, using an inverted index, to a representation in the real space which is no longer sparse. As is well known, the inverted index is an index in which a sparse database is indexed using its attributes. Even though the new representation is of much lower dimensionality (typically about 200 or so dimensions are needed to represent the concept space), this dimensionality is beyond the capacity of indexing structures such as R-Trees to handle. R-Trees are discussed in Guttman, A., xe2x80x9cR-Trees: A Dynamic Index Structure for Spatial Searching,xe2x80x9d Proceedings of the ACM SIGMOD Conference, 47-57, 1984. Thus, we have a difficult situation: if the data is represented in the original format using the inverted index, it is basically useless for performing a high quality document-to-document similarity search; on the other hand, when the data is transformed using latent semantic indexing, we have a data set which cannot be indexed effectively. Thus, if we have a very large collection of documents, we would either be reduced to using a sequential scan in order to perform conceptual similarity search, or have to do with lower quality search results using the original representation and ignoring the problems of synonymy and polysemy.
Another limitation of the prior art in performing document-to-document queries effectively is that a large fraction of the words in the document are unrelated to the general subject of the page. These words increase the noise effects associated with the target document, and reduce the likelihood of high quality search results. This is a problem that even latent semantic indexing cannot resolve very effectively.
Accordingly, a need exists for methods and apparatus for providing a high quality similarity search and a relatively low cost indexing methodology for use, for example, in accordance with document-to-document search applications.
The present invention provides methods and apparatus for performing similarity search in documents. A framework for providing conceptual similarity between documents is provided using clustering. The methodology of the invention also provides effective indexing techniques so that a similarity search can be performed on large collections of documents by accessing only a small percentage of the data. We demonstrate that our methodology performs substantially better than the standard method of using similarity search on the inverted index both in terms of quality and search efficiency.
In one aspect of the invention, a method of performing a conceptual similarity search comprises the steps of: generating one or more conceptual word-chains from one or more documents to be used in the conceptual similarity search; building a conceptual index of documents with the one or more word-chains; and evaluating a similarity query using the conceptual index. As will be explained in detail below, the present invention provides for the creation of a concept space comprising a new axis system which is defined by a set of these word-chains. Each word-chain comprises a set of closely related words (or topical vocabulary) associated with a given cluster, and the conceptual strength of document with respect to a word-chain defines how closely that document matches this vocabulary.
The word-chain generating step may comprise the steps of: initializing one or more word-chains to one or more sets of randomly selected documents; assigning one or more other documents to the one or more sets of randomly selected documents; concatenating the one or more documents in each set and removing less frequently occurring words from each word-chain; and merging the word-chains. The initializing, assigning, concatenating and merging steps are then iteratively repeated to generate a final set of word-chains.
The index building step may comprise the steps of: for each word-chain, finding the one or more documents with conceptual similarity to the word-chain; and retaining a list of identities of the one or more documents which have conceptual similarity not less than a predefined threshold value. The document identity preferably comprises a unique integer value.
The evaluating step preferably returns one or more of the closest documents resulting from the search; one or more matching word-chains in the one or more documents; and one or more matching topical words of the one or more documents. Particularly, the evaluating step may comprise the step of generating a conceptual representation of a target document associated with the similarity query. In one embodiment, the target document conceptual representation generating step comprises the steps of: calculating a similarity measure between the target document and each conceptual word-chain; determining whether each similarity measure is not less than a predetermined threshold value; and generating conceptual strength measures by respectively setting a conceptual strength measure to a similarity measure minus the predetermined threshold value, when the similarity measure is not less than a predetermined threshold value. The evaluating step may then comprise the step of finding a substantially close match to a target document among a plurality of indexed documents using the conceptual representation of the target document. In one embodiment, the finding step may comprise the steps of: finding one or more concepts in the target document; evaluating an inverted list associated with the indexed documents to find the one or more documents which have at least one concept in common with the target document; calculating a conceptual cosine of the one or more common concept documents to the target document; finding the closest document to the target document based on the conceptual cosine; and reporting an output statistic between the closest matching document and the target document. The closest document finding step may further comprise the steps of: reporting concepts which are present in the target document and the closest matching document; and finding a topical vocabulary which is common to the target document and the closest matching document and matching word-chains.
It is to be appreciated that, in accordance with the present invention, we use the inverted index in the context of an information retrieval application in which the sparsity of the data set arises out of the small.percentage of words from the entire lexicon which are present in a given document. In the inverted representation, for each word (or attribute for the text application), we have a list of document identifications (IDs) which correspond to all the documents which contain that word. Thus, there are as many lists as the number of words in the lexicon, and the length of a list corresponds to the number of documents which contain that word.