The present invention is related to methods and apparatus for performing similarity searches in text based documents and, more particularly, to providing flexible indexing for text in order to compute similarity with a wide variety of similarity functions.
Similarity searches in text have become an important and well studied problem because of the recent proliferation of search engines which require techniques for finding closest matches to sets of documents. The amount of textual data on the world wide web has grown considerably in recent years and, as a result, the importance of such techniques continues to increase rapidly.
The similarity search problem is somewhat unique for the text domain because of the fact that there is no consensus on how similarity may be computed between pairs of documents. Some examples of functions for calculating similarity among documents are the cosine coefficient, the Dice coefficient, and the Jaccard coefficient, see, e.g., Frakes W. B., Baeza-Yates R. (editors), "Information Retrieval: Data Structures and Algorithms," Prentice Hall PTR, Upper Saddle River, N.J., 1992; and Salton G., McGill M. J., "Introduction to Modern Information Retrieval," McGraw Hill, N.Y., 1983, the disclosures of which are incorporated herein by reference. Different distance functions provide different orders of similarity of documents to a given target. Furthermore, not all distance functions are equally easy or difficult to calculate; some methods may require significantly more effort than others. For example, a simple function such as finding the number of words matching between two documents can be easily accomplished by a simple computation on the inverted representation of the documents. Other distance functions may require more sophisticated techniques which require the use of a vector space model. This makes similarity calculations significantly more difficult.
Accordingly, since the effect of the similarity function on indexability is a major issue in similarity text searches, a need exists for a flexible indexing method which provides for calculation of different similarity functions using the same index structure.
The present invention provides a flexible indexing method which supports calculation of different similarity functions using the same index structure. Specifically, in accordance with the invention, a flexible indexing technique is provided which enables the computation of a number of similarity functions effectively on text data by using the appropriate meta-information which is stored along with an inverted index. That is, meta-information is provided in the inverted index so that these similarity functions may be calculated effectively by accessing only a small amount of data. For this purpose, the first step is to build the inverted index so that the appropriate meta-information is stored together with the inverted representation.
In the inverted representation, along with each word identifier (ID), a list of document identifiers (IDs) is stored which corresponds to the documents which contain that word. In accordance with the invention, along with each document ID in the inverted representation, two pieces of information (meta-information) are stored:
(1) The length of the document: this corresponds to the length in the vector space representation, which may be obtained, in one embodiment, by taking the square root of the sum of the squares of the weights in the vector space representation. For example, let the document A have the weight vector (1, 0, 2, 0, 1, 0, 0, 3, 4, 1). Then, the length of the document is denoted by |A| and is equal to:
$\sqrt{(1 \cdot 1)+(2 \cdot 2)+(1 \cdot 1)+(3 \cdot 3)+(4 \cdot 4)+(1 \cdot 1)} = \sqrt{32}$
(2) The weight of the word in the vector for the corresponding document ID.
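By way of illustration only, the inverted representation with its two pieces of meta-information may be sketched as follows. The function names, the dictionary-based layout, and the (document ID, length, weight) tuple format are assumptions made for this sketch and are not prescribed by the description above.

```python
import math
from collections import defaultdict


def vector_length(weights):
    """Length of a weight vector: square root of the sum of squared weights."""
    return math.sqrt(sum(w * w for w in weights))


def build_inverted_index(documents):
    """Build an inverted index carrying the two pieces of meta-information.

    documents: dict mapping doc_id -> dict mapping word_id -> weight.
    Returns a dict mapping word_id -> list of (doc_id, doc_length, weight).
    The tuple layout is a hypothetical choice made for this sketch.
    """
    index = defaultdict(list)
    for doc_id, weights in documents.items():
        length = vector_length(weights.values())
        for word_id, weight in weights.items():
            if weight != 0:  # only words actually present are indexed
                index[word_id].append((doc_id, length, weight))
    return dict(index)


# Document A from the example above, with word IDs 0..9 (an assumed numbering).
doc_a = {i: w for i, w in enumerate((1, 0, 2, 0, 1, 0, 0, 3, 4, 1)) if w != 0}
index = build_inverted_index({"A": doc_a})
```

In this sketch the document length is repeated in every entry for that document, so that a search which reaches a document through any word list finds the length without a second lookup.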
The indexing technique of the invention uses the word IDs with non-zero weight in the target document (the target document being specified by the user) in order to find the closest set of matches in the document collection. In the first phase, all the word IDs in the target are processed one by one in order to find the value of the term g(u1, v1)+g(u2, v2)+ . . . +g(un, vn), which is the sum of the similarity values contributed by the words which are common among the target document and the document of the document collection being considered. A document is said to be relevant to the target when it has at least one word in common with the target document. The inverted index provides an easy technique for enumerating all the documents which are relevant to the target, namely by finding the union of all the document IDs which occur on at least one of the lists pointed to by the words in the target document. Let T be the target document with a weight vector which is equal to (t1, t2, . . . , tn). Then, the similarity search algorithm needs to examine only those components in the inverted index for which ti > 0. Let a(i, j) be the weight of the word i in document j. For each i such that ti > 0, and for each document j on the list for word i, g(ti, a(i, j)) is added to the hash table entry corresponding to document j. If no such hash table entry exists, then one corresponding to document j is created. At the same time, the length L(j) of document j is stored in an adjacent hash table entry whenever a new entry in the table is created.
Thus, for each hash table entry, there are two components: one corresponding to the sum of the various values of g(.,.) (let us call this entry X) and another corresponding to document length (let us call this entry Y). Further, let T here denote the length of the target document. Then, we compute the value of the expression F(X, Y, T) for each entry in the hash table. The document whose hash table entry yields the largest value of this computed function is finally reported.
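The two-phase search described above may be sketched as follows. This is a minimal illustration under stated assumptions: the inverted index is assumed to map each word ID to a list of (document ID, document length, weight) tuples, and the pairs (g, F) shown for the cosine, Dice, and Jaccard coefficients are the usual vector space forms of those coefficients, chosen here as examples rather than taken from the description.

```python
import math


def search(index, target, g, F):
    """Return (doc_id, score) for the closest document, or None if no
    document is relevant.

    index:  dict word_id -> list of (doc_id, doc_length, weight) (assumed layout)
    target: dict word_id -> weight of that word in the target document
    g, F:   the per-word function and the final function of the text
    """
    T = math.sqrt(sum(t * t for t in target.values()))  # target length
    table = {}  # hash table: doc_id -> [X, Y]
    for word_id, t in target.items():
        if t <= 0:
            continue  # only components with ti > 0 are examined
        for doc_id, length, a in index.get(word_id, ()):
            entry = table.get(doc_id)
            if entry is None:
                # New entry: store the document length Y alongside X.
                entry = table[doc_id] = [0.0, length]
            entry[0] += g(t, a)
    if not table:
        return None
    best = max(table, key=lambda d: F(table[d][0], table[d][1], T))
    return best, F(table[best][0], table[best][1], T)


def product(u, v):
    return u * v


# Example (g, F) pairs; assumed standard vector space forms.
COSINE = (product, lambda X, Y, T: X / (Y * T))
DICE = (product, lambda X, Y, T: 2 * X / (Y * Y + T * T))
JACCARD = (product, lambda X, Y, T: X / (Y * Y + T * T - X))

# Hypothetical two-document collection: D1 = {0: 1, 1: 1}, D2 = {0: 3}.
index = {
    0: [("D1", math.sqrt(2), 1), ("D2", 3.0, 3)],
    1: [("D1", math.sqrt(2), 1)],
}
target = {0: 1, 1: 1}
closest = search(index, target, *COSINE)
```

Only g and F change from one similarity function to another; the index and the accumulation loop stay the same, which is the sense in which a single index structure serves several similarity functions.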
Note that the hash table may be too large to fit into main memory. In that case, one solution is to divide the document collection into chunks and to find the closest match within each chunk using the procedure discussed above. The closest of the matches from the different chunks may then be reported as the final match.
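The chunked variant may be sketched as follows. The chunk size, the helper names, and the choice to build each chunk's index in memory on the fly are assumptions made to keep the sketch self-contained; in practice the per-chunk indexes would presumably be built in advance. The cosine form of g and F used in the example is likewise an assumed instance.

```python
import math
from collections import defaultdict
from itertools import islice


def chunks(documents, size):
    """Yield successive sub-dictionaries holding at most `size` documents."""
    it = iter(documents.items())
    while True:
        chunk = dict(islice(it, size))
        if not chunk:
            return
        yield chunk


def closest_in_chunk(chunk, target, g, F):
    """Run the hash-table procedure on one chunk; return (score, doc_id) or None."""
    index = defaultdict(list)
    for doc_id, weights in chunk.items():
        length = math.sqrt(sum(w * w for w in weights.values()))
        for word_id, w in weights.items():
            index[word_id].append((doc_id, length, w))
    T = math.sqrt(sum(t * t for t in target.values()))
    table = {}  # doc_id -> [X, Y]; only as large as one chunk requires
    for word_id, t in target.items():
        if t <= 0:
            continue
        for doc_id, length, a in index.get(word_id, ()):
            entry = table.setdefault(doc_id, [0.0, length])
            entry[0] += g(t, a)
    scored = [(F(x, y, T), doc_id) for doc_id, (x, y) in table.items()]
    return max(scored) if scored else None


def closest_match(documents, target, g, F, chunk_size):
    """Report the best of the per-chunk closest matches."""
    best = None
    for chunk in chunks(documents, chunk_size):
        candidate = closest_in_chunk(chunk, target, g, F)
        if candidate is not None and (best is None or candidate > best):
            best = candidate
    return best


# Hypothetical collection and target, scored with an assumed cosine form.
documents = {"D1": {0: 1, 1: 1}, "D2": {0: 3}, "D3": {1: 2, 2: 2}}
target = {0: 1, 1: 1}
result = closest_match(
    documents, target,
    g=lambda u, v: u * v,
    F=lambda X, Y, T: X / (Y * T),
    chunk_size=2,
)
```

Because each chunk is scored with the same function F, the per-chunk scores are directly comparable, and taking the maximum over chunks gives the same answer as a single pass over the whole collection.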
It is to be appreciated that given such a flexible indexing methodology according to the invention, the user may specify the similarity function that he desires to be used to compute the similarity search. That is, the user query may comprise a specification of a desired similarity function and a target document. Thus, based on the indexing methodology of the invention, a flexible similarity search may be performed.