Text similarity measurement generally refers to methods of determining the degree of overlap between two portions of text. These methods are usually implemented during a search for a document to judge how well a document matches a given search query. However, these methods may also be used in other information retrieval problems such as document clustering (to match two documents and decide whether they should be in the same cluster) and the classification of documents.
A method for determining the similarity of portions of text is the cosine measure. For this measure, the portions of text are assumed to be represented as vectors with terms as the coordinates after appropriate weighting. An estimate of similarity is determined as the cosine of the angle between the vectors. Often the cosine is calculated as the dot product of the vectors after normalization.
The cosine measure is generally considered to have advantageous properties, such as fast computation, symmetry, and a simple range of values [0,1] (for non-negative term weights). However, this method assumes that the terms are independent, and each portion of text is simply treated as a ‘bag of terms’.
As a result, the method is limited in its ability to accurately capture the degree of similarity between two portions of text.
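As an illustration only, the cosine measure described above can be sketched as follows. The function names `term_vector` and `cosine_similarity` are hypothetical, and the sketch assumes whitespace tokenization and raw term-frequency weighting; a practical system would typically apply a different weighting scheme.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a portion of text as a bag of terms.

    Assumption: whitespace tokenization and raw term counts as weights.
    """
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two term vectors."""
    # Dot product over the terms the two vectors share.
    dot = sum(weight * vec_b[term] for term, weight in vec_a.items() if term in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty portion of text has no defined direction.
    return dot / (norm_a * norm_b)


# Word order is ignored, so these two sentences look identical to the measure.
score = cosine_similarity(term_vector("dog bites man"), term_vector("man bites dog"))
```

In this sketch `score` is approximately 1 even though the two sentences have different meanings, which illustrates the ‘bag of terms’ limitation.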
Other methods for determining the similarity between two portions of text are available; however, they generally have the same above-mentioned disadvantage as the cosine measure.
Another type of method for determining the similarity between two portions of text uses the co-occurrence of term pairs in each of the portions. For this type of method, a co-occurrence thesaurus is first constructed from the text portions; the thesaurus captures all the co-occurring term pairs in the portions. The similarity is then determined by reference to the amount of overlap between the co-occurring term pairs. Co-occurrence based measures capture some of the term dependencies, but they can be used only if a sufficiently large portion of text is available for generating a co-occurrence thesaurus, or if a co-occurrence thesaurus is already available.
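A much-simplified sketch of a co-occurrence based measure is given below for illustration. It is not taken from any particular system: the function names, the sliding `window` parameter, the whitespace tokenization, and the use of a Jaccard ratio to express the "amount of overlap" are all assumptions, and the set of pairs for each portion stands in for a full co-occurrence thesaurus.

```python
def cooccurring_pairs(text, window=3):
    """Collect unordered term pairs that fall inside a span of `window` consecutive tokens.

    Assumption: whitespace tokenization; `window` is a hypothetical parameter.
    """
    tokens = text.lower().split()
    pairs = set()
    for i, left in enumerate(tokens):
        # Pair each token with the (window - 1) tokens that follow it.
        for right in tokens[i + 1:i + window]:
            if left != right:
                pairs.add(frozenset((left, right)))
    return pairs


def pair_overlap(text_a, text_b, window=3):
    """Overlap between the co-occurring term pairs of two portions of text.

    Assumption: overlap is expressed as a Jaccard ratio (shared pairs / all pairs).
    """
    pairs_a = cooccurring_pairs(text_a, window)
    pairs_b = cooccurring_pairs(text_b, window)
    if not pairs_a or not pairs_b:
        return 0.0  # Too little text to form any pair.
    return len(pairs_a & pairs_b) / len(pairs_a | pairs_b)
```

Under these assumptions, very short portions of text yield few or no pairs, which reflects why such measures need a sufficiently large amount of text.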
U.S. Pat. No. 5,297,039 (Kanaegami et al.) proposes a similarity measure using syntactic relations between the terms. Text is first parsed to extract an ‘analysis network’ that consists of triplets of the form ‘(relation, element 1, element 2)’. The elements correspond to the nouns, and the relation is a term (usually a verb) syntactically close to elements 1 and 2. The similarity is measured by a sum of term agreements, pair agreements and line agreements between the corresponding analysis networks, after suitable weighting. Since the relations are themselves terms extracted from the text, this method does not overcome the problem of synonymity. Accordingly, it is difficult to calculate the term, pair and line agreements accurately.
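The synonymity problem can be illustrated with a rough sketch of a triplet-based score. This is only one possible reading of the description above, not the method of the cited patent: the definitions of term, pair and line agreement used here (shared terms, shared element pairs, and identical triplets, respectively), the `weights` values, and the function names are all assumptions made for illustration.

```python
def agreements(net_a, net_b):
    """Count (term, pair, line) agreements between two lists of triplets.

    Each triplet has the form (relation, element_1, element_2).
    Assumed definitions: a term agreement is a term found in both networks,
    a pair agreement is an element pair found in both, and a line agreement
    is a whole triplet found in both.
    """
    terms_a = {term for triplet in net_a for term in triplet}
    terms_b = {term for triplet in net_b for term in triplet}
    pairs_a = {frozenset((e1, e2)) for _, e1, e2 in net_a}
    pairs_b = {frozenset((e1, e2)) for _, e1, e2 in net_b}
    lines = set(net_a) & set(net_b)
    return len(terms_a & terms_b), len(pairs_a & pairs_b), len(lines)


def network_similarity(net_a, net_b, weights=(1.0, 2.0, 3.0)):
    """Weighted sum of the agreements; the weights are hypothetical."""
    return sum(w * count for w, count in zip(weights, agreements(net_a, net_b)))


# Two sentences with the same meaning but a synonymous relation term.
net_buy = [("buy", "customer", "car")]
net_purchase = [("purchase", "customer", "car")]
```

With these assumed definitions, `agreements(net_buy, net_purchase)` finds no line agreement because ‘buy’ and ‘purchase’ are compared as literal terms, so the score is lower than for two identical networks.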
The above discussion of documents, acts, materials, devices, articles or the like is included in this specification solely for the purpose of providing a context for the present invention. It is not suggested or represented that any or all of these matters formed part of the prior art base or were common general knowledge in the field relevant to the present invention as it existed before the priority date of each claim of this application.