1. Field of the Invention
The embodiments of the invention provide methods for obtaining improved text similarity measures.
2. Description of the Related Art
In data integration, data from multiple heterogeneous sources is combined and a unified view is presented to the end user for querying. The end user may query the unified view, for example, to gain business insights. Discovering a unified view includes identifying overlapping data sets. This is true for both structured and unstructured data.
For unstructured data, the similarity between documents is computed by comparing the terms that occur in the documents. Various techniques such as probabilistic models, support vector machines, cosine similarity measures, and Kullback-Leibler (KL) divergence measures have been proposed for computing the similarity measures.
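By way of illustration only, two of the measures named above, cosine similarity and KL divergence, may be sketched as follows. The sketch assumes each document has already been reduced to a mapping from terms to occurrence counts. The function names, the whitespace tokenization in the usage note, and the additive smoothing constant `eps` are assumptions made for this example and are not taken from any particular prior technique.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def kl_divergence(a: Counter, b: Counter, eps: float = 1e-9) -> float:
    """KL divergence D(P || Q) between smoothed term distributions.

    eps is a hypothetical additive smoothing constant; it keeps the
    logarithm finite when a term occurs in only one of the documents.
    """
    vocab = a.keys() | b.keys()
    total_a = sum(a.values()) + eps * len(vocab)
    total_b = sum(b.values()) + eps * len(vocab)
    kl = 0.0
    for t in vocab:
        p = (a[t] + eps) / total_a
        q = (b[t] + eps) / total_b
        kl += p * math.log(p / q)
    return kl
```

For example, with `doc1 = Counter("the cat sat on the mat".split())` and `doc2 = Counter("the dog sat on the log".split())`, `cosine_similarity(doc1, doc2)` evaluates to approximately 0.75. Cosine similarity is symmetric in its two arguments, whereas KL divergence generally is not.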
These techniques have also been extended to assess similarity between document sets. Many of these techniques represent each document or document set as a set of terms, where each term is a pair: the first element of the pair is a token in the document, and the second element is the frequency of occurrence of that token in the document. The token itself can be represented at multiple levels of granularity, such as a unigram, bigram, or trigram token.
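The term representation described above may be sketched, purely as an example, as follows. The function name `ngram_terms`, the lowercasing step, and the whitespace tokenization are assumptions made for this illustration; an actual system may tokenize differently.

```python
from collections import Counter
from typing import List, Tuple


def ngram_terms(text: str, n: int = 1) -> List[Tuple[str, int]]:
    """Return sorted (token, frequency) pairs for word n-grams of the text.

    n=1 yields unigram tokens, n=2 bigram tokens, n=3 trigram tokens.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    words = text.lower().split()  # assumed tokenization: split on whitespace
    # Slide a window of n words across the text; shorter texts yield no tokens.
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return sorted(Counter(grams).items())
```

For the text "the cat saw the cat", `ngram_terms(text, 1)` yields `[("cat", 2), ("saw", 1), ("the", 2)]`, and `ngram_terms(text, 2)` yields `[("cat saw", 1), ("saw the", 1), ("the cat", 2)]`.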