Measuring the similarity between two texts is a fundamental problem in many natural language processing and information retrieval applications. Widely used similarity measures are vector-based methods, in which each of the texts being compared is represented by a term vector, and each term is associated with a weight that indicates its importance. A similarity function is then applied to the two vectors. Examples include the cosine measure (i.e., the inner product of the two term vectors after each is normalized to unit length, which is equivalent to a linear kernel on the normalized vectors) and other kernel functions such as the Gaussian kernel.
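As an illustration of the two similarity functions named above, the following is a minimal sketch rather than a definitive implementation. It assumes term vectors are stored as sparse Python dictionaries mapping each term to its weight; the function names `cosine_similarity` and `gaussian_kernel` and the bandwidth parameter `sigma` are hypothetical choices made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors (dict: term -> weight)."""
    # Inner product over the terms the two vectors share.
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: an empty vector is treated as having zero similarity.
        return 0.0
    return dot / (norm_u * norm_v)


def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||u - v||^2 / (2 * sigma^2))."""
    terms = set(u) | set(v)
    squared_distance = sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms)
    return math.exp(-squared_distance / (2.0 * sigma ** 2))
```

Both functions return larger values for more similar vectors. The cosine measure ignores vector length, whereas the Gaussian kernel depends on the distance between the unnormalized vectors and on the chosen `sigma`.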
Two factors largely determine the quality of a vector-based similarity measure. One is the vector operation that takes the term vectors as input and computes the final similarity score (e.g., cosine). The other is how the term vectors are constructed, including the term selection process and how the weights are determined. For example, a TF-IDF (term frequency-inverse document frequency) scheme for measuring document similarity may follow a bag-of-words strategy and include all the words in the document when constructing the term vectors. The weight of each term is the product of its term frequency (i.e., the number of occurrences in the document) and its inverse document frequency (i.e., a quantity that decreases as the number of documents in a collection that contain the term increases, commonly the logarithm of the collection size divided by that number).
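The bag-of-words TF-IDF construction described above can be sketched as follows. This is an illustrative example under stated assumptions, not a reference implementation: tokenization is lowercase whitespace splitting, the inverse document frequency is taken to be log(N / df), which is one common variant among several, and the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF term vector (dict: term -> weight) per document."""
    # Assumption: lowercase whitespace tokenization, with every word kept.
    tokenized = [doc.lower().split() for doc in docs]
    num_docs = len(tokenized)

    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)  # raw occurrence counts in this document
        vectors.append({
            # Assumption: idf = log(N / df); other smoothing variants exist.
            term: count * math.log(num_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


docs = ["the cat sat", "the dog sat down", "the cat ran"]
vectors = tfidf_vectors(docs)
```

In this small collection the word "the" occurs in every document, so its weight is zero under this variant, while a word that occurs in only one document, such as "dog", receives the largest inverse document frequency.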
While simple and reasonably performant, such approaches have a number of weaknesses. For example, when more information is available, such as the position of a term in the document or whether a term is part of an anchor text, incorporating that information in the similarity measure may not be easy.