The Confound of Document Length in Semantic Similarity Calculation
A number of statistical methods have been developed for evaluating the semantic similarity between documents by representing each document as a vector $\vec{d}$ in a vector space, and defining similarity, or semantic relatedness, as some function $\mathrm{sim}(\vec{d}_1, \vec{d}_2)$ of the vectors corresponding to two documents. Often, this similarity metric may be the cosine of the angle between the two vectors, defined as
$$\cos(\vec{d}_1, \vec{d}_2) = \frac{\vec{d}_1 \cdot \vec{d}_2}{\|\vec{d}_1\| \, \|\vec{d}_2\|},$$
but other measures (such as the city block metric or the Euclidean distance) are used as well.
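The three measures named above can be sketched in a few lines of Python. This is a minimal illustration on dense vectors given as plain lists; the function names are chosen here for illustration, and returning 0 for a zero-length vector is a convention assumed for this sketch rather than something the text specifies.

```python
import math


def cosine_similarity(d1, d2):
    """Cosine of the angle between two vectors of equal dimensionality."""
    if len(d1) != len(d2):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumed convention for an all-zero (empty) document vector
    return dot / (norm1 * norm2)


def city_block_distance(d1, d2):
    """Sum of absolute component differences (a distance, not a similarity)."""
    return sum(abs(a - b) for a, b in zip(d1, d2))


def euclidean_distance(d1, d2):
    """Straight-line distance between the two vector endpoints."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))
```

Note that cosine ignores vector magnitude, whereas the two distance measures do not, so the three respond differently to raw differences in document length.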
The most basic vector-based method for evaluating the semantic similarity between documents is content-vector analysis (CVA) (Salton & McGill 1983), in which one component of the document vector corresponds to each word which is considered as potentially occurring within it, and the non-zero elements of the vector contain a weighted function of the count with which a particular word occurs in the document. Other vector-space techniques, including Latent Semantic Analysis (LSA) (Deerwester et al. 1990; Landauer & Dumais 1997) and Random Indexing (RI) (Kanerva, Kristoferson, & Holst 2000; Sahlgren 2001; 2006), use dimensionality reduction to assign related words similar representations in the vector space.
One aspect of these vector-space semantic similarity estimates which has so far received little attention in the literature surrounding them is their dependence on the length of the texts to be compared. Baayen (2001) demonstrated that many estimates of lexical diversity (such as type-token ratio) are not invariant as the length of the text considered changes, and document length has a similar confounding effect on vector-based similarity scores. Two simple experiments will demonstrate this effect.
The first demonstration involves texts from the Lexile collection, a set of general fiction titles spanning a wide range of grade-school reading levels, which the author's institution licenses from the Metametrics Corporation. Four hundred documents were selected at random from this collection and truncated so that 100 included only the first 500 word tokens, 100 included only the first 1,000 word tokens, 100 included only the first 5,000 word tokens, and the final 100 included only the first 10,000 word tokens. Two sets of similarity scores were calculated for each pair of documents in this collection (excluding duplicates).
The first set of similarity scores was created using a simple CVA model with tf*idf weighting, with log term weights and an inverse document frequency term equal to $\log\left(\mathrm{NDocs} / \mathrm{DocFreq}_k\right)$ for each term $k$, where the document frequency estimates were derived from the TASA corpus of high school level texts on a variety of academic subjects. (Thanks to Thomas Landauer for making this data available for research purposes.)
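A CVA vector of this kind might be built as in the following sketch. The text says only that log term weights were used, so the specific form 1 + ln(tf) is an assumption, as is the choice to skip terms with no document frequency estimate. The toy `doc_freq` table and `n_docs` value are hypothetical stand-ins for counts derived from a reference corpus such as TASA.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, n_docs):
    """Sparse content vector {term: weight} with log tf and log idf weighting.

    Assumption: log term weight = 1 + ln(tf). The idf term follows the text:
    ln(n_docs / doc_freq[term]). Terms absent from doc_freq are skipped.
    """
    vector = {}
    for term, tf in Counter(tokens).items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # no frequency estimate for this term (assumed handling)
        vector[term] = (1.0 + math.log(tf)) * math.log(n_docs / df)
    return vector


def sparse_cosine(v1, v2):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v2[t] for t, w in v1.items() if t in v2)
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


# Hypothetical document frequencies from a 100-document reference corpus.
doc_freq = {"the": 100, "whale": 5, "ship": 10, "garden": 8}
n_docs = 100
a = tfidf_vector("the whale and the ship".split(), doc_freq, n_docs)
b = tfidf_vector("the ship in the garden".split(), doc_freq, n_docs)
similarity = sparse_cosine(a, b)
```

In this toy example the word "the" occurs in every reference document, so its idf is zero and it contributes nothing to the similarity; only the shared term "ship" does.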
The second set of similarity scores was created using an RI model with parameters similar to those used for the CVA model. The Random Indexing model used co-occurrence of words within the same document in the TASA corpus as the basis for dimensionality reduction, and also used the TASA corpus to estimate inverse document frequency values for individual terms. Document vectors were produced as the tf*idf-weighted sum of the vectors of the terms occurring within the document, again with log weighting of both term frequencies and inverse document frequencies.
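The general shape of such a model can be illustrated with a small document-based Random Indexing sketch. Several details here are assumptions rather than the configuration used in the experiment: the sparse ternary index vectors, the dimensionality and number of nonzero entries, counting each term once per training document, and the 1 + ln(tf) form of the log term weight. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def index_vector(dim, nonzeros, rng):
    """Sparse ternary random index vector: a few +1/-1 entries, zeros elsewhere."""
    vec = [0.0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec


def train_term_vectors(corpus, dim=64, nonzeros=4, seed=0):
    """Term vector = sum of the index vectors of the documents containing the term."""
    rng = random.Random(seed)
    term_vectors = {}
    doc_freq = Counter()
    for tokens in corpus:
        ivec = index_vector(dim, nonzeros, rng)
        for term in set(tokens):  # assumed: one contribution per document
            doc_freq[term] += 1
            tvec = term_vectors.setdefault(term, [0.0] * dim)
            for i, x in enumerate(ivec):
                tvec[i] += x
    return term_vectors, doc_freq


def document_vector(tokens, term_vectors, doc_freq, n_docs):
    """tf*idf-weighted sum of term vectors; unknown terms are ignored."""
    dim = len(next(iter(term_vectors.values())))
    doc = [0.0] * dim
    for term, tf in Counter(tokens).items():
        if term not in term_vectors:
            continue
        weight = (1.0 + math.log(tf)) * math.log(n_docs / doc_freq[term])
        for i, x in enumerate(term_vectors[term]):
            doc[i] += weight * x
    return doc
```

Because terms that occur in the same training documents accumulate the same index vectors, related words end up with similar representations without an explicit matrix decomposition.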
For each of these methods, cosine was used as the similarity metric.
FIGS. 1A and 1B show scatterplots of the similarity scores calculated by these methods for the Lexile data set against the variable gTypes, which is the geometric mean of the number of word types in the two documents to be compared. (The geometric mean of the number of word types was found to correlate more strongly with CVA and RI similarity scores than either the arithmetic or harmonic means.) A strong positive correlation is obvious for both CVA and RI similarity scores, with the relationship between RI similarity and gTypes approximately log-linear. (The Pearson product-moment correlation between CVA similarity and gTypes is 0.86, while the correlation between RI similarity and log(gTypes) is 0.89.)
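For concreteness, the gTypes variable and the correlation statistic can be computed as in the sketch below. The function names are hypothetical, and documents are assumed to be available as lists of word tokens.

```python
import math


def g_types(tokens1, tokens2):
    """Geometric mean of the number of distinct word types in two documents."""
    return math.sqrt(len(set(tokens1)) * len(set(tokens2)))


def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

Given a list of similarity scores `sims` and the matching list of gTypes values `g` for every document pair, the CVA relationship would be summarized by `pearson(sims, g)` and the log-linear RI relationship by `pearson(sims, [math.log(v) for v in g])`.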
In fact, this dependency of similarity on length is not primarily related to the discourse structure of longer texts vs. shorter texts, or anything specific to the order of words in the text at all, as is shown by a second experiment. A second collection of 400 documents (the LM data set) was generated as random collections of words, according to a unigram language model probability distribution based on the frequencies of words in the TASA corpus. The lengths of these documents were chosen to span the range of approximately 0-15,000 word tokens. As FIG. 2 demonstrates, the similarity scores between these randomly composed documents show almost exactly the same correlation with text length as observed with the fiction texts from Lexile. (On the LM data set, the correlation between CVA similarity and gTypes is 0.92, while the correlation between RI similarity and log(gTypes) is 0.94.)
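A scaled-down version of this second experiment can be simulated directly. The sketch below is illustrative only: a Zipf-like distribution over a synthetic 5,000-word vocabulary stands in for the TASA unigram frequencies, raw term counts are used in place of tf*idf weights, and the lengths, pair count, and seed are arbitrary choices.

```python
import math
import random
from collections import Counter


def sample_document(vocab, weights, length, rng):
    """Random 'document': term counts for `length` draws from a unigram model."""
    return Counter(rng.choices(vocab, weights=weights, k=length))


def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(n * c2[t] for t, n in c1.items() if t in c2)
    n1 = math.sqrt(sum(n * n for n in c1.values()))
    n2 = math.sqrt(sum(n * n for n in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def mean_similarity(length, pairs, vocab, weights, rng):
    """Average cosine over independently sampled document pairs of one length."""
    total = 0.0
    for _ in range(pairs):
        doc_a = sample_document(vocab, weights, length, rng)
        doc_b = sample_document(vocab, weights, length, rng)
        total += cosine(doc_a, doc_b)
    return total / pairs


rng = random.Random(42)
vocab = list(range(5000))                        # synthetic word ids
weights = [1.0 / (rank + 1) for rank in vocab]   # assumed Zipf-like frequencies
results = {n: mean_similarity(n, 20, vocab, weights, rng)
           for n in (100, 1000, 10000)}
```

Although every document is drawn from the same distribution and so has the same "topic", the mean similarity in `results` rises with document length, which is the pattern the LM data set exhibits.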
The CVA and RI vectors corresponding to a document are constructed as the sum of vectors for the terms it contains, and these terms represent only a sample from the vocabulary distribution which is representative of the meaning of the document in question. For the random documents in the LM data set, the distribution which is approximated in each document is the unigram distribution from which each document was constructed. As the documents increase in length, the law of large numbers indicates that their semantic vectors will converge to the mean vector of the distribution, and therefore that the similarity between vectors will converge to $\mathrm{sim}(\vec{d}_{\mathrm{mean}}, \vec{d}_{\mathrm{mean}})$, which in the case of cosine similarity equals 1. Even when the documents are not randomly composed, it can be assumed that document topics are not so sharply delimited in terms of their vocabulary that the vectors for two different documents will tend to converge to orthogonal vectors as they increase in length. Indeed, the results on the Lexile data set indicate that, at least for these texts, there is a large amount of general-purpose vocabulary that is common across topics, and causes document vectors to converge to a similar mean vector as they get longer. In fact, these observations hold even for methods in which the vector representing a document is not simply defined as a sum of term vectors (such as LSA). In any case, the vectors for longer documents will be more stable, whereas vectors for shorter documents will tend to vary more from that expected for a document on that topic.
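The convergence argument can be written out in a simplified setting. The following is a sketch under assumptions not stated in the text: each document vector is an unweighted sum of term vectors $\vec{v}_w$, the words $w_1, w_2, \ldots$ of each document are drawn independently from a unigram distribution $p$ over the vocabulary, and the mean vector $\vec{\mu}$ is nonzero.

$$\frac{1}{n}\,\vec{d}^{(n)} = \frac{1}{n}\sum_{i=1}^{n} \vec{v}_{w_i} \;\longrightarrow\; \vec{\mu} = \sum_{k} p_k\,\vec{v}_k \qquad (n \to \infty)$$

$$\cos\!\left(\vec{d}_1^{(n)}, \vec{d}_2^{(m)}\right) = \cos\!\left(\tfrac{1}{n}\,\vec{d}_1^{(n)}, \tfrac{1}{m}\,\vec{d}_2^{(m)}\right) \;\longrightarrow\; \cos(\vec{\mu}, \vec{\mu}) = 1 \qquad (n, m \to \infty)$$

The first line is the law of large numbers applied to each vector component. The second uses the fact that the cosine is unchanged when either vector is rescaled by a positive constant, so dividing each sum by its token count does not alter the similarity.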
This confounding of semantic similarity with length is pernicious in a number of natural language processing (NLP) applications. When these similarity scores are used as features for making predictions in applications such as text categorization (Cardoso, Cachopo, & Oliveira 2003) and document clustering (Liu et al. 2005), they are intended as measures of the topical similarity of documents, and not document length. When document length is also relevant to the task, it may be added to the model as an additional feature, but keeping conceptually distinct features independent in the model is likely to result in higher classification accuracy. Another issue is that classification accuracy is not the only criterion to be optimized in many NLP applications using semantic similarity features—in some tasks, such as automatic essay grading (Burstein 2003; Attali & Burstein 2006; Landauer, Laham, & Foltz 2003), there is an additional requirement of validity, that scores be determined on the basis of features which are plausibly related to student performance. Assigning essay scores on the basis of the relatedness of an essay to a particular topic clearly involves different claims about the writing process than assigning scores based on the length of the essay.
Pivoted Normalization
The 1990s marked a realization in the information retrieval community that the relevance rankings produced by standard metrics interacted with the length of indexed documents in a way which could be exploited to improve performance. In particular, as shown by Singhal, Buckley & Mitra (1996), the relevance of longer documents tends to be underestimated by metrics such as the cosine similarity between the document and the query, whereas the relevance of shorter documents tends to be overestimated. The method proposed by these authors to address this disparity, which has gained wide currency in the intervening decade, is known as pivoted document length normalization.
Singhal et al. begin with the observation that the foundation for many IR relevance measures is the dot product of a document vector with a query vector ($\vec{d} \cdot \vec{q}$), which is then divided by a normalization term to account for two effects of longer documents: the greater number of terms, and the higher frequency of terms within the document. Common normalization terms include the cosine normalization $\|\vec{d}\|$ and the number of word types in the document.
As previously mentioned, however, this normalization does not perfectly account for the effects of document length, since long documents are still deemed relevant less often than would be expected based on human judgments, and the converse is true for short documents. To rectify this problem, Singhal et al. introduce a linear transformation of the normalization term used in calculating relevance estimates:
$$\mathrm{NewNormalization} = (1.0 - s) + s \times \frac{\mathrm{OldNormalization}}{\mathrm{AverageOldNormalization}}$$
The parameter s is a slope parameter determining the magnitude of the change in the normalization term for a document of a given length, and can be determined on the basis of a set of documents with human relevance judgments, in order to minimize the discrepancy between documents' probability of relevance and probability of retrieval. Given a slope value between 0 and 1, the normalization term will be decreased for documents whose length is greater than the average value (the pivot), which will cause their relevance estimates to be increased. FIG. 3A, modified from Singhal et al. (1996), illustrates the shift in retrieval probabilities which pivoted normalization is intended to yield.
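The transformation is simple enough to state as code. The sketch below follows the formula as given above; the function and parameter names are chosen here for illustration, and the slope value used in the example is arbitrary rather than a fitted value.

```python
def pivoted_normalization(old_norm, average_old_norm, slope):
    """(1 - s) + s * (old / average): equals 1.0 exactly at the pivot."""
    if not 0.0 <= slope <= 1.0:
        raise ValueError("slope is expected to lie between 0 and 1")
    if average_old_norm <= 0.0:
        raise ValueError("average normalization must be positive")
    return (1.0 - slope) + slope * (old_norm / average_old_norm)


def pivoted_relevance(dot_product, old_norm, average_old_norm, slope):
    """Document-query dot product divided by the pivoted normalization term."""
    return dot_product / pivoted_normalization(old_norm, average_old_norm, slope)


# With an average normalization of 10 and an illustrative slope of 0.25:
at_pivot = pivoted_normalization(10.0, 10.0, 0.25)   # document of average length
long_doc = pivoted_normalization(20.0, 10.0, 0.25)   # twice the average
short_doc = pivoted_normalization(5.0, 10.0, 0.25)   # half the average
```

Relative to the unpivoted ratio of old to average normalization (2.0 and 0.5 in the example), the long document's term shrinks to 1.25 and the short document's grows to 0.875, so long documents are penalized less and short documents more. A slope of 1 recovers the original normalization up to a constant factor.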
Earlier work using vector-space methods of semantic similarity calculation for NLP tasks has taken note of the confounding effect of document length only sporadically and in passing. In work on the automated scoring of essays, in which features based on vector-space techniques are commonly used, some authors have calculated partial correlations of these features with essay scores given the length of the essay, because the length of the essay is well known to be a strong predictor of essay scores (Page, 1968).
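A first-order partial correlation of the kind mentioned above can be computed from the three pairwise Pearson correlations. The sketch below uses the standard formula; the variable roles (a similarity feature x, an essay score y, and essay length z) and the toy data in the usage note are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(xs, ys, zs):
    """Correlation of x and y after removing the linear effect of z from both."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
```

As a toy illustration, with z = [1, 1, -1, -1], x = [2, 0, 0, -2], and y = [2, 0, -2, 0], the raw correlation between x and y is 0.5, but the partial correlation given z is zero, because the two variables are related only through their shared dependence on z.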
Similarly, Penumatsa et al. (2006) note that the average LSA similarity score between student answers and model answers increases with the length of the student answer, so that length-dependent threshold values on similarity have to be used in assessing how close a student's answer is to the desired response.
Finally, Higgins (2007) does take note of the confound of essay length with similarity for vectors in a Random Indexing space (in the course of developing a model of student essay coherence), but does not recognize that this is a more general problem that holds for all vector-space techniques.