Ascertaining the similarity between two documents is useful for searching databases to find the document that best matches a query or the document most like a particular search document, where the meaning of “most like” will vary according to the application. Ascertaining similarity is also useful for removing duplicate documents from a database, for cataloging or indexing documents, and for calculating supply of similar documents or data objects. Many different approaches have been tried.
For example, the current state of the art in assessing document similarity is exemplified by an approach developed by Thomas Hofmann. Hofmann's method for learning the similarity of documents is explained in “Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization,” in Advances in Neural Information Processing Systems 12, S. A. Solla, T. K. Leen, and K.-R. Müller, eds., pp. 914-920, MIT Press, 2000. This method uses probabilistic latent semantic analysis (PLSA) to create vectors describing documents and then measures the similarity of those vectors. As explained in “Probabilistic Latent Semantic Indexing,” in Proceedings of the 22nd International Conference on Research and Development in Information Retrieval (SIGIR '99), pp. 50-57, ACM, 1999, by Thomas Hofmann, PLSA models documents as memoryless information sources (i.e., bags of words in which the importance of a word is not related to the structure of the document or the occurrence of other words in the document). The model assumes that the documents are combinations of “latent classes” or factors, each of which has a different probability distribution over words, and attempts to learn two things:
(1) the set of latent factors that explains a corpus of documents by maximum likelihood estimation (this part is PLSA). After the system's parameters are learned, to assess the similarity of two documents, the documents are decomposed into their factor representations, and the system then assesses the similarity of their factor representations (using dot product to measure similarity); and
(2) the similarity of the actual words in the document (again using dot products), where the importance of each word is weighted by how well it is explained by the factors in the context of each document.
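The dot-product comparison of factor representations mentioned above can be illustrated with a minimal sketch. It is not Hofmann's implementation: the three-factor mixture vectors and the names `dot`, `doc_a`, `doc_b`, and `doc_c` are hypothetical, and the step that learns the factors (PLSA itself) is assumed to have been performed already.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))

# Hypothetical factor representations: each document is described by its
# mixture over three latent factors (the values are illustrative only).
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]

sim_ab = dot(doc_a, doc_b)  # documents dominated by the same factor
sim_ac = dot(doc_a, doc_c)  # documents dominated by different factors
```

Under these assumed values, `sim_ab` exceeds `sim_ac`, reflecting that the first two documents draw mostly on the same latent factor.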
In a supervised setting, e.g. text classification, the similarity functions can be used to create very effective classifiers, as the author demonstrates empirically. Because this method is unsupervised and uses the bag-of-words assumption (that the importance of a word is not related to the structure of the document or the occurrence of other words in the document), the derived similarity function is not able to exploit or account for application-specific features and structure of documents that make them more or less similar. It is not able to account for different nuances of “similarity” that might occur in different applications. For example, documents such as resumes have application-specific reasons to weigh a job title in a resume very heavily. As another example, a college application has application-specific reasons to weigh heavily the names of classes taken. Hofmann's system is also more difficult to train than more conventional learning approaches, such as neural networks, because of the large numbers of parameters that must be learned.
U.S. Pat. No. 5,461,698, Robert W. Schwanke, et al., METHOD FOR MODELLING SIMILARITY FUNCTION USING NEURAL NETWORK, takes a different approach. This patent describes a method of learning a similarity function that accounts for an a priori known clustering of objects. The assignment of objects to groups must be known before learning the similarity function. The particular application area of this patent is understanding the structure of a software system composed of modules, declarations, and so on. The neural network described takes as input the raw features of three objects A, B, and C, where A and B are from the same cluster and C is outside the cluster. Through training with many such triples, the network must learn a similarity function able to predict that A and B are more similar to each other than either is to C. The inventors derive their model incrementally, using first a set of classifications of the objects and then a partial set of similarity judgments like “A is more like B than C is”.
This method uses discrete features (e.g., the presence or absence of some name) rather than continuous variables. Set operations therefore make sense in its particular area of application, the assignment of an object to a category, but the method is less useful if the intent is to describe similarity according to continuously varying features.
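The training triples described above can be sketched as follows. This is an illustration only, not the method of the cited patent: the cluster names, object names, and feature sets are hypothetical, and the Jaccard set overlap is an assumed stand-in for a learned similarity function, chosen because it operates on discrete features.

```python
from itertools import combinations

def make_triples(clusters):
    """Build (a, b, c) triples: a and b share a cluster, c lies outside it."""
    triples = []
    for name, members in clusters.items():
        outsiders = [obj for other, objs in clusters.items()
                     if other != name for obj in objs]
        for a, b in combinations(members, 2):
            for c in outsiders:
                triples.append((a, b, c))
    return triples

def jaccard(x, y):
    """Set-overlap similarity on discrete features (assumed stand-in)."""
    union = x | y
    return len(x & y) / len(union) if union else 1.0

def triple_satisfied(sim, features, triple):
    """True if a and b are more similar to each other than either is to c."""
    a, b, c = triple
    s_ab = sim(features[a], features[b])
    return (s_ab > sim(features[a], features[c])
            and s_ab > sim(features[b], features[c]))

# Hypothetical objects: software modules described by the names they use.
features = {
    "m1": {"open", "read", "close"},
    "m2": {"open", "read", "seek"},
    "m3": {"draw", "paint"},
    "m4": {"draw", "paint", "fill"},
}
clusters = {"io": ["m1", "m2"], "gfx": ["m3", "m4"]}

triples = make_triples(clusters)
all_ok = all(triple_satisfied(jaccard, features, t) for t in triples)
```

A learning procedure would adjust the similarity function until as many triples as possible are satisfied; here the fixed overlap measure is only checked against them.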
There have been two attempts to learn application-specific similarity functions in a supervised manner, given measurements of features of objects at the input and a teaching signal of similarity at the output. The first of these is described in “Feature Abstraction from Similarity Ratings: A Connectionist Approach,” by Peter M. Todd and David E. Rumelhart. Todd and Rumelhart propose a neural network solution to a long-standing problem in psychology: what feature dimensions and similarity measures do humans use when judging the similarity of pairs of objects drawn from some set? Thus, they offer a solution to the problem of how to predict human similarity ratings for stimuli from a set of physical feature measurements.
Todd and Rumelhart's model combines the strengths of geometric models of similarity (e.g. multidimensional scaling) with feature set matching. Geometric models suffer from the problem that they ignore the actual features of the stimuli being compared and cannot predict the similarity of (generalize to) previously unseen stimulus pairs, whereas featural models previously lacked feature abstraction abilities: they could not infer the stimulus feature dimensions relevant to predicting human similarity judgments.
The Todd and Rumelhart model begins with input feature measurements from each stimulus. These inputs are followed by a layer of feature abstraction units, which form weighted combinations of the input features. The abstract feature extraction layer is followed by a layer of feature comparison units, which compute, e.g., the distance between the two stimuli along each abstract feature dimension. This is followed by a stimulus similarity output unit, which produces a simple function of the abstract feature comparisons best predicting human judgments of stimulus similarity. The system is trained by presenting it with pairs of stimuli at the input and a human-provided teaching signal at the output and adjusting the weights in the network by gradient descent until the network's actual output for training pairs is close to the human-provided teaching signal. The authors demonstrate the system's successful feature abstraction on several small data sets such as kinship relationships (e.g. how similar is the term “brother” to “nephew”?) and Morse code data (e.g. how similar is the Morse code for “E” to the Morse code for “8”?).
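The layered structure described above can be sketched as a forward pass. The sketch rests on several assumptions not taken from the cited work: the weight values are hypothetical, the comparison is an absolute difference, and the output unit is taken to be an exponential decay of the weighted distances. Training by gradient descent is omitted.

```python
import math

def abstract_features(weights, x):
    """Feature abstraction layer: weighted combinations of the input features."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def predicted_similarity(weights, out_weights, x1, x2):
    """Forward pass for one stimulus pair; returns a value in (0, 1]."""
    # The same abstraction weights are applied to both stimuli.
    f1 = abstract_features(weights, x1)
    f2 = abstract_features(weights, x2)
    # Feature comparison layer: distance along each abstract dimension.
    distances = [abs(a - b) for a, b in zip(f1, f2)]
    # Output unit (assumed form): similarity decays with weighted distance.
    return math.exp(-sum(v * d for v, d in zip(out_weights, distances)))

# Hypothetical parameters: 3 input features mapped to 2 abstract features.
weights = [[0.5, -0.2, 0.1],
           [0.3, 0.4, -0.6]]
out_weights = [1.0, 0.5]

stim_a = [1.0, 0.0, 2.0]
stim_b = [0.0, 1.0, 2.0]
sim = predicted_similarity(weights, out_weights, stim_a, stim_b)
```

Training would compare `sim` with the human-provided rating for the pair and adjust `weights` and `out_weights` to reduce the difference.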
The second attempt is described in “Predicting Similarity Ratings to Faces using Physical Descriptions,” Steyvers and Busey, in Computational, Geometric, and Process Perspectives on Facial Cognition: Contexts and Challenges, M. Wenger and J. Townsend (eds.), Lawrence Erlbaum Associates (2000). Steyvers and Busey extend Todd and Rumelhart's metric similarity model to incorporate a nonmetric concept of similarity. Nonmetric approaches assume that the ratios between similarity and distance judgments are unimportant, and that only the monotonic relationships between similarity judgments matter. That is, if a human observer says “sim(A,B)=0.5 and sim(C,D)=0.6”, all the system needs to know is that sim(A,B)<sim(C,D). Steyvers and Busey's system (similar to Todd and Rumelhart's but incorporating the nonmetric assumption) is trained on human judgments of similarity on all possible pairs of 100 faces of bald males. The model's inputs are, in this case, physical measurements of facial features (e.g., distance between the eyes).
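The nonmetric assumption can be illustrated by scoring a model only on whether it reproduces the ordering of human ratings. The function name `order_agreement` and the rating values are hypothetical, and this pairwise count is an illustrative measure rather than the objective used by Steyvers and Busey.

```python
from itertools import combinations

def order_agreement(human, predicted):
    """Fraction of rating pairs whose relative order the model reproduces."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(human)), 2):
        if human[i] == human[j]:
            continue  # tied human ratings carry no ordering information
        total += 1
        if (human[i] - human[j]) * (predicted[i] - predicted[j]) > 0:
            agree += 1
    return agree / total if total else 1.0

# Hypothetical human ratings for three stimulus pairs.
human = [0.5, 0.6, 0.2]

same_order = order_agreement(human, [10, 40, 1])        # different scale, same order
transformed = order_agreement(human, [h ** 3 for h in human])  # monotonic transform
scrambled = order_agreement(human, [0.9, 0.1, 0.5])     # order mostly violated
```

Because only the ordering is compared, rescaling or monotonically transforming the ratings leaves the score unchanged.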
Document similarity and other kinds of data object similarity pose problems not faced in the work of Todd and Rumelhart or of Steyvers and Busey. Examples of such problems include:
First, the set of possible data objects to be compared is much larger (tens of thousands or millions of documents versus dozens of faces, kinship relationships, and Morse code elements). One implication is that, although humans can decide which objects are similar to which other objects in a small dataset, having humans make the ratings for large numbers of data objects is impossible.
Second, the number of input features for documents is potentially enormous (the term vector representation of a document typically contains tens of thousands of elements).
Third, both of the methods require that data objects be labeled prior to analysis by the system. Thus, they require early human intervention for labeling.
Combined, these factors conspire to make the task of exhaustive human similarity rating for pairs of stimuli impossible.
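The scale of the rating task follows from simple counting: n objects yield n(n-1)/2 distinct pairs. The short sketch below computes that count for 100 objects, as in the face study discussed above, and for one million documents, an assumed figure within the range mentioned above. The function name `pair_count` is hypothetical.

```python
import math

def pair_count(n):
    """Number of distinct unordered pairs among n objects: n * (n - 1) / 2."""
    return math.comb(n, 2)

faces = pair_count(100)            # 4950 pairs
documents = pair_count(1_000_000)  # 499999500000 pairs
```

Rating a few thousand pairs is feasible for human judges, whereas hundreds of billions of pairs is not.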