Many search engines allow users to search for information within a corpus. A corpus may be a collection of objects, such as documents, that have various features, such as the terms or words of a document. To search for objects of interest, a user submits to the search engine a search request (also referred to as a “query”) that includes search terms. The search engine identifies those objects within its collection that may be related to those search terms and then provides the identification of those objects to the user as the search result. The quality of a search engine depends, in large part, on the effectiveness of the search engine in identifying objects that are related to the search terms.
Search engines and many other computer applications, such as text categorization and document clustering tools, rely on a similarity metric to indicate the similarity between two items, such as documents. For example, a search engine may allow a user to select a document from a search result and to request similar documents. As another example, a search engine, when conducting a search, may want to identify terms that are similar to (or synonyms of) the search terms provided by the user. For instance, when the search request includes the word “building,” the search engine may want to search based on the additional terms “structure” and “construction.” When selecting a category for a document, a document categorization tool may calculate the similarity between that document and the documents in each category and select the category that has the most similar documents. When clustering documents, a document clustering tool may calculate the similarity between each pair of documents and identify clusters based on the similarities.
Many of these applications calculate the similarity between objects using a cosine similarity metric. To calculate cosine similarity, each object is represented by a feature vector of features derived from the object. For example, a document may be represented by a feature vector indicating the keywords that the document contains. A feature vector may have a dimension for each possible feature. For example, if there are 50 predefined keywords, then the feature vector has 50 dimensions (although many documents may contain only a small fraction of the keywords). Cosine similarity measures the cosine of the angle in multi-dimensional space between two feature vectors. The smaller the angle (and thus the larger the cosine), the more similar the objects are assumed to be.
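As a minimal sketch of the calculation described above, the following Python builds keyword-count feature vectors and computes the cosine of the angle between them. The three-keyword vocabulary, the sample documents, and the helper names `keyword_vector` and `cosine_similarity` are illustrative assumptions and do not come from any particular search engine.

```python
import math

def keyword_vector(text, keywords):
    """Count how often each predefined keyword occurs in the text."""
    words = text.lower().split()
    return [words.count(keyword) for keyword in keywords]

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical vocabulary: one dimension per predefined keyword.
KEYWORDS = ["search", "engine", "document"]

doc_a = keyword_vector("search engine ranks each document", KEYWORDS)
doc_b = keyword_vector("a search engine indexes a document", KEYWORDS)
doc_c = keyword_vector("the engine of the car", KEYWORDS)

score_ab = cosine_similarity(doc_a, doc_b)  # same keywords, small angle
score_ac = cosine_similarity(doc_a, doc_c)  # one shared keyword, larger angle
```

In this sketch, `doc_a` and `doc_b` contain the same keywords and score close to 1, while `doc_c` shares only one keyword with `doc_a` and scores lower.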
Cosine similarity assumes that the multi-dimensional space is orthogonal, that is, that each feature is independent of the others. In many practical applications, however, the features are not independent but interrelated. For example, when the features are keywords of documents, one keyword may have substantially the same meaning as another keyword (i.e., synonymy), or one keyword may have many different meanings depending on its context (i.e., polysemy). Thus, the multi-dimensional feature space is non-orthogonal in many instances. Because of the assumed independence of features, cosine similarity may not accurately reflect the similarity between documents that use different, but synonymous, terms, or between documents that use the same term, but with different meanings.
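The effect of the independence assumption can be illustrated with a small sketch. The keyword list and the hand-built vectors below are hypothetical; they only show how synonymy can drive the score to zero and how polysemy can drive it to one.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical keyword dimensions: ["building", "structure", "bank"]

# Synonymy: two documents about the same thing, using different keywords.
doc_building = [1, 0, 0]   # mentions "building"
doc_structure = [0, 1, 0]  # mentions "structure"

# Polysemy: two documents about different things, using the same keyword.
doc_river_bank = [0, 0, 1]    # "bank" of a river
doc_savings_bank = [0, 0, 1]  # "bank" as a financial institution

synonym_score = cosine_similarity(doc_building, doc_structure)        # 0.0
polysemy_score = cosine_similarity(doc_river_bank, doc_savings_bank)  # 1.0
```

Because each keyword is treated as its own orthogonal axis, the synonymous documents appear unrelated and the unrelated documents appear identical.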
Various algorithms, such as Latent Semantic Indexing (“LSI”), have attempted to address the non-orthogonality problem by projecting the feature vectors into an orthogonal space. LSI attempts to identify the conceptual content of documents using a technique known as singular value decomposition, which results in an orthogonal space for the concepts. LSI then applies cosine similarity to the feature vectors of the identified concepts. Although LSI can produce acceptable similarity scores, LSI is computationally expensive and thus infeasible for large collections of objects.
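A rough sketch of the LSI idea, assuming NumPy is available, is shown below. The term-document matrix, the choice of two concepts (`k=2`), and the helper name `lsi_document_vectors` are assumptions made for illustration; a production LSI system would typically also weight terms and operate on far larger matrices.

```python
import numpy as np

def cosine_similarity(u, v):
    """Return the cosine of the angle between two NumPy vectors."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm > 0 else 0.0

def lsi_document_vectors(term_doc, k):
    """Project documents into a k-dimensional concept space via truncated SVD."""
    _, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the k largest singular values; return one row per document.
    return (s[:k, None] * vt[:k]).T

# Hypothetical term-document counts.
# Rows (terms): building, structure, recipe, flour
# Columns (documents): d0, d1, d2, d3
term_doc = np.array([
    [1, 0, 1, 0],   # "building" appears in d0 and d2
    [0, 1, 1, 0],   # "structure" appears in d1 and d2
    [0, 0, 0, 1],   # "recipe" appears in d3
    [0, 0, 0, 1],   # "flour" appears in d3
], dtype=float)

# In the raw keyword space, d0 and d1 share no terms.
raw_score = cosine_similarity(term_doc[:, 0], term_doc[:, 1])

# In the concept space, the co-occurrence in d2 links the two terms.
concept_docs = lsi_document_vectors(term_doc, k=2)
lsi_score = cosine_similarity(concept_docs[0], concept_docs[1])
unrelated_score = cosine_similarity(concept_docs[0], concept_docs[3])
```

In this toy matrix, `d0` and `d1` have a raw cosine similarity of zero but become nearly identical in the concept space, while `d3` remains unrelated. The singular value decomposition is the step whose cost grows with the size of the collection.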
It would be desirable to have an algorithm for measuring the similarity between objects with a high degree of accuracy, such as that of LSI, but with a low computational expense, such as that of cosine similarity.