A. Field of the Invention
The present invention relates generally to similarity estimation, and more particularly, to calculating similarity metrics for objects such as web pages.
B. Description of Related Art
The World Wide Web (“web”) contains a vast amount of information. Locating a desired portion of the information, however, can be challenging. Search engines catalog web pages to assist web users in locating the information they desire. Typically, in response to a user's request, the search engine returns references to documents relevant to the request.
From the search engine's perspective, one problem in cataloging the large number of available web pages is that many web documents are identical or nearly identical to one another. Separately cataloging similar documents is inefficient and can be frustrating for the user if a list of nearly identical documents is returned in response to a request. Accordingly, it is desirable for the search engine to identify documents that are similar or “roughly the same” so that this type of redundancy in search results can be avoided.
In addition to improving web search results, the identification of similar documents can be beneficial in other areas. For example, storage space may be reduced by storing only one version of a set of similar documents. Alternatively, a collection of documents can be grouped together based on document similarities, thereby improving efficiency when compressing the collection.
One conventional technique for determining similarity is based on the concept of sets. A document, for example, may be represented as a subset of words from a corpus of possible words. The similarity, or resemblance, of two documents to one another is then defined as the size of the intersection of the two sets divided by the size of the union of the two sets. One problem with this set-based similarity measure is that it offers limited flexibility in weighting the importance of the elements within a set: a word is either in a set or it is not. In practice, however, it may be desirable to weight certain words, such as words that occur relatively infrequently in the corpus, more heavily when determining the similarity of documents.
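By way of illustration only, the conventional set-based measure described above might be sketched as follows. The function name `resemblance`, the lowercase whitespace tokenization, and the treatment of two empty documents as identical are assumptions made for this sketch and are not drawn from any particular prior system.

```python
def resemblance(doc_a: str, doc_b: str) -> float:
    """Set-based resemblance: |A intersect B| / |A union B| over word sets."""
    # Assumption: words are obtained by lowercasing and splitting on whitespace.
    words_a = set(doc_a.lower().split())
    words_b = set(doc_b.lower().split())

    union = words_a | words_b
    if not union:
        # Assumption: two empty documents are treated as identical.
        return 1.0
    return len(words_a & words_b) / len(union)


if __name__ == "__main__":
    # Three shared words (the, quick, fox) out of five distinct words.
    print(resemblance("the quick brown fox", "the quick red fox"))

    # Repetition carries no weight: a word is either in the set or not.
    print(resemblance("fox fox fox", "fox"))
```

In this sketch, the second call yields the same result as comparing a document with itself, because repeating a word does not change the set. This reflects the limitation noted above: set membership alone provides no means of giving some words more weight than others.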
Accordingly, there is a need in the art for improved techniques for determining similarity between documents.