Identifying related documents is a common technique on the Internet. It can help users discover new documents that are similar in design, type, or content to a document the user has already chosen. The user benefits from learning of the new documents, and the website benefits from increased traffic and/or increased sales. In some cases, documents are compared by their associated text. Cosine similarity, latent semantic analysis, probabilistic latent semantic analysis, and latent Dirichlet allocation are a few of the methods used to identify these related documents.
However, identifying related documents is a difficult process, and the existing methods leave room for improvement. For example, term frequency-inverse document frequency (tf-idf), the most commonly used algorithm for weighting word occurrences in methods to identify related documents, has at least two drawbacks. First, some rare words that carry little meaning, like “perspicacity” or “splendiferous,” are assumed to be informative. Second, some common words that may be meaningful, like “airplane” or “purple,” are assumed to be uninformative. As a result, misspellings or numbers containing many digits may be treated as good descriptors of the user's current document and used to compare it with other documents, while meaningful words may be ignored in the comparison. Certain issues, such as identifying important words or handling pairs of words, must be addressed if any of the above-mentioned methods are to be implemented.
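The two drawbacks can be illustrated with a minimal sketch. It assumes the common tf × log(N/df) form of the weighting, which is only one of several variants; the function name `tf_idf` and the four-document corpus are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: weight = (count / document length) * log(N / df).
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus: every document is about a purple airplane.
docs = [
    "the purple airplane landed",
    "a purple airplane took off",
    "purple airplane tickets cost 1234567890 dollars",
    "the airplane was purple and splendiferous",
]
weights = tf_idf(docs)

# Terms present in every document get a weight of zero,
# while a long number or a rare adjective gets a positive weight.
for term in ("airplane", "purple", "1234567890"):
    print(term, round(weights[2][term], 3))
print("splendiferous", round(weights[3]["splendiferous"], 3))
```

In this toy corpus, “airplane” and “purple” appear in every document, so their weights collapse to zero even though they describe the documents well, whereas the ten-digit number and “splendiferous” each appear once and receive positive weights.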
What is needed are techniques that address these issues so as to provide a more accurate way to identify related documents quickly.