This invention relates to determining the similarity between documents that include strings of characters. More particularly, this invention relates to determining whether two documents are similar.
Similarity between documents may be determined for various purposes and applications. For example, the similarity between documents may be determined to filter unwanted documents, to remove similar search entries in a search engine, or to search for similar files in a file system. Taking the example of filtering unwanted messages, the similarity between a reference document and messages may be determined in order to filter, or prevent the propagation of, messages similar to the reference document in a communication system. Unwanted messages include, for example, malicious messages. Malicious messages have evolved from a mere nuisance into instruments for committing fraud or other illegal activities. Such malicious messages include spam emails. This Internet “junk mail” takes up valuable memory space on servers and other computational resources while consuming recipients' time and/or resources for its removal. The majority of spam messages are commercial advertising, although chain letters, political mailings, and other forms of non-commercial mailings are also classified as junk mail. Other malicious messages may pose a more serious threat to the recipients. For example, emails or other forms of messages may be employed in illegal activities such as phishing and spoofing to extract sensitive information from the recipients.
Online messaging services such as social network services are especially vulnerable to unwanted messages because users place trust in messages sent by other users with whom they have a previous social relationship. To facilitate interactions between users, online messaging services often provide effective and convenient mechanisms for exchanging messages between users. Such mechanisms include, among others, instant messenger (IM) services, email services, blog services, posting and commenting on posts, and other communication mechanisms. These mechanisms may also function as a means of propagating malicious emails and messages to many users in a short amount of time.
Filtering of unwanted messages may be complicated by the fact that unwanted messages evolve over time as they propagate to other users. Each time a user accesses an unwanted message, the user may add comments to or revise the message before sending it to another recipient. Moreover, the sender of the unwanted messages may intentionally vary the messages transmitted to recipients to avoid detection and filtering. Such variations make it difficult to detect and filter unwanted messages.