A. Field of the Invention
The present invention relates generally to document processing and, more particularly, to comparing documents to find similar or near duplicate documents.
B. Description of Related Art
There are a number of applications in which it may be desirable to determine whether documents are similar to, or near duplicates of, one another. Detecting spam email is one such application. Spam is unsolicited commercial email that is transmitted to multiple email accounts. To the receiver, spam is generally considered to be “junk email.”
In a typical spam episode, a single message is sent to thousands of email accounts. One known technique for removing spam from a network identifies spam based on its content. For example, the network may be designed to recognize when many identical emails are being transmitted across the network. These identical emails can then be considered candidates for deletion before they arrive at the recipients' email accounts.
In an effort to thwart automated spam detection and deletion, spam senders may slightly alter the text of each spam email by adding, removing, or replacing characters or superfluous sentences so as to defeat duplicate matching schemes. Thus, altered spam messages may be highly similar, but not identical, to one another.
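By way of illustration only, the following minimal sketch shows how an identical-message matching scheme of the kind described above might operate, and how a small alteration to each message defeats it. The use of a SHA-256 digest as the message fingerprint, the function names `fingerprint` and `flag_exact_duplicates`, the `threshold` parameter, and the sample messages are all assumptions made for this example and are not taken from any particular known system.

```python
import hashlib
from collections import Counter


def fingerprint(message: str) -> str:
    """Return a digest that is equal only for byte-identical messages (assumed scheme)."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()


def flag_exact_duplicates(messages, threshold=3):
    """Flag each message whose exact content appears at least `threshold` times."""
    counts = Counter(fingerprint(m) for m in messages)
    return [counts[fingerprint(m)] >= threshold for m in messages]


# Hypothetical sample data: one message sent unchanged to four accounts.
original = "Buy cheap watches now! Visit our store today."
identical_batch = [original] * 4

# The same message with a few superfluous characters appended to each copy.
altered_batch = [f"{original} ref{i}" for i in range(4)]

identical_flags = flag_exact_duplicates(identical_batch)
altered_flags = flag_exact_duplicates(altered_batch)

print(identical_flags)  # every identical copy is flagged
print(altered_flags)    # no altered copy is flagged
```

In this sketch, every copy in the identical batch is flagged as a candidate for deletion, while none of the altered copies is flagged, even though the altered copies differ from one another by only a few characters.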
Other applications for which similar document detection may be useful include detection of plagiarism and duplicate document detection in search engines.
Thus, there is a need in the art for techniques that can more accurately detect similar or near duplicate documents.