The invention relates to methods and systems for classifying electronic documents, and in particular to systems and methods for filtering unsolicited electronic communications (spam) and detecting fraudulent online documents.
Unsolicited electronic communications, also known as spam, form a significant portion of communication traffic worldwide, affecting both computer and telephone messaging services. Spam may take many forms, from unsolicited email communications to spam messages masquerading as user comments on various Internet sites such as blogs and social network sites. Spam takes up valuable hardware resources, affects productivity, and is considered annoying and intrusive by many users of communication services and/or the Internet.
Online fraud, especially in the form of phishing and identity theft, has been posing an increasing threat to Internet users worldwide. Sensitive identity information such as user names, IDs, passwords, social security and medical records, and bank and credit card details, obtained fraudulently by international criminal networks operating on the Internet, is used to withdraw private funds and/or is further sold to third parties. Besides direct financial damage to individuals, online fraud also causes a range of unwanted side effects, such as increased security costs for companies, higher retail prices and banking fees, declining stock values, lower wages and decreased tax revenue.
In an exemplary phishing attempt, a fake website (also termed a clone) may pose as a genuine webpage belonging to an online retailer or a financial institution, asking the user to enter some personal information, such as a username or password, or some financial information, e.g. credit card number, account number, or security code. Once the information is submitted by the unsuspecting user, it may be harvested by the fake website. Additionally, the user may be directed to another webpage, which may install malicious software on the user's computer. The malicious software (e.g., viruses, Trojans) may continue to steal personal information by recording the keys pressed by the user while visiting certain webpages, and may transform the user's computer into a platform for launching other phishing or spam attacks.
In the case of email spam or email fraud, software running on a user's or email service provider's computer system may be used to classify email messages as spam/non-spam (or as fraudulent/legitimate) and even to discriminate between various kinds of messages, for instance, between product offers, adult content, and Nigerian fraud. Spam/fraudulent messages can then be directed to special folders or deleted. Similarly, software running on a content provider's computer systems may be used to intercept spam/fraudulent messages posted to a website hosted by the respective content provider, and to prevent the respective messages from being displayed, or to display a warning to the users of the website that the respective messages may be fraudulent or spam.
Several approaches have been proposed for identifying spam and/or online fraud, including matching a message's originating address to lists of known offending or trusted addresses (techniques termed black- and white-listing, respectively), searching for certain words or word patterns (e.g. refinancing, Viagra®, stock), and analyzing message headers. Feature extraction/matching methods are sometimes used in conjunction with automated data classification methods (e.g., Bayesian filtering, neural networks).
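By way of illustration only, the address-list and word-pattern approaches mentioned above may be sketched as follows. The function names, the example addresses, and the pattern list are hypothetical and are not taken from any particular prior-art system; the sketch assumes that sender addresses are compared case-insensitively and that a whitelist entry takes precedence over a blacklist entry.

```python
import re

def classify_sender(sender, blacklist, whitelist):
    """Return 'trusted', 'blocked', or 'unknown' for a sender address.

    Assumption: list entries are stored in lowercase, and the whitelist
    is consulted before the blacklist.
    """
    address = sender.strip().lower()
    if address in whitelist:
        return "trusted"
    if address in blacklist:
        return "blocked"
    return "unknown"

# Hypothetical word patterns of the kind a keyword filter might search for.
SUSPECT_PATTERNS = [
    re.compile(r"\brefinanc\w*", re.IGNORECASE),
    re.compile(r"\bstock\b", re.IGNORECASE),
]

def matches_suspect_pattern(text, patterns=SUSPECT_PATTERNS):
    """Return True if any suspect pattern occurs in the message text."""
    return any(p.search(text) for p in patterns)

if __name__ == "__main__":
    blacklist = {"offers@spam.example"}
    whitelist = {"friend@example.org"}
    print(classify_sender("Friend@Example.org", blacklist, whitelist))
    print(matches_suspect_pattern("Refinancing offers inside"))
    # A single substituted character is enough to evade the literal pattern.
    print(matches_suspect_pattern("Ref1nancing offers inside"))
```

In this sketch, a message containing "Ref1nancing" is not matched by the pattern that catches "Refinancing", which hints at how easily literal word matching may be evaded.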
Some proposed methods employ hashing to produce compact representations of electronic text messages. Such representations allow for efficient inter-message comparison, for spam or fraud detection purposes.
Spammers and online fraudsters attempt to circumvent detection by using various obfuscation methods, such as misspelling certain words, embedding spam and/or fraudulent content into larger blocks of text masquerading as legitimate documents, and altering the form and/or content of messages from one distribution wave to another. Anti-spam and anti-fraud methods employing hashing are typically vulnerable to such obfuscation, since small changes in text may produce substantially different hashes. Successful detection may therefore benefit from methods and systems capable of recognizing polymorphic spam and fraud.
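The sensitivity of hashes to small textual changes may be illustrated with a minimal sketch. It assumes, purely for illustration, that a conventional cryptographic hash (SHA-256, from Python's standard hashlib module) is computed over the whole message text; the hashing schemes used by actual anti-spam systems may differ. The helper names and the sample messages are hypothetical.

```python
import hashlib

def message_digest(text):
    """Return the SHA-256 digest of a message as a 64-character hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def differing_positions(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

if __name__ == "__main__":
    original = "Refinance your mortgage today at low rates"
    # Same message with one character substituted, as an obfuscator might do.
    obfuscated = "Ref1nance your mortgage today at low rates"

    h1 = message_digest(original)
    h2 = message_digest(obfuscated)
    print(h1)
    print(h2)
    print("hex positions that differ:", differing_positions(h1, h2), "of", len(h1))
```

Although the two sample messages differ in a single character, their digests are unrelated, so a filter that compares such digests for equality would treat the obfuscated message as entirely new.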