The advent of global communications networks such as the Internet has presented commercial opportunities for reaching vast numbers of potential customers. Electronic messaging, and particularly electronic mail (“email”), is becoming increasingly pervasive as a means for disseminating unwanted advertisements and promotions (also denoted as “spam”) to network users.
The Radicati Group, Inc., a consulting and market research firm, estimates that as of August 2002, two billion junk email messages are sent each day; this number is expected to triple every two years. Individuals and entities (e.g., businesses, government agencies) are becoming increasingly inconvenienced and oftentimes offended by junk messages. As such, spam is now or soon will become a major threat to trustworthy computing.
A key technique utilized to thwart spam is the employment of filtering systems/methodologies. One common filtering technique is based upon a hashing approach. Hashing in the email filtering domain refers to the process of screening messages by comparing them to a database of known spam. Any message that matches a message from the database is considered spam and is moved to a junk folder or deleted. Hashing requires the database of known spam to be updated frequently by reporting mechanisms such as user complaints (e.g., “this is junk” reporting), honeypots (e.g., accounts set up to attract spam), and related user complaint methods.
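The screening step described above can be illustrated with a minimal sketch. It assumes an exact-match scheme built on SHA-256 with light whitespace and case normalization; the names fingerprint and SpamDatabase are hypothetical, and real hashing filters may use different, fuzzier fingerprints.

```python
import hashlib


def fingerprint(message: str) -> str:
    """Return a hex digest for a message.

    Lowercasing and collapsing whitespace is an illustrative assumption,
    not something the text prescribes.
    """
    normalized = " ".join(message.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SpamDatabase:
    """Hypothetical in-memory store of fingerprints of known spam."""

    def __init__(self) -> None:
        self._known: set[str] = set()

    def report(self, message: str) -> None:
        # Called by a reporting mechanism (user complaint, honeypot, ...).
        self._known.add(fingerprint(message))

    def is_spam(self, message: str) -> bool:
        # A message is treated as spam if its fingerprint is already known.
        return fingerprint(message) in self._known


db = SpamDatabase()
db.report("Buy CHEAP pills now")
verdict_same = db.is_spam("buy cheap   pills NOW")   # same text after normalization
verdict_other = db.is_spam("Meeting moved to noon")  # unrelated message
```

In this sketch a single report is enough to flag a message, which is the simplest form of the approach.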
Unfortunately, these reporting mechanisms have several flaws. First, messages that are actually good may end up getting reported due to user error, or when large senders do not appropriately debounce their lists: a user subscribes to a bulk mailing from a large sender; the user's account is later deactivated, perhaps because the user changes ISPs; the original ISP randomly selects the now-deactivated account to use as a honeypot; and all future correspondence from the large sender to this account ends up in a database of spam. Second, some messages are considered good by some users but spam by others (e.g., opt-in commercial mailings or newsletters that some users forget they signed up for and thus report as junk). A related problem is that hashing algorithms are not perfect, and good messages sometimes match spam in the database simply by accident.
For all of these reasons, hashing systems usually require that an email match some minimum number of messages in the database before considering it spam (e.g., they might require that there be 10 matching messages in the database before they move the message to a junk folder, and 100 before they delete the message). Unfortunately, this method is still error-prone, because it cannot distinguish between a spammer who sends 1,000 messages and has a 10% complaint rate (100 messages in the spam database) and a legitimate commercial mailer who sends 100,000 messages and has a 0.1% complaint rate (100 messages in the spam database).
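The threshold rule and its blind spot can be worked through in a short sketch. The thresholds of 10 and 100 and the two senders come from the example above; the function names and the use of complaints per thousand messages (to keep the arithmetic in integers) are assumptions made for illustration.

```python
JUNK_THRESHOLD = 10     # matches needed before moving a message to the junk folder
DELETE_THRESHOLD = 100  # matches needed before deleting a message


def action_for(match_count: int) -> str:
    """Map the number of matching database entries to a filter action."""
    if match_count >= DELETE_THRESHOLD:
        return "delete"
    if match_count >= JUNK_THRESHOLD:
        return "junk"
    return "deliver"


def expected_reports(messages_sent: int, complaints_per_thousand: int) -> int:
    """Number of copies that end up in the spam database."""
    return messages_sent * complaints_per_thousand // 1000


# Spammer: 1,000 messages at a 10% complaint rate (100 per thousand).
spammer_reports = expected_reports(1_000, 100)
# Legitimate mailer: 100,000 messages at a 0.1% complaint rate (1 per thousand).
mailer_reports = expected_reports(100_000, 1)

spammer_action = action_for(spammer_reports)
mailer_action = action_for(mailer_reports)
```

Both senders produce 100 database entries, so a rule that looks only at the match count assigns them the same action even though their complaint rates differ by a factor of 100.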
Furthermore, spammers can use techniques to change almost any aspect of their messages, and even relatively modest changes to a message can cause it to not match any of the spam in the database. For instance, a menu attack constructs a message by randomly choosing words (or phrases or sentences) from a series of lists of words (or phrases or sentences) with equivalent meaning so that each message is unique. Other methods for avoiding hashing algorithms include: misspelling words, encoding them with HTML character encodings, inserting garbage characters into them (e.g., a hyphen or an underscore), adding random words or sentences (chaff) to the message, breaking words with HTML comments, etc.
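A small sketch shows why a menu attack defeats exact-match hashing. The word lists are invented for illustration, SHA-256 stands in for whatever hash a filter might use, and the variants are enumerated rather than chosen at random so that the result is deterministic.

```python
import hashlib
from itertools import product

# Hypothetical menu: each inner list holds interchangeable wordings.
MENU = [
    ["Buy", "Purchase", "Order"],
    ["cheap", "discount", "low-cost"],
    ["watches", "timepieces"],
    ["now", "today"],
]


def menu_variants(menu: list[list[str]]) -> list[str]:
    """Build every message obtainable by picking one entry from each list."""
    return [" ".join(choice) for choice in product(*menu)]


def fingerprint(message: str) -> str:
    """Exact-match fingerprint of a message."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()


variants = menu_variants(MENU)

# Suppose only the first variant has been reported to the spam database.
known_spam = {fingerprint(variants[0])}

# Every other variant carries the same meaning but a different fingerprint.
evading = [v for v in variants[1:] if fingerprint(v) not in known_spam]
```

With only four short lists the menu already yields 3 × 3 × 2 × 2 = 36 distinct messages, 35 of which miss the single database entry; longer menus grow this count multiplicatively.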
As can be seen, many spammers continue to find ways to disguise their identities to avoid and/or bypass spam filters despite the onslaught of such spam filtering techniques.