The advent of global communications networks such as the Internet has presented commercial opportunities for reaching vast numbers of potential customers. Electronic messaging, and particularly electronic mail (“email”), is becoming increasingly pervasive as a means for disseminating unwanted advertisements and promotions (also denoted as “spam”) to network users.
The Radicati Group, Inc., a consulting and market research firm, estimates that as of August 2002, two billion junk email messages are sent each day, and this number is expected to triple every two years. Individuals and entities (e.g., businesses, government agencies) are becoming increasingly inconvenienced, and oftentimes offended, by junk messages. As such, spam is now, or soon will become, a major threat to trustworthy computing.
A common technique for thwarting spam is the use of filtering systems and methodologies. One proven filtering technique is based upon a machine learning approach, in which a filter assigns to an incoming message a probability that the message is spam. In this approach, features typically are extracted from two classes of example messages (e.g., spam and non-spam messages), and a learning filter is applied to discriminate probabilistically between the two classes. Since many message features are related to content (e.g., whole words and phrases in the subject and/or body of the message), such filters are commonly referred to as “content-based filters”. These machine learning filters usually employ exact match techniques to detect spam messages and distinguish them from good messages.
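By way of illustration only, the content-based approach described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the filter of any particular system: it assumes a naive Bayes-style learner with add-one smoothing, whole-word features matched exactly, and the hypothetical names `tokenize`, `NaiveBayesSpamFilter`, `train`, and `spam_probability`, none of which appear in the text above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Extract whole-word features; a feature matches only if spelled identically."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class NaiveBayesSpamFilter:
    """Illustrative two-class learner (assumption: naive Bayes with add-one smoothing)."""

    def __init__(self):
        self.doc_counts = {"spam": 0, "good": 0}
        self.word_counts = {"spam": Counter(), "good": Counter()}

    def train(self, text, label):
        """Record the features of one example message labeled 'spam' or 'good'."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        """Return the estimated probability that the message is spam."""
        n_spam = self.doc_counts["spam"]
        n_good = self.doc_counts["good"]
        total = n_spam + n_good
        # Smoothed class priors, so an empty class never yields log(0).
        log_odds = math.log((n_spam + 1) / (total + 2))
        log_odds -= math.log((n_good + 1) / (total + 2))
        # Only words present in the message contribute evidence in this sketch.
        for word in tokenize(text):
            p_word_spam = (self.word_counts["spam"][word] + 1) / (n_spam + 2)
            p_word_good = (self.word_counts["good"][word] + 1) / (n_good + 2)
            log_odds += math.log(p_word_spam) - math.log(p_word_good)
        return 1.0 / (1.0 + math.exp(-log_odds))
```

In use, the filter would be trained on labeled example messages from both classes, after which `spam_probability` returns a value between 0 and 1 that can be compared against a chosen threshold. Because the features are exact whole words, a word contributes evidence only when it appears in the incoming message spelled exactly as it was in the training examples.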
Unfortunately, spammers can often fool conventional machine learning and/or content-based filters by modifying their spam messages to look like good mail, or by including a variety of erroneous characters throughout the message to avoid and/or confuse character recognition systems. Thus, such conventional filters provide limited protection against spam.
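The weakness of exact matching can be illustrated with a short sketch. The vocabulary `SPAM_VOCABULARY` and the helper names below are hypothetical, chosen only for this example; the sketch assumes a filter that splits a message into lowercase alphanumeric words and looks each word up verbatim.

```python
import re

# Hypothetical set of words a filter has learned to associate with spam.
SPAM_VOCABULARY = {"cheap", "pills", "free"}


def exact_match_features(text):
    """Split a message into lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matched_spam_words(text):
    """Return the learned spam words that appear verbatim in the message."""
    return exact_match_features(text) & SPAM_VOCABULARY


plain = "Cheap pills, free offer"
obfuscated = "Ch.eap p-ills, fr_ee offer"

plain_hits = matched_spam_words(plain)
obfuscated_hits = matched_spam_words(obfuscated)
```

Under these assumptions, the plain message matches all three learned words, while the obfuscated message matches none of them, because the inserted characters break each word into fragments the filter has never seen. A human reader still recognizes both messages as the same advertisement.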