Electronic mail (e-mail) is one of the most commonly used applications for distributed computer networks. E-mail refers to the transmission of messages, which may include further messages and/or files as attachments, by computer from one person to another. E-mail provides better connectivity and fast communication between network users. If a person is either unavailable or unwilling to pick up a message immediately, the message is stored until that person can review the stored message at a later time. E-mail messages also provide a quick and easy way to package information such as sales reports, graphics, and other data for transfer to another user by simply attaching the information to the message. Business users increasingly rely on e-mail messages to share ideas, transmit documents, schedule meetings, and perform a multitude of other everyday tasks.
Due to technological improvements in the past decade, there has been a dramatic increase in the number of unwanted, and potentially destructive, e-mails (referred to as “spam”) received by users throughout the world. Spam is commonly unsolicited and blindly directed to individuals with a commercial or malicious intent. It is estimated that over 50% of all e-mail traffic is spam, which causes excessive congestion in the e-mail network and forces Internet Service Providers (ISPs) and other organizations to spend resources processing and managing unwanted traffic. Worse, some spammers use e-mail as a preferred modality for distributing computer worms and viruses, or use message recipients as relay hosts to send spam to destinations maintained by the message recipient. Spam frequently obfuscates or spoofs header information to hide the e-mail source. To combat this electronic epidemic, laws, guidelines, and technological safeguards have been implemented in the past few years to reduce the number of unsolicited e-mails received by computer users.
Spam filters are one safeguard that identifies spam. Spam filters typically apply fixed, trainable, or user configurable rule sets, such as heuristic content analysis, to identify e-mail having at least a selected probability of being spam. Spam filters often perform the analysis on e-mail before the messages are passed from the gateway to the e-mail servers of the enterprise network.
Spam filters can apply one or more layers of analysis.
A first analytical layer looks for messages that originate from invalid computer domains. This type of origination information indicates that the senders are not legitimate or the address has been forged. E-mails that fit this category are rejected at the server level.
A second analytical layer compares the sender's address against a list of known spammers, such as a “Real-time Blackhole List” or RBL. E-mail messages from known spammers are thereby rejected.
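The first two analytical layers can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular filter: the function names, the `domain_resolves` callback (a stand-in for a real DNS lookup), and the blacklist contents are all hypothetical.

```python
from typing import Callable, Set


def sender_domain(address: str) -> str:
    """Return the lower-cased domain of an e-mail address, or '' if malformed."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return ""
    return domain.strip().lower()


def check_sender(
    address: str,
    domain_resolves: Callable[[str], bool],  # hypothetical: True if the domain exists
    blacklist: Set[str],                     # hypothetical: known spammer addresses/domains
) -> str:
    """Apply layer one (invalid domain), then layer two (blacklist)."""
    domain = sender_domain(address)

    # Layer one: a missing or unresolvable domain suggests a forged or
    # illegitimate sender, so the message is rejected at the server level.
    if not domain or not domain_resolves(domain):
        return "reject: invalid domain"

    # Layer two: reject senders that appear on the list of known spammers.
    if address.lower() in blacklist or domain in blacklist:
        return "reject: blacklisted"

    return "pass"
```

In this sketch a message that returns "pass" would proceed to content analysis; a production filter would replace `domain_resolves` with an actual DNS query.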
A third analytical layer scans the headers and bodies of e-mails for one or more keywords, attachments, links, formats, and/or source addresses indicative of spam. For example, e-mail address mismatches (when the “from” address does not match the domain address of the server that sent the message), use of random, typically all upper case, characters or other key words in the subject line or message body, or inclusion of one or more forwards, one or more “opt-out” links, one or more “click here” links, a graphical image or active HTML script, or a re-direct can all be indicative of spam. Based on the identified spam indicia (or rule matches/violations), ratings or scores are assigned to the e-mail. The scores represent a probability that the e-mail is or is not spam. E-mails having scores indicative of spam are quarantined in a spam folder, flagged, highlighted or otherwise indicated as being likely spam (e.g., by modifying the subject line of the e-mail), and/or automatically rejected or deleted. Whitelists, or lists of sources from whom the user desires to receive e-mail, can also be applied to circumvent e-mail analysis for e-mails from wanted sources.
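The rule-based scoring of the third layer can be illustrated with a short sketch. The rule names, weights, and threshold below are assumptions chosen for illustration only, and the additive score is a simple stand-in for the probability-like score described above.

```python
import re
from typing import FrozenSet, List, Tuple

# Hypothetical rules: (name, weight, predicate over subject and body).
RULES = [
    ("all_caps_subject", 2.0, lambda s, b: len(s) >= 8 and s.isupper()),
    ("click_here_link", 1.5, lambda s, b: "click here" in b.lower()),
    ("opt_out_link", 1.0,
     lambda s, b: re.search(r"opt[- ]?out|unsubscribe", b, re.I) is not None),
    ("html_script", 2.5, lambda s, b: "<script" in b.lower()),
]

SPAM_THRESHOLD = 3.0  # hypothetical cut-off; real filters tune this value


def score_message(
    sender: str, subject: str, body: str,
    whitelist: FrozenSet[str] = frozenset(),
) -> Tuple[float, List[str]]:
    """Return the total score and the names of the rules that matched."""
    # Whitelisted senders bypass content analysis entirely.
    if sender.lower() in whitelist:
        return 0.0, []
    matched = [(name, w) for name, w, pred in RULES if pred(subject, body)]
    return sum(w for _, w in matched), [name for name, _ in matched]


def classify(
    sender: str, subject: str, body: str,
    whitelist: FrozenSet[str] = frozenset(),
) -> str:
    """Label a message 'spam' when its score reaches the threshold."""
    score, _ = score_message(sender, subject, body, whitelist)
    return "spam" if score >= SPAM_THRESHOLD else "not spam"
```

A message labeled "spam" here would then be quarantined, flagged, or rejected according to the filter's policy.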
Some spam filters use a Bayesian content filter applying advanced statistical techniques to provide greater spam detection accuracy. Bayesian filters can be trained by each user simply by categorizing each received e-mail as either spam or non-spam. After the user has categorized a few e-mails, the filter can begin to make this categorization by itself and usually with a very high level of accuracy. If the filter makes a mistake, the user re-categorizes the e-mail, and the filter learns from the re-categorization.
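The training loop described above can be sketched with a small naive Bayes classifier. This is one common way to build a Bayesian content filter, offered as an assumption rather than the method of any specific product; the class name, method names, tokenizer, and smoothing choices are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


class NaiveBayesFilter:
    """Toy word-count naive Bayes spam filter (illustrative only)."""

    def __init__(self) -> None:
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text: str) -> List[str]:
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text: str, label: str) -> None:
        """Record a message the user categorized as 'spam' or 'ham'."""
        self.word_counts[label].update(self.tokenize(text))
        self.message_counts[label] += 1

    def recategorize(self, text: str, old_label: str, new_label: str) -> None:
        """Undo a wrong categorization and learn the corrected one."""
        self.word_counts[old_label].subtract(self.tokenize(text))
        # Unary plus drops words whose count fell to zero or below.
        self.word_counts[old_label] = +self.word_counts[old_label]
        self.message_counts[old_label] -= 1
        self.train(text, new_label)

    def spam_probability(self, text: str) -> float:
        """Estimate P(spam | text) using add-one smoothing."""
        total_msgs = sum(self.message_counts.values())
        if total_msgs == 0:
            return 0.5  # no training data yet, so no opinion
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        tokens = self.tokenize(text)
        log_scores = {}
        for label in ("spam", "ham"):
            # Smoothed class prior, then one smoothed likelihood per word.
            log_p = math.log((self.message_counts[label] + 1) / (total_msgs + 2))
            total_words = sum(self.word_counts[label].values())
            denom = total_words + len(vocab) + 1  # +1 bucket for unseen words
            for word in tokens:
                log_p += math.log((self.word_counts[label][word] + 1) / denom)
            log_scores[label] = log_p
        # Normalize in log space to avoid underflow on long messages.
        top = max(log_scores.values())
        exp = {k: math.exp(v - top) for k, v in log_scores.items()}
        return exp["spam"] / (exp["spam"] + exp["ham"])
```

Each call to `train` corresponds to the user categorizing a received e-mail, and `recategorize` corresponds to the user correcting a mistake made by the filter.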
Despite these sophisticated spam detection methodologies, there are still relatively high rates of false positives (in which an e-mail is improperly identified as spam) and false negatives (in which an e-mail is improperly identified as non-spam). An example of where existing spam filters fail is the registration process at newly accessed websites. Typically, a new website will require a user to provide an e-mail address at registration for later authentication or validation of the user's identity. Once the address is provided, a short e-mail containing a password or personalized URL is typically sent to the user. However, these short e-mails are frequently misidentified as spam by existing spam filters, yielding false positives. The e-mails tend to be too short for reliable analysis of the words and come from a domain with which the user has had no prior association.