1. Field
Embodiments of the present invention generally relate to information classification. In particular, embodiments of the present invention relate to integration of global intelligence regarding email messages and senders into the email delivery network to allow more accurate local spam identification to be performed.
2. Description of the Related Art
One of the problems arising with the proliferation of Internet and email usage, as well as other means of electronic communication, is the receipt of unwanted and unsolicited bulk messages, commonly known as “spam.” While similar to the problems associated with physical junk mail, the consequences can be much more severe. Spam can contain viruses or other software that disables or damages the receiver's computer or other electronic equipment. In addition, the volume of spam may represent a significant load on traffic handling mechanisms. For example, high volumes of email spam may negatively affect both client computer networks and the Internet itself. As a result, substantial efforts have been devoted to tracking and identifying spam in order to stop the problem at its source.
Examples of current anti-spam techniques include greylisting, greeting delays and checksum-based filtering. Empirical evidence suggests that a great deal of spam is sent from applications designed specifically for spamming. Such applications appear to adopt a “fire-and-forget” methodology in which they attempt to send the spam to a large number of email addresses, but never confirm that the spam is delivered or respond to failure indications by retrying as a standard-compliant email server would. This “fire-and-forget” approach is contrary to the behavior of well-behaved, Simple Mail Transfer Protocol (SMTP)-compliant mail transfer agents (MTAs). Such well-behaved MTAs attempt retries because SMTP is an unreliable transport and the handling of temporary failures is built into the core specification (i.e., RFC 821).
Because maintaining a retry strategy incurs an inherent cost, greylisting is based on the premise that spammers will not attempt to re-send their messages. Greylisting temporarily rejects messages from unknown sender mail servers. This temporary rejection is designated with a 4xx SMTP error code that is recognized by SMTP-compliant MTAs, which then proceed to retry delivery later. Consequently, the greylisting technique's delayed acceptance of unknown email is effective in dealing with non-SMTP-conforming senders that do not retry. When spammers retry, however, they look just like regular email senders and thereby circumvent the greylisting technique, as such retries will ultimately be delivered once the blocking expires. In general, the greylisting method is effective in dealing with non-SMTP-conforming senders that send only spam, but it is ineffective in dealing with an infected email sender that sends a mix of both spam and clean messages, or with dynamic Internet Protocol (IP) addresses that are constantly reassigned among spammers and regular users. The greylisting method is also ineffective in dealing with spam-sending applications that are made to be standard-compliant by, for example, retrying responsive to temporary rejection. In addition, delivery of email messages from new, legitimate, but non-standard-compliant servers is delayed or even dropped by the greylisting approach.
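By way of illustration only, the greylisting decision described above might be sketched as follows. The sketch is not taken from any particular product; the class name Greylist, the choice of a (client IP, envelope sender, envelope recipient) key, the 300-second minimum delay and the specific reply codes 451 and 250 are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical key: (client IP, envelope sender, envelope recipient).
Triplet = Tuple[str, str, str]


@dataclass
class Greylist:
    """Minimal in-memory greylist (illustrative only, no expiry or persistence)."""

    min_delay: float = 300.0  # assumed blocking period, in seconds
    first_seen: Dict[Triplet, float] = field(default_factory=dict)

    def check(self, client_ip: str, sender: str, recipient: str, now: float) -> int:
        """Return an SMTP reply code: 451 (temporary failure) or 250 (accept)."""
        key = (client_ip, sender, recipient)
        # Record the time of the first attempt; later attempts reuse it.
        seen = self.first_seen.setdefault(key, now)
        if now - seen >= self.min_delay:
            return 250  # the sender retried after the blocking period
        return 451  # unknown or too-early sender: ask it to retry later


if __name__ == "__main__":
    g = Greylist()
    print(g.check("192.0.2.1", "a@example.com", "b@example.org", now=0.0))
    print(g.check("192.0.2.1", "a@example.com", "b@example.org", now=600.0))
```

As the sketch suggests, any sender that simply retries after the blocking period is accepted, which is the circumvention noted above.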
The greeting delay technique delays the delivery of all messages, whether suspicious or not. A greeting delay is typically a deliberate pause introduced by an SMTP server before it sends the SMTP greeting banner to the client. In accordance with RFC 2821, the client is supposed to wait until it has received this banner before it sends any data to the server. However, many spam-sending applications do not wait to receive this banner, and instead start sending data once the Transmission Control Protocol (TCP) connection is complete. As a result, the server can detect this and drop the connection. One problem with this approach is that legitimate email senders that do not follow the SMTP specifications exactly may also be caught by this mechanism, thereby resulting in loss of valid, non-spam messages.
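By way of illustration only, the greeting delay check might be sketched as follows using Python's standard socket and select modules. The function names, the 5-second default delay and the banner hostname mx.example.com are assumptions made for the example rather than details of any particular server.

```python
import select
import socket


def client_talked_early(conn: socket.socket, delay: float) -> bool:
    """Wait up to `delay` seconds; report whether the client sent data first.

    Note: the socket also becomes readable if the client disconnects,
    which this sketch treats the same as talking early.
    """
    readable, _, _ = select.select([conn], [], [], delay)
    return bool(readable)


def greet(conn: socket.socket, delay: float = 5.0,
          hostname: str = "mx.example.com") -> bool:
    """Apply the greeting delay, then send the banner or drop the connection.

    Returns True if the banner was sent, False if the connection was dropped.
    """
    if client_talked_early(conn, delay):
        conn.close()  # client did not wait for the banner
        return False
    conn.sendall(f"220 {hostname} ESMTP\r\n".encode("ascii"))
    return True
```

A legitimate client that happens to send data before the banner would be dropped by this check just as a spam-sending application would, which is the weakness noted above.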
The checksum-based filtering approach attempts to take advantage of the fact that the messages sent by a particular spammer are often mostly identical. Such filtering approaches attempt to strip out everything that might vary between messages, such as the recipient's name or email address, reduce what remains of the message to a checksum, and look up the resulting checksum in a database that collects checksums of messages that are known or likely to be spam. This method is easily thwarted because spammers use obfuscation techniques to make their messages appear unique, so the reputation associated with any given checksum always lags behind. The checksum clearing houses typically have difficulty keeping up with the ever-changing set of allegedly spam-associated checksums, and even when the checksums do effectively detect known spam, the delay in making the association means that most such spam has already been delivered to end users' inboxes.
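By way of illustration only, the strip-then-checksum approach might be sketched as follows. The normalization rules (removing email addresses and the recipient's name, collapsing whitespace, lowercasing), the use of SHA-256 and the in-memory set standing in for the checksum database are all assumptions made for the example; real clearing houses may use different rules and digests.

```python
import hashlib
import re

# Simplified pattern for email addresses (illustrative, not RFC-complete).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def normalize(body: str, recipient_name: str) -> str:
    """Strip parts assumed to vary between copies of the same spam."""
    text = EMAIL_RE.sub("", body)
    text = re.sub(re.escape(recipient_name), "", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip().lower()


def message_checksum(body: str, recipient_name: str) -> str:
    """Reduce the normalized message to a hex digest."""
    normalized = normalize(body, recipient_name)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_known_spam(body: str, recipient_name: str, known: set) -> bool:
    """Look the checksum up in a set standing in for the spam database."""
    return message_checksum(body, recipient_name) in known


if __name__ == "__main__":
    first = "Dear Alice, buy now! Sent to alice@example.com"
    second = "Dear Bob,  buy now! Sent to bob@example.org"
    obfuscated = "Dear Carol, buy n0w! Sent to carol@example.net"
    known = {message_checksum(first, "Alice")}
    print(is_known_spam(second, "Bob", known))        # same text once stripped
    print(is_known_spam(obfuscated, "Carol", known))  # one changed character
```

The obfuscated variant differs by a single character yet produces a different checksum, which illustrates why trivial per-message variation is enough to evade an exact-match lookup.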
In view of the foregoing limitations of anti-spam techniques and the ineffectiveness of various other existing anti-spam methodologies, there is a continuing need for improved anti-spam systems and services.