1. Field
Embodiments of the present invention generally relate to network security systems such as firewalls and filters or other devices used in such systems for identifying and filtering unwanted e-mail messages or “spam,” and in particular to methods and systems for classifying e-mail messages as spam by standardizing, tuning, and then combining outputs of a plurality of spam classifiers or classification tools using a fuzzy logic voting algorithm or formula.
2. Description of the Related Art
The use of the Internet and other digital communication networks to exchange information and messages has transformed the way in which people and companies communicate. E-mail or electronic mail is used by nearly every user of a computer or other electronic device that is connected to a digital communication network, such as the Internet, to transmit and receive messages, i.e., e-mail messages. While transforming communications, the use of e-mail has also created its own set of issues and problems that must be addressed by the information technology and communications industries to encourage the continued expansion of e-mail and other digital messaging.
One problem associated with e-mail is the transmittal of unsolicited and, typically, unwanted e-mail messages by companies marketing products and services, which a recipient or addressee of the message must first determine is unwanted and then delete. The volume of unwanted junk e-mail messages or “spam” transmitted by marketing companies and others is increasing rapidly, with research groups estimating that spam is increasing at a rate of twenty percent per month. Spam is anticipated to cost corporations in the United States alone millions of dollars due to lost productivity. As spam volume has grown, numerous methods have been developed and implemented in an attempt to identify and filter or block spam before a targeted recipient or addressee receives it. Anti-spam devices or components are typically built into network firewalls or Message Transfer Agents (MTAs) and process incoming (and, in some cases, outgoing) e-mail messages before they are received at a recipient e-mail server, which later transmits received e-mail messages to the recipient device or message addressee.
Anti-spam devices utilize various methods for classifying or identifying e-mail messages as spam including, but not limited to: domain level blacklists and whitelists, heuristics engines, statistical classification engines, checksum clearinghouses, IP and/or other reputation, message signatures, sender behavior analysis, “honeypots,” and authenticated e-mail. New methods are developed on an ongoing basis as spam continues to change and evolve. Each of these methods may be used individually or in various combinations. While providing a significant level of control over spam, existing techniques of identifying e-mail messages as spam often do not provide satisfactory results. For example, some techniques are unable to accurately identify all spam, and it is undesirable to fail to identify even a small percentage of the vast volume of junk e-mail messages, as this can burden employees and other message recipients. On the other hand, some spam classification techniques can inaccurately identify a message as spam, and it is undesirable to falsely identify messages as junk or spam, i.e., to issue false positives, as this can result in important or wanted messages being blocked and lost or quarantined and delayed, creating other issues for the sender and receiver of the messages. Hence, there is a need for a method of accurately identifying and filtering unwanted junk e-mail messages or spam that also creates no or few false positives.
As an example of deficiencies in existing spam filters, sender blacklists are implemented by processing incoming e-mail messages to identify the source or sender of the message and then filtering all e-mail messages originating from a source that was previously identified as a spam generator and placed on the list, i.e., the blacklist. Spam generators often defeat blacklists because the spam generators are aware that blacklists are utilized and respond by falsifying the source of their e-mail messages so that the source does not appear on a blacklist. There are also deficiencies in heuristics, rules, and statistical classification engines. Rules or heuristics for identifying junk e-mails or spam based on the informational content of the message, such as words or phrases, are fooled by spam generators when the spam generators intentionally include content that makes the message appear to be a non-spam message and/or exclude content that is used by the rules as indicating spam. Spam generators are able to fool many anti-spam engines because the workings of the engines are public knowledge or can be readily reverse engineered to determine what words, phrases, or other informational content is used to classify a message as spam or, in contrast, as not spam.
In an attempt to better classify e-mail messages, spam classification systems have been implemented that apply multiple spam classification tools to each message. Unfortunately, these combined tool systems have not been able to fully control the distribution of spam. Existing combined tool systems may poll each tool for its output or classification results. In some cases, the results are combined by Boolean or conditional logic, which leads to problems in obtaining useful or correct results when the number of classifiers becomes large. Additionally, two “weak” or marginal “not spam” results may be combined to produce a firm or final “no” unless complicated ad hoc conditions are used to make the combined determination a more proper “yes” or “spam” result. In some systems, the results of the tools are combined with each tool having an equal voice or each tool having one vote. For example, in a system using three classification tools, a message may be identified as spam when two of the three tools determine a message is spam. Such an equal voice polling technique is generally not effective because it does not take into account the “confidence” of each tool. This polling technique may be used because the outputs of the tools are not standardized and are difficult to combine. Other systems apply a score to the results of each tool and then average the scores, but, again, this results in an averaging or scoring that gives each classification tool an equal voice or vote, which may result in false positives or failure to identify a portion of received spam messages.
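The equal voice polling and score averaging described above can be illustrated with a minimal Python sketch. The tool names, verdicts, confidence values, scores, and the 0.5 threshold are all hypothetical and chosen only for illustration; they do not describe any particular existing system.

```python
from dataclasses import dataclass


@dataclass
class ToolResult:
    name: str          # hypothetical tool label
    is_spam: bool      # the tool's yes/no verdict
    confidence: float  # 0.0 to 1.0, how sure the tool is of its own verdict


def majority_vote(results):
    """One tool, one vote: spam only when more than half of the tools say spam."""
    spam_votes = sum(1 for r in results if r.is_spam)
    return spam_votes * 2 > len(results)


def equal_weight_average(scores, threshold=0.5):
    """Average per-tool spam scores with equal weight; spam at or above threshold."""
    return sum(scores) / len(scores) >= threshold


# Two marginal "not spam" verdicts and one highly confident "spam" verdict.
results = [
    ToolResult("blacklist", False, 0.55),
    ToolResult("heuristics", False, 0.51),
    ToolResult("statistical", True, 0.99),
]

# The vote ignores the confidence field entirely, so the two weak
# "not spam" verdicts outvote the single confident "spam" verdict.
vote_says_spam = majority_vote(results)

# Equal-weight averaging of assumed spam scores: one strong score is
# diluted by two low scores from the other tools.
average_says_spam = equal_weight_average([0.1, 0.2, 0.9])
```

In this sketch both combiners report "not spam" even though one tool is nearly certain the message is spam, which is the weakness attributed above to equal voice schemes.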
In other classification systems, one or more classification tools are allowed to overrule or trump the outputs of all the other tools, but this is undesirable when these tools may also be fooled or produce incorrect classification results. For example, some combined classification tool systems allow blacklist or whitelist classifiers to overrule heuristic and other classification tools. However, as indicated earlier, whitelists can be fooled by providing a false source of an e-mail message and blacklists can falsely identify e-mail as spam when a source is inappropriately added to the list of spam sources. As a result, existing techniques of providing more weight or confidence to particular classification tools have not been entirely successful in better identifying spam messages.
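The override behavior described above can be sketched as follows. This is an illustrative assumption of how such a system might be structured, not a description of any specific product; the function name, the example domains, the heuristic scores, and the 0.5 threshold are hypothetical.

```python
def classify_with_override(sender, heuristic_score, whitelist, blacklist, threshold=0.5):
    """Return True for spam. List membership trumps the heuristic score."""
    if sender in whitelist:
        return False  # whitelist overrules every other tool
    if sender in blacklist:
        return True   # blacklist overrules every other tool
    return heuristic_score >= threshold


whitelist = {"partner.example"}
blacklist = {"bulk.example"}

# A message with a falsified source that matches a whitelisted domain is
# passed as "not spam" even though the heuristic score is very high.
spoofed_is_spam = classify_with_override("partner.example", 0.97, whitelist, blacklist)

# A legitimate sender that was inappropriately added to the blacklist is
# flagged as spam even though the heuristic score is very low.
mislisted_is_spam = classify_with_override("bulk.example", 0.02, whitelist, blacklist)
```

Both outcomes are wrong in this sketch because the overriding tool's output is accepted without regard to what the other tools report.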
There remains a need for an improved method and system for accurately classifying e-mail messages as unwanted or as spam. Preferably, such a method and system would be adapted to utilize existing (or later developed) classification tools to produce a single classification result that is more accurate and reliable than the results of the individual classification tools. Further, such a method and system preferably would allow later developed classification tools to be added to enhance the single classification result without significant modifications.