Electronic messages have become an indispensable part of modern communication. Messages such as email and instant messages are popular because they are fast, easy to use, and have essentially no incremental cost. Unfortunately, these same advantages are exploited by marketers who regularly send out unsolicited junk messages (also referred to as “spam”). Spam messages are a nuisance for users. They clog users' inboxes, waste system resources, often promote distasteful subjects, and sometimes serve as vehicles for outright scams.
Many existing spam blocking systems employ various techniques for identifying and filtering spam. For example, some systems generate a thumbprint (also referred to as a signature) for each incoming message and look up the thumbprint in a database of thumbprints for known spam messages. If the thumbprint of the incoming message is found in the spam database, then the message is determined to be spam and is discarded.
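The thumbprint lookup described above can be sketched roughly as follows. This is only an illustration: the text does not say how a thumbprint is computed, so the use of SHA-256 over a case- and whitespace-normalized body is an assumption, and `known_spam_thumbprints` is a hypothetical in-memory stand-in for the spam database.

```python
import hashlib


def thumbprint(message_body: str) -> str:
    """Return a hex digest that serves as the message's thumbprint.

    Assumption: lowercase the text and collapse whitespace before hashing,
    so trivially reformatted copies of a message map to the same value.
    """
    normalized = " ".join(message_body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical stand-in for the database of thumbprints of known spam.
known_spam_thumbprints = {thumbprint("Buy cheap pills now!!!")}


def is_known_spam(message_body: str) -> bool:
    """Report whether the message's thumbprint is in the spam database."""
    return thumbprint(message_body) in known_spam_thumbprints
```

An exact hash of this kind matches only messages that are identical after normalization; a deployed system would likely need a thumbprint that tolerates small variations in the message.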
Other commonly used techniques include whitelists, blacklists, statistical classifiers, rules, address verification, and challenge-response. The whitelist technique maintains a list of allowable sender addresses. The sender address of an incoming message is looked up in the whitelist; if a match is found, the message is automatically determined to be a legitimate non-spam message. The blacklist technique maintains a list of sender addresses that are not allowed and uses those addresses for blocking spam messages. The statistical classifier technique learns classification methods and parameters from existing data. The rules technique applies a predefined set of rules to an incoming message, and determines whether the message is spam based on the outcome of the rules. The address verification technique determines whether the sender address is valid by sending an automatic reply to the sender of an incoming message and monitoring whether the reply bounces. A bounced reply indicates that the incoming message has an invalid sender address and is likely to be spam. The challenge-response technique sends a challenge message to the sender of an incoming message, and the incoming message is delivered only if the sender sends a valid response to the challenge message.
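Three of the techniques above (whitelist, blacklist, and rules) are simple enough to sketch in a few lines. The addresses, spam phrases, and function names below are all hypothetical and chosen only for illustration; each check returns "legitimate", "spam", or None when it cannot decide.

```python
from typing import Optional

# Hypothetical example data; a real system would load these from storage.
WHITELIST = {"alice@example.com"}
BLACKLIST = {"promo@bulk-mailer.example"}
SPAM_PHRASES = ("free money", "act now")


def whitelist_check(sender: str) -> Optional[str]:
    """A whitelisted sender is automatically treated as legitimate."""
    return "legitimate" if sender.lower() in WHITELIST else None


def blacklist_check(sender: str) -> Optional[str]:
    """A blacklisted sender is treated as a source of spam."""
    return "spam" if sender.lower() in BLACKLIST else None


def rules_check(subject: str) -> Optional[str]:
    """Assumed rule: flag the message if the subject contains a spam phrase."""
    lowered = subject.lower()
    return "spam" if any(phrase in lowered for phrase in SPAM_PHRASES) else None
```

Returning None for "no decision" keeps each check independent, so a caller can choose which checks to run and in what order.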
Some existing systems apply multiple techniques sequentially to the same message in order to maximize the probability of finding spam. However, many of these techniques have significant overhead and can adversely affect system performance when applied indiscriminately. A technique may require a certain amount of system resources; for example, it may generate network traffic or require database connections. If such a technique were applied to all incoming messages, the demand on the network or database resources would be large and could slow down the overall system.
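The cost of indiscriminate sequential application can be made concrete with a small sketch. The technique names and the per-call cost units are invented for illustration and do not come from any measurement; the point is only that every message pays the full cost of every technique.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]
# (name, assumed cost in arbitrary units, function that flags spam)
Technique = Tuple[str, int, Callable[[Message], bool]]


def classify_sequentially(message: Message,
                          techniques: List[Technique]) -> Tuple[bool, int]:
    """Apply every technique in order; return (is_spam, total_cost)."""
    is_spam = False
    total_cost = 0
    for _name, cost, flags_spam in techniques:
        total_cost += cost  # the cost is paid whether or not the check helps
        if flags_spam(message):
            is_spam = True
    return is_spam, total_cost


# Hypothetical techniques with made-up costs.
TECHNIQUES: List[Technique] = [
    ("rules", 1, lambda m: "free" in m["subject"].lower()),
    ("thumbprint_db", 5, lambda m: False),         # stands in for a database lookup
    ("address_verification", 20, lambda m: False), # stands in for a network round trip
]
```

Under these assumed costs, each message consumes 26 units regardless of how easy it is to classify, so the total load grows linearly with message volume.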
Also, indiscriminate application of these techniques may result in lower accuracy. For example, if a legitimate email message includes certain key spam words in its subject, the message may be classified as spam if certain rules are applied. However, a more intelligent spam detection system would discover that the message is from a valid address using the address verification technique, thus allowing the message to be properly delivered. It would be useful to have a spam detection system that uses different spam blocking techniques more intelligently. It would be desirable for the system to utilize resources more efficiently and classify messages more accurately.