Electronic documents can be classified in many ways. Classification of electronic documents (e.g., electronic communications) may be based upon the contents of the communication, the source of the communication, and whether or not the communication was solicited by the recipient, among other criteria. For example, electronic communications may be classified as spam. Whether an electronic communication is spam depends upon the subjective opinion of the recipient, though spam is generally understood to be any unsolicited, non-consensual electronic communication, typically of a commercial nature and usually transmitted in bulk to many recipients. Spam includes unsolicited commercial e-mail (UCE), unsolicited bulk e-mail (UBE), gray mail, and ordinary “junk mail,” and is typically used to advertise products.
Receiving and addressing spam is costly and annoying, so considerable effort is being made to detect spam and prevent its delivery to the intended recipient.
One prior art scheme for spam detection involves application of a rules-based filtering system. Such rules may be based on terms within the communication. For example, if the subject line of the communication includes the term “make money,” the communication may be determined to be spam. Such rules may also be based upon the absence of information. For example, if a communication does not identify the sender, the communication may be determined to be spam.
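By way of illustration only, a rules-based filter of the kind described above may be sketched as follows. The sketch assumes a message is represented as a simple mapping of field names to text; the `Rule` type, the rule names, and the `is_spam` helper are hypothetical and are not drawn from any particular prior art system. The two rules correspond to the two examples given above: a term present in the subject line, and the absence of sender information.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical message representation: field name -> text.
Message = Dict[str, str]


@dataclass
class Rule:
    name: str
    matches: Callable[[Message], bool]


# Two illustrative rules; a deployed system may carry hundreds or thousands.
RULES: List[Rule] = [
    # Rule based on a term within the communication.
    Rule(
        "subject_contains_make_money",
        lambda m: re.search(r"make money", m.get("subject", ""), re.IGNORECASE)
        is not None,
    ),
    # Rule based on the absence of information.
    Rule("missing_sender", lambda m: not m.get("from", "").strip()),
]


def matched_rules(message: Message, rules: List[Rule] = RULES) -> List[str]:
    """Return the names of every rule the message triggers."""
    return [rule.name for rule in rules if rule.matches(message)]


def is_spam(message: Message) -> bool:
    """Treat a message as spam if any single rule matches (an assumption)."""
    return bool(matched_rules(message))
```

Every received message is checked against every rule, so the work per message grows with the size of the rule set.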
Such schemes, while somewhat successful in detecting spam, have several serious drawbacks. For example, such schemes usually employ hundreds or thousands of rules, each of which is formulated independently, so the cost of developing the rules is prohibitive. Also, because each received electronic communication has to be validated against this multitude of rules, such schemes require expensive hardware to support the intensive computation that such validation requires. Moreover, spam senders are adept at altering their spam to avoid detection by such rules.
Another prior art scheme for detecting spam uses statistical classifiers (e.g., a Bayesian classifier) that determine whether an electronic communication is spam based upon an analysis of words that occur frequently in spam. Such statistical classifier-based schemes can be defeated by various methods known to producers of spam (spammers). For example, spammers may encode the body of an electronic communication to avoid detection based upon words within the electronic communication.
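A minimal sketch of a word-based Bayesian classifier, and of how encoding the body can defeat it, is given below for illustration. It assumes a naive Bayes model with add-one smoothing trained on a toy corpus; the class name, the tokenizer, the training messages, and the choice of Base64 as the encoding are all assumptions made for the example rather than details of any particular prior art classifier.

```python
import base64
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSpamClassifier:
    """Toy naive Bayes model; must be trained on both labels before use."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def _log_score(self, tokens, label):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        # Start from the log prior for the label.
        score = math.log(self.doc_counts[label] / total_docs)
        for token in tokens:
            if token not in vocab:
                continue  # words never seen in training carry no evidence
            count = self.word_counts[label][token]
            # Add-one (Laplace) smoothed log likelihood of the word.
            score += math.log((count + 1) / (total_words + len(vocab)))
        return score

    def is_spam(self, text):
        tokens = tokenize(text)
        return self._log_score(tokens, "spam") > self._log_score(tokens, "ham")


classifier = NaiveBayesSpamClassifier()
for body in ("make money fast", "free money offer now"):
    classifier.train(body, "spam")
for body in ("meeting agenda for monday", "lunch on friday"):
    classifier.train(body, "ham")

plain_body = "make money fast"
# The same body, Base64-encoded as a spammer might send it.
encoded_body = base64.b64encode(plain_body.encode("utf-8")).decode("ascii")
```

With this toy corpus the plain body is scored as spam, while the encoded body contains no word the classifier has seen, so only the prior probabilities remain and the message is not flagged.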
More sophisticated statistical classifiers have recently been developed that classify communications based upon structural attributes of the communication rather than upon its words alone. Such schemes, while addressing some of the drawbacks of previous statistical classification schemes, have disadvantages of their own with regard to the computational resources they require.
Still another prior art scheme uses the classification analysis of a community of users in order to classify electronic communications. In such a scheme, a number of users identify a particular communication as spam. When the number of users identifying the particular communication as spam reaches a specified threshold, the communication is determined to be spam. This type of user-feedback classification scheme has the disadvantage that it takes a long time to classify a communication, because no determination is made until enough users have reported it.
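The threshold behavior of such a user-feedback scheme may be sketched as follows. The sketch is illustrative only: the class name, the use of a SHA-256 digest of the message text to recognize "the same" communication, and the rule that each user is counted once are assumptions introduced for the example.

```python
import hashlib
from collections import defaultdict


class CommunitySpamVotes:
    """Collects user reports and flags a message once enough users agree."""

    def __init__(self, threshold):
        self.threshold = threshold
        # fingerprint -> set of user ids that reported the message
        self._reporters = defaultdict(set)

    @staticmethod
    def fingerprint(message_text):
        # Assumption: identical text identifies the same communication.
        return hashlib.sha256(message_text.encode("utf-8")).hexdigest()

    def report(self, user_id, message_text):
        """Record that a user identified the message as spam."""
        self._reporters[self.fingerprint(message_text)].add(user_id)

    def is_spam(self, message_text):
        """True once the number of distinct reporters reaches the threshold."""
        reporters = self._reporters.get(self.fingerprint(message_text), ())
        return len(reporters) >= self.threshold
```

Until the threshold is reached, every copy of the communication is still delivered as ordinary mail, which is the source of the classification delay noted above.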