Electronic messaging, particularly electronic mail (“e-mail”) carried over the Internet, is rapidly becoming not only pervasive in society but also, given its informality, ease of use and low cost, a preferred mode of communication for many individuals and organizations.
Unfortunately, as has occurred with more traditional forms of communication (e.g., postal mail and telephone), e-mail recipients are increasingly being subjected to unsolicited mass mailings. With the explosion, particularly in the last few years, of Internet-based commerce, a wide and growing variety of electronic merchandisers is repeatedly sending unsolicited mail advertising their products and services to an ever-expanding universe of e-mail recipients. Most consumers who order products or otherwise transact with a merchant over the Internet expect to and, in fact, regularly receive such merchant solicitations. However, electronic mailers are continually expanding their distribution lists to penetrate deeper into society in order to reach ever-increasing numbers of recipients. For example, recipients who merely provide their e-mail addresses in response to perhaps innocuous-appearing requests for visitor information generated by various web sites often find, later upon receipt of unsolicited mail and much to their displeasure, that they have been included on electronic distribution lists. This occurs without the knowledge, let alone the assent, of the recipients. Moreover, as with postal direct mail lists, an electronic mailer will often disseminate its distribution list, whether by sale, lease or otherwise, to another such mailer, and so forth with subsequent mailers. Consequently, over time, e-mail recipients often find themselves barraged by unsolicited mail resulting from separate distribution lists maintained by a wide and increasing variety of mass mailers. Though certain avenues exist, based on mutual cooperation throughout the direct mail industry, through which an individual can request that his(her) name be removed from most direct mail postal lists, no such mechanism exists among electronic mailers.
Once a recipient finds him(her)self on an electronic mailing list, that individual cannot readily, if at all, remove his(her) address from it, thus effectively guaranteeing that he(she) will continue to receive unsolicited mail, often in increasing amounts, from that list and oftentimes from other lists as well. This occurs simply because the sender either prevents a recipient of a message from identifying the sender of that message (such as by sending mail through a proxy server), and hence precludes the recipient from contacting the sender in an attempt to be excluded from a distribution list, or simply ignores any request previously received from the recipient to be so excluded.
An individual can easily receive hundreds of unsolicited postal mail messages over the course of a year, or less. By contrast, given the ease and insignificant cost with which e-distribution lists can be readily exchanged and e-mail messages disseminated across large numbers of addressees, a single e-mail addressee included on several distribution lists can expect to receive a considerably larger number of unsolicited messages over a much shorter period of time. Furthermore, while many unsolicited e-mail messages (e.g., offers for discount office or computer supplies or invitations to attend conferences of one type or another) are benign, others, such as pornographic, inflammatory and abusive material, can be highly offensive to certain recipients.
Unsolicited e-mail messages are commonly referred to as “spam”. Similar to the task of handling junk postal mail, an e-mail recipient must sift through his(her) incoming mail to remove spam. Unfortunately, the determination of whether a given e-mail message is spam or not is highly dependent on the particular recipient and the content of the message; what may be spam to one recipient may not be so to another. Frequently, an electronic mailer will prepare a message such that its true content is not apparent from its subject line and can only be discerned from reading the body of the message. Hence, the recipient often has the unenviable task of reading through each and every message he(she) receives on any given day, rather than just scanning its subject line, to fully remove spam messages. Needless to say, such filtering (often performed manually) can be a laborious, time-consuming task.
In an effort to automate the task of detecting abusive newsgroup messages (so-called “flames”), the art teaches an approach of classifying newsgroup messages through a rule-based text classifier. See E. Spertus, “Smokey: Automatic Recognition of Hostile Messages”, Proceedings of the Conference on Innovative Applications in Artificial Intelligence (IAAI), 1997. Here, semantic and syntactic textual classification features are first determined by feeding an appropriate corpus of newsgroup messages, as a training set, through a probabilistic decision tree generator. Given handcrafted classifications of each of these messages as being a “flame” or not, the generator delineates specific textual features that, if present or not in a message, can predict whether, as a rule, the message is a flame or not. Those features that correctly predict the nature of the message with a sufficiently high probability are then selected for subsequent use. Thereafter, to classify an incoming message, each sentence in that message is processed to yield a multi-element (e.g., 47 element) feature vector, with each element simply signifying the presence or absence of a different feature in that sentence. The feature vectors of all sentences in the message are then summed to yield a message feature vector (for the entire message). The message feature vector is then evaluated through corresponding rules produced by the decision tree generator to assess, given a combination and number of features that are present or not in the entire message, whether that message is a flame or not. For example, as one semantic feature, the author noticed that phrases having the word “you” modified by a certain noun phrase, such as “you people”, “you bozos”, “you flamers”, tend to be insulting. An exception is the phrase “you guys” which, in use, is rarely insulting. Therefore, one feature is whether any of the former phrases exists.
The associated rule is that, if such a phrase exists, the sentence is insulting and the message is a flame. Another feature is the presence of the word “thank”, “please” or phrasal constructs having the word “would” (as in: “Would you be willing to e-mail me your logo”) but not the words “no thanks”. If any such phrases or words are present (with the exception of “no thanks”), an associated rule, which the author refers to as the “politeness rule”, categorizes the message as polite and hence not a flame. With some exceptions, the rules used in this approach are not site-specific; that is, for the most part they use the same features and operate in the same manner regardless of the addressee being mailed.
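The per-sentence feature vectors, their summation into a message feature vector, and the two example rules described above can be illustrated with a minimal sketch. It is not the implementation of the cited publication: the regular expressions, the function names, the reduction to two features (rather than, e.g., 47), the naive sentence splitting, the precedence of the insult rule over the politeness rule, and the “undecided” outcome are all assumptions made here for illustration only.

```python
import re

# Hypothetical, simplified stand-ins for two of the features described above.
INSULTING_YOU = re.compile(r"\byou (people|bozos|flamers)\b", re.IGNORECASE)
POLITE = re.compile(r"\b(thank|please|would)", re.IGNORECASE)
NO_THANKS = re.compile(r"\bno thanks\b", re.IGNORECASE)


def sentence_features(sentence):
    """Return a binary vector [insulting_you, polite] for one sentence."""
    insulting = 1 if INSULTING_YOU.search(sentence) else 0
    # "no thanks" is excluded before looking for polite words.
    remainder = NO_THANKS.sub("", sentence)
    polite = 1 if POLITE.search(remainder) else 0
    return [insulting, polite]


def message_features(message):
    """Sum the per-sentence vectors into one message feature vector."""
    # Naive sentence splitting on ., ! and ? (an assumption of this sketch).
    sentences = [s for s in re.split(r"[.!?]+", message) if s.strip()]
    totals = [0, 0]
    for sentence in sentences:
        for i, value in enumerate(sentence_features(sentence)):
            totals[i] += value
    return totals


def classify_message(message):
    """Apply the two example rules; the rule order is assumed, not sourced."""
    insulting, polite = message_features(message)
    if insulting > 0:
        return "flame"
    if polite > 0:
        return "not_flame"
    return "undecided"  # neither example rule fired
```

For instance, a message containing “you bozos” is labeled a flame under this sketch, whereas “No thanks, you guys.” triggers neither rule, since “you guys” is the stated exception and “no thanks” is excluded from the politeness feature.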
A rule-based textual e-mail classifier, here specifically one involving learned “keyword-spotting rules”, is described in W. W. Cohen, “Learning Rules that Classify E-mail”, 1996 AAAI Spring Symposium on Machine Learning in Information Access, 1996 (hereinafter the “Cohen” publication). In this approach, a set of e-mail messages previously classified into different categories is provided as input to the system. Rules are then learned from this set in order to classify incoming e-mail messages into the various categories. While this method does involve a learning component that allows for automatic generation of rules, these rules simply make yes/no distinctions for classification of e-mail messages into different categories without providing any confidence measure for a given prediction. Moreover, in this work, the actual problem of spam detection was not addressed. In this regard, rule-based classifiers suffer various serious deficiencies which, in practice, would severely limit their use in spam detection. First, existing spam detection systems require users to manually construct appropriate rules to distinguish between legitimate mail and spam. Most recipients will not bother to undertake such laborious tasks. As noted above, an assessment of whether a particular e-mail message is spam or not can be rather subjective, depending on its recipient. What is spam to one recipient may, for another, not be. Furthermore, non-spam mail varies significantly from person to person. Therefore, for a rule-based classifier to exhibit acceptable performance in filtering most spam from an incoming mail stream, the recipient must construct and program a set of classification rules that accurately distinguishes between what constitutes spam and what constitutes non-spam (legitimate) e-mail. Properly doing so can be an extremely complex, tedious and time-consuming task even for a highly experienced and knowledgeable computer user.
Second, the characteristics of spam and non-spam e-mail may change significantly over time, whereas rule-based classifiers are static (unless the user is constantly willing to make changes to the rules). Indeed, mass e-mail senders routinely modify the content of their messages in a continual attempt to prevent (“outwit”) recipients from initially recognizing these messages as spam and then discarding those messages without fully reading them. Thus, unless a recipient is willing to continually construct new rules or update existing rules to track changes to spam (as that recipient perceives such changes), then, over time, a rule-based classifier becomes increasingly inaccurate at distinguishing spam from desired (non-spam) e-mail for that recipient, thereby further diminishing the utility of the classifier and frustrating the user/recipient.
Alternatively, a user might consider employing a method for learning rules (as in the Cohen publication) from his(her) existing spam in order to adapt, over time, to changes in an incoming e-mail stream. Here, the problems of a rule-based approach are more clearly highlighted. Rules are based on logical expressions; hence, as noted above, rules simply yield yes/no distinctions regarding the classification for a given e-mail message. Problematically, such rules provide no level of confidence for their predictions. Inasmuch as users may have various tolerances as to how aggressively they would want to filter their e-mail to remove spam, then, in an application such as detecting spam, rule-based classification would become rather problematic. For example, a conservative user may require that the system be very confident that a message is spam before discarding it, whereas another user may not be so cautious. Such varying degrees of user precaution cannot be easily incorporated into a rule-based system such as that described in the Cohen publication.
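The limitation noted above can be made concrete with a minimal sketch. The rule contents, function names, and numeric values below are hypothetical and are not taken from the Cohen publication; the sketch merely contrasts a logical keyword rule, which can only answer yes or no, with the kind of per-user confidence threshold that such a rule has no way to honor.

```python
# Hypothetical keyword-spotting rule, written as a disjunction of keyword
# conjunctions: the rule fires if every word of any one clause is present.
SPAM_RULE = [{"free", "offer"}, {"winner"}]


def rule_says_spam(message):
    """Yes/no output only: the rule carries no confidence in its answer."""
    words = set(message.lower().split())
    return any(clause <= words for clause in SPAM_RULE)


def should_discard(spam_confidence, user_threshold):
    """What varying user caution would require: a confidence value in
    [0, 1] compared against a threshold chosen by each user (both are
    assumptions of this sketch)."""
    return spam_confidence >= user_threshold


# The rule gives every user the same answer for a given message.
print(rule_says_spam("Free offer inside"))

# With a confidence value, a conservative user (threshold 0.95) and a less
# cautious user (threshold 0.6) could treat the same message differently.
print(should_discard(0.8, 0.95), should_discard(0.8, 0.6))
```

Because `rule_says_spam` returns only a Boolean, no value is available to compare against a user-selected threshold, which is the difficulty described above.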