The present invention relates generally to digital data processors and networks of intercommunicating digital data processors capable of sending and receiving electronic mail and other types of electronic messages. In particular, the present invention relates to a system and method for automatically detecting and handling unsolicited and undesired electronic mail such as Unsolicited Commercial E-mail (UCE), also referred to as "spam."
Every day, millions of Internet users receive unwelcome electronic messages, typically in the form of electronic mail (e-mail). The most familiar example of these messages is Unsolicited Commercial E-mail (UCE), commonly referred to as "spam." UCE typically promotes a particular good, service or web site, and is sent indiscriminately to thousands, or even millions, of people, the vast majority of whom find the UCE annoying or even offensive. UCE is widely perceived as a significant problem. Articles concerning UCE appear on an almost daily basis on technology news services, such as CNET. Several commercial and shareware products have been written to reduce e-mail users' exposure to UCE. At least one start-up company, Bright Light Technologies, has been founded for the sole purpose of producing and selling technology to detect and filter out UCE. Legal restrictions are being contemplated by several states, and have recently been put in place in more than one state.
Other forms of undesired e-mail include rumors, hoaxes and chain letters. Each of these forms of e-mail can proliferate within a network of users very quickly. Rumors can spread with much vigor throughout a user population and can result in wasted time and needless concern. The most successful computer virus hoaxes have a longevity comparable to that of computer viruses themselves, and can cause a good deal of panic. Finally, circulation of chain letters is a phenomenon that is serious enough to be forbidden by company policies or even federal laws.
A somewhat different class of e-mail, the transmission or receipt of which is often undesirable, is confidential e-mail. Confidential e-mail is not supposed to be forwarded to anyone outside of some chosen group. Therefore, there is a concern for controlling the distribution of these messages.
A common characteristic of UCE and electronically-borne rumors, hoaxes, and chain letters is that there is likely to be widespread agreement that the content of the message in question (and, thus, transmission thereof) is undesirable (as opposed to merely uninteresting). This, along with the fact that such messages are in electronic form, makes it possible to contemplate various technologies that attempt to automatically detect such e-mail and render it harmless.
To date, UCE has been the exclusive focus of such efforts. Existing UCE solutions take a number of different forms. Some are software packages designed to work with existing e-mail packages (e.g., MailJail, which is designed to work with the Eudora mail system) or e-mail protocols (e.g., Spam Exterminator, which works for any e-mail package that supports the POP3 protocol on the Windows 95, Windows 98 or Windows NT platforms). Other solutions are integrated into widely used mail transfer software (e.g., SendMail v. 8.8, a recent upgrade of the SendMail mail transfer agent, which provides a facility for blocking mail relay from specified sites, or alternatively from any site other than those explicitly allowed). Another type of solution is an e-mail filtering service, e.g., the one offered by junkproof.com, which fines users who send UCE. Bright Light Technologies proposes to combine a software product with a service.
However they may be packaged, the vast majority of these solutions are composed of two main steps: recognition and response. In the recognition step, a given e-mail message is examined to determine whether it is likely to be spam. If the message is deemed likely to be spam during the recognition step, then some response is made. Typical responses include automatically deleting the message, labeling it or flagging it to draw the user's attention to the fact that it may be spam, placing it in a lower priority mail folder, etc., perhaps coupled with sending a customizable message back to the sender.
The main technical challenges lie in the recognition step. Two of the most important challenges are keeping the rates of false positives (falsely labeling legitimate mail as spam) and false negatives (failing to identify spam as such) as low as possible. A wide variety of commercial and freeware applications employ combinations of and/or variations on the following basic spam detection strategies to address the general problem.
Often, persons who send spam ("spammers") set up special Internet address domains from which they send spam. One common anti-spam solution is to maintain a blacklist of "spam" domains, and to reject, not deliver or return to the sender any mail originating from one of these domains. When spam begins to issue from a new "spam" domain, that domain can be added to the blacklist.
For example, xmission.com has modified sendmail.cf rules to cause mail from named sites to be returned to the sender. Their text file (http://spam.abuse.net/spam/tools/dropbad.txt) lists several domains that are known to be set up solely for use by spammers, including moneyworld.com, cyberpromo.com, bulk-e-mail.com, bigprofits.com, etc. At http://www.webeasy.com:8080/spam/spam_download_table, one can find just over 1000 such blacklisted sites. Recent versions of SendMail (versions 8.8 and above) have been modified to facilitate the use of such lists, and this has been regarded as an important development in the battle against spam.
However, if used indiscriminately, this approach can lead to high rates of false positives and false negatives. For instance, if a spammer were to send spam from the aol.com domain, aol.com could be added to the blacklist. As a result, millions of people who legitimately send mail from this domain would have their mail blocked. In other words, the false positive rate would be unacceptably high. On the other hand, spammers can switch nimbly from a banned domain to a non-banned, newly-created one, or one that is used by many legitimate users, thus leading to many false negatives.
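As a rough illustration of the blacklist strategy and its false-positive risk, the following minimal Python sketch checks a sender address against a set of banned domains. The function names and the choice to match parent domains are assumptions made for illustration; the blacklist entries are two of the example domains named above.

```python
from typing import Set

# Illustrative blacklist built from example domains named in the text.
BLACKLIST: Set[str] = {"moneyworld.com", "cyberpromo.com"}


def sender_domain(address: str) -> str:
    """Return the lower-cased domain part of an e-mail address."""
    return address.rsplit("@", 1)[-1].strip().lower()


def is_blacklisted(address: str, blacklist: Set[str] = BLACKLIST) -> bool:
    """True if the sender's domain, or any parent domain, is on the list."""
    parts = sender_domain(address).split(".")
    # Assumption: a listing for example.com also covers mail.example.com.
    return any(".".join(parts[i:]) in blacklist for i in range(len(parts)))
```

If a large provider's domain such as aol.com were added to the set, `is_blacklisted` would reject every legitimate sender at that domain, which is the false-positive problem described above.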
A hallmark of spam is that it is sent to an extremely large number of recipients. There are often indications of this in the header of the mail message that can be taken as evidence that a message is likely to be spam. For example, the long list of recipients is typically dealt with by sending to a smaller set of collective names, so that the user's explicit e-mail address does not appear in the To: field.
Ross Rader of Internet Direct (Idirect) has published directions for setting up simple rules based on this characteristic of spam for a variety of popular e-mail programs, including Eudora Light, Microsoft Mail and Pegasus. When a mail message header matches the rule, that mail is automatically removed from the user's inbox and placed in a special folder where it can be examined later or easily deleted without inspection.
However, unless the user of this method puts a great deal of effort into personalizing these detection rules, the false positive rate has the potential to be quite high, so that a large proportion of legitimate e-mail will be classified as spam.
Spam is typically distinguished from ordinary e-mail in that it aggressively tries to sell a product, advocate visiting a pornographic web site, enlist the reader in a pyramid scheme or other monetary scam, etc. Thus, a piece of mail containing the text fragment "MAKE MONEY FAST" is more likely to be spam than one that begins "During my meeting with you last Tuesday."
Some anti-spam methods scan the body of each e-mail to detect keywords or keyphrases that tend to be found in spam, but not in other e-mail. The keyword and keyphrase lists are often customizable. This method is often combined with the domain- and header-based detection techniques described hereinabove. Examples of this technology include junkfilter (http://www.pobox.com/gsutter/junkmail), which works with procmail, Spam Exterminator and SPAM Attack Pro!.
Again, false positives may occur when ordinary e-mail messages contain banned keywords or keyphrases. This approach is prone to false negatives as well because the list of banned keyphrases would have to be updated several times per day to keep up with the influx of new instances of spam, and this is both technically difficult for the anti-spam vendor and unpalatable to the user.
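The keyword approach can be sketched in a few lines of Python. This is only an illustration of generic keyphrase scanning, not the method of any product named above; the phrase list and the function name are hypothetical.

```python
from typing import List

# Hypothetical, user-customizable list of banned keyphrases.
BANNED_PHRASES: List[str] = ["make money fast"]


def keyphrase_hits(body: str, phrases: List[str] = BANNED_PHRASES) -> List[str]:
    """Return the banned phrases found in the body, ignoring case and spacing."""
    text = " ".join(body.lower().split())
    return [phrase for phrase in phrases if phrase in text]
```

A legitimate message that merely quotes a banned phrase (for example, a warning to a colleague about a "make money fast" scheme) is also flagged, which illustrates the false-positive risk noted above.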
Spam Be Gone! is a freeware product that works with Eudora. It uses an instance-based classifier that records examples of spam and non-spam e-mail, and measures the similarity of each incoming e-mail to each of the instances, combining the similarity scores to arrive at a classification of the e-mail as spam or non-spam. The classifier is trained automatically for each individual user. It typically takes the user several weeks to a few months to develop a classifier.
After a sufficient amount of training, the false positive and false negative rates for this approach are claimed to be lower than for other techniques. In one cited case (http://www.internz.com/SpamBeGone/stats.html), which can be assumed to be an upper bound on the performance since an average over several users is not provided, the false negative rate was less than a few tenths of a percent after one or two months of training, while the false positive rate was 20% after one month and 5% after two months. Thus, even in the best case, 1 of every 20 messages labeled as spam will, in fact, be legitimate. This could be unacceptable, particularly if the anti-spam software responds in a strong manner, such as automatically deleting the mail or returning it to the sender.
All of the above UCE detection methods are "generic" in the sense that they use features that are generic to spam but much less common in ordinary non-spam e-mail. This is in contrast to "specific" detection techniques that are commonly employed by anti-virus programs to detect specific known computer viruses, typically by scanning host programs for special "signature" byte patterns that are indicative of specific viruses. Generic recognition techniques are attractive because they can catch new, previously unknown spam. However, as indicated hereinabove, their disadvantage is that they tend to yield unacceptably high false positive rates and, in some cases, unacceptably high false negative rates as well. Specific detection techniques typically have smaller false positive and false negative rates, but require more frequent updating than do generic techniques.
Generic detection techniques are even less likely to be helpful in recognizing other types of undesirable e-mail, such as rumors, hoaxes and chain letters or confidential e-mail. Recognition based on the sender's domain or other aspects of the mail header is unlikely to work at all. Generic recognition of hoaxes and chain letters on the basis of keywords or keyphrases present in the message body may be possible, but is likely to be more difficult than for spam because the range in content is likely to be broader. Generic recognition of confidential e-mail on the basis of text is almost certainly impossible because there is nothing that distinguishes confidential from non-confidential text in a way that is recognizable by any machine algorithm.
Bright Light Technologies promotes a different anti-spam product/service. Bright Light uses a number of e-mail addresses (or "probes") throughout the Internet which, in theory, receive only undesirable messages since they are not legitimate destinations. The messages received are read by operators located at a 24-hour-a-day operations center. These operators evaluate the messages and update rules which control a spam-blocking function in a mail server that serves a group of users.
While this method of UCE detection and response is inherently less vulnerable to false positives and false negatives because it uses specific rather than generic detection, it suffers from some drawbacks. Many of these stem from the considerable amount of manual effort required to maintain the service. The Bright Light operations center must employ experts who monitor streams of e-mail for spam, manually extract keywords and keyphrases that they believe to be good indicators of specific instances of spam, and store these keywords or keyphrases in a database. As it would most likely be prohibitive for any company to support such a set of experts on its own, any company wishing to protect itself in this way would be entirely dependent on continued, uninterrupted service by Bright Light's operations center. At least some companies might well prefer a solution that allows for greater freedom from an external organization, and greater customization than is likely to be achieved by a single organization. The crux of the problem is that Bright Light's method couples two tasks that ought to be independent of one another: labeling a message as undesirable, and extracting a signature from the undesirable message. If it were possible to reduce the requirement for manual input to that of labeling undesirable messages, this would enable localized collaborative determinations of undesirable messages. Furthermore, Bright Light does not describe a process by which experts extract auxiliary data that permit possible matches based on keywords or phrases to be tested more stringently by exact or approximate matching to entire specific messages (or large portions of them). Thus their specific solution is likely to be more vulnerable to false positives than one in which individual users would have the opportunity to specify more stringent conditions for message matching.
Another drawback is that the Bright Light solution is specifically targeted at UCE, as opposed to the broader class of undesirable messages that includes hoaxes, chain letters, and improperly forwarded confidential messages. Taken together, probe accounts may receive a reasonable fraction of all UCE, but it is unclear that they would attract chain letters and rumors.
It is, thus, an object of the present invention to provide an automatic, non-generic procedure for detecting and handling instances of all types of undesirable mail, with very low false positive and false negative rates.
A further object of the present invention is to provide an inexpensive solution which involves no staffing, but rather utilizes the users themselves to actively identify UCE.
A still further object of the present invention is to provide a system and method for preventing the undesired transmission and/or receipt of confidential e-mail messages.
The present invention provides an automated procedure for detecting and handling UCE and other forms of undesirable e-mail accurately, with low false negative rates and very low false positive rates. In contrast to existing generic detection methods, the present invention uses a specific detection technique to recognize undesirable messages. In other words, the system of the present invention efficiently detects undesirable messages on the basis of their exact or close matches to specific instances of undesirable messages. In contrast to the specific technique used by Bright Light, the character strings used to identify specific undesirable messages are derived completely automatically, and are supplemented with auxiliary data that permit the end user to tune the degree of match required to initiate various levels of response. A further point of contrast is that the automatic derivation of signature data permits greater flexibility because the only required manual input is the labeling of a particular message as undesirable. This permits ordinary users to work collaboratively to define undesirable messages, freeing them from dependence on an external, centralized operations center where experts must manually label and extract signatures from undesirable messages. It also permits authorities on hoaxes and chain letters to identify messages containing them, without further imposing the burden of extracting a signature, which would require a very different sort of expertise. Another point of contrast is that the extracted signature data can permit users to define independent, flexible definitions of what constitutes a given level of match, ranging from matching a signature to matching an entire message verbatim.
The method of the present invention includes, when a first ("alert") user receives a given instance of undesirable mail, labeling the message as undesirable, extracting a signature for the message, adding the signature to a signature database, periodically scanning a second (possibly including the same) user's messages for the presence of any signatures in the database, identifying any of the second user's messages that contain a signature as undesirable and responding appropriately to any messages so labeled.
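The label, extract, store and scan sequence described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the signature extraction of the invention: `extract_signature` is a hypothetical placeholder that simply takes the middle characters of the whitespace-normalized, lower-cased body, and the class and method names are invented for the example.

```python
import re
from typing import List, Set


def normalize(body: str) -> str:
    """Collapse whitespace and lower-case the text (assumed normalization)."""
    return re.sub(r"\s+", " ", body).strip().lower()


def extract_signature(body: str, length: int = 20) -> str:
    """Placeholder extraction: the middle `length` characters of the body."""
    text = normalize(body)
    start = max(0, (len(text) - length) // 2)
    return text[start:start + length]


class SignatureDatabase:
    """Holds signatures of messages that alert users labeled undesirable."""

    def __init__(self) -> None:
        self.signatures: Set[str] = set()

    def report_undesirable(self, body: str) -> str:
        """Called when an alert user labels a message; stores its signature."""
        signature = extract_signature(body)
        if signature:  # never store an empty signature
            self.signatures.add(signature)
        return signature

    def scan(self, inbox: List[str]) -> List[str]:
        """Return the messages in `inbox` that contain a stored signature."""
        return [m for m in inbox
                if any(s in normalize(m) for s in self.signatures)]
```

In this sketch the only manual step is the call to `report_undesirable`; everything after the label is derived and applied automatically, which mirrors the division of labor described in the text.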
Specifically, the method of hindering an undesirable transmission or receipt of electronic messages within a network of users includes the steps of: determining that transmission or receipt of at least one specific electronic message is undesirable; automatically extracting detection data that permits detection of the at least one specific electronic message or variants thereof; scanning one or more inbound and/or outbound messages from at least one user for the presence of the at least one specific electronic message or variants thereof; and taking appropriate action, responsive to the scanning step. Preferably, the method further includes the step of storing the extracted detection data.
Preferably, the determining step comprises the step of receiving notification that proliferation of the at least one specific electronic message is undesirable. The receiving step preferably includes the step of receiving a signal from an alert user identifying the at least one specific electronic message as undesirable or confidential. The at least one specific electronic message can be received in an inbox of the alert user. The receiving step preferably includes the step of providing an identifier for the alert user to indicate that the specific electronic message is to be flagged as undesirable. It is preferable that the providing step comprises the step of providing a generic detector to aid in identification of undesirability of electronic messages.
The extracting step of the present invention preferably includes the step of extracting, from the at least one specific electronic message, signature information. The storing step preferably comprises the step of adding, responsive to the scanning step, information pertaining to the at least one specific electronic message to the signature information. The signature information preferably includes a signature from the at least one specific electronic message. The storing step can include the step of storing the signature in at least one signature database. The signature database preferably comprises a plurality of signature clusters, each cluster including data corresponding to substantially similar electronic messages. Each of the signature clusters preferably comprises a character sequence component having scanning information and an archetype component having identification information about particular signature variants. The scanning information preferably includes a search character sequence for a particular electronic message and extended character sequence information for all the electronic messages represented in the cluster and wherein the identification information includes a pointer to a full text stored copy of an electronic message relating to a particular signature variant, a hashblock of the electronic message, and alert data corresponding to specific instances where a copy of the electronic message was received and the proliferation of which was reported as undesirable by an alert user.
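One possible in-memory layout for a signature cluster, following the components listed above, is sketched here in Python. The class and field names are assumptions chosen for the example, and the hashblock is represented as a list of integers only for illustration, since its exact form is not specified in this passage.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExtendedSequence:
    """Extended character sequence information for one region."""
    begin_offset: int
    region_length: int
    crc: int


@dataclass
class Alert:
    """One reported receipt of a copy of the message by an alert user."""
    user: str
    receive_time: float  # assumed to be seconds since the epoch


@dataclass
class Archetype:
    """Identification information for one signature variant."""
    text_pointer: str     # pointer to the full-text stored copy (e.g. a key)
    hashblock: List[int]  # assumed representation
    alerts: List[Alert] = field(default_factory=list)


@dataclass
class SignatureCluster:
    """Data for a group of substantially similar messages."""
    search_sequence: str  # scanning information
    extended: List[ExtendedSequence] = field(default_factory=list)
    archetypes: List[Archetype] = field(default_factory=list)
```

Keeping the scanning information separate from the per-variant archetypes means a scanner only needs the character sequence component, while the archetypes are consulted after a candidate match is found.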
The extracting step and the scanning step of the present invention can occur simultaneously and asynchronously across the network of users.
The method of the present invention can further include the step of confirming, before the scanning step, the undesirability of the at least one specific electronic message. The confirming step preferably comprises the step of confirming, with a generic detection technique, the undesirability of the at least one specific electronic message. Alternatively, the confirming step can comprise the step of requiring that a predetermined threshold number of users signal that the at least one specific electronic message is undesirable.
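The threshold form of confirmation can be illustrated with a small Python tally. The class name, the use of a message identifier as the key, and the rule that each user counts once are assumptions made for this sketch.

```python
from collections import defaultdict
from typing import DefaultDict, Set


class ReportTally:
    """Confirms a message as undesirable once enough distinct users report it."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.reporters: DefaultDict[str, Set[str]] = defaultdict(set)

    def report(self, message_id: str, user: str) -> bool:
        """Record one user's report; return True once the threshold is met."""
        self.reporters[message_id].add(user)  # repeat reports count once
        return len(self.reporters[message_id]) >= self.threshold
```

With a threshold of two, a single user reporting the same message repeatedly does not confirm it; a second, distinct user is required.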
The extracting step preferably comprises the steps of: scanning the specific electronic message for any signatures in the at least one signature database; and comparing, responsive to finding a matching signature in the scanning step, the matching signature to each message variant in a matching cluster. The comparing step preferably comprises the steps of: computing a hashblock for the specific electronic message; and comparing the computed hashblock with variant hashblocks in the identification information of each archetype component. It is preferable that the method of the present invention further comprise the steps of: if an exact variant hashblock match is found, retrieving the full text stored copy of the variant match using the pointer, and if the full text stored copy of the variant match and the full text of the specific electronic message are deemed sufficiently similar to regard the specific electronic message as an instance of the variant, extracting alert data from the specific electronic message and adding it to the alert data for the variant match; else if an exact variant hashblock match is not found or the full text of the specific electronic message is found to be insufficiently similar to any of the variants in the database, determining whether the specific electronic message is sufficiently similar to any existing cluster; if the specific electronic message is sufficiently similar to an existing cluster, computing new identification information associated with the specific electronic message; else if the specific electronic message is not determined to be sufficiently similar to an existing cluster, creating a new cluster for the specific electronic message.
The determining step preferably comprises the steps of: computing a checksum of a region of the specific electronic message indicated in the extended character sequence information for each cluster; and comparing the computed checksum with a stored checksum in the extended character sequence information of each cluster. The method preferably further comprises the step of creating, if no signature match is found, a new cluster for the specific electronic message. The extended character sequence information preferably includes a beginoffset field, a regionlength field and a CRC field, the method further comprising the steps of: determining, for each cluster, a matching region with a longest regionlength; and identifying, if the longest regionlength among all the clusters is at least equal to a specified threshold length, a longest regionlength cluster as an archetype cluster to which the specific electronic message archetype is to be added. Finally, the method of the present invention preferably comprises the step of recomputing the scanning information of the identified cluster. The alert data preferably includes a receivetime field having a time at which a copy was originally received and wherein the method further comprises the steps of: periodically comparing the receivetime field of all variants of each signature cluster with the current time; and removing a signature cluster in which none of the receivetime fields are more recent than a predetermined date and time.
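The region checksum comparison and the longest-region rule can be sketched in Python as follows. This is an illustration under simplifying assumptions: offsets are measured from the start of the message text, the checksum is `zlib.crc32` over the UTF-8 bytes of the region, and each cluster is reduced to a list of (beginoffset, regionlength, CRC) triples keyed by a hypothetical cluster name.

```python
import zlib
from typing import Dict, List, Optional, Tuple

# (begin_offset, region_length, crc) for one stored region.
Region = Tuple[int, int, int]


def make_region(text: str, begin: int, length: int) -> Region:
    """Build the stored triple for text[begin:begin + length]."""
    return (begin, length, zlib.crc32(text[begin:begin + length].encode("utf-8")))


def longest_matching_region(text: str, regions: List[Region]) -> int:
    """Length of the longest stored region whose checksum matches `text`."""
    best = 0
    for begin, length, crc in regions:
        chunk = text[begin:begin + length]
        # Skip regions that run past the end of this message.
        if len(chunk) == length and zlib.crc32(chunk.encode("utf-8")) == crc:
            best = max(best, length)
    return best


def choose_cluster(text: str, clusters: Dict[str, List[Region]],
                   threshold: int) -> Optional[str]:
    """Name of the cluster with the longest matching region, or None when
    that length is below `threshold` (a new cluster would then be created)."""
    best_name, best_length = None, 0
    for name, regions in clusters.items():
        matched = longest_matching_region(text, regions)
        if matched > best_length:
            best_name, best_length = name, matched
    return best_name if best_length >= threshold else None
```

Comparing checksums of fixed regions avoids retrieving and comparing full message texts for every cluster; only the winning cluster needs further processing.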
The scanning step preferably comprises the steps of: extracting a message body; transforming the message body into an invariant form; scanning the invariant form for exact or near matches to the detection data; and determining, for each match, a level of match.
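The first three scanning sub-steps can be sketched in Python using the standard `email` package. The invariant form shown here (stripping leading quote markers, lower-casing, and collapsing runs of non-alphanumeric characters) is an assumption for illustration, as is the restriction to single-part plain-text messages; only exact substring matches are handled, and near matching is omitted.

```python
import re
from email import message_from_string
from typing import List, Tuple


def extract_body(raw: str) -> str:
    """Return the body of a raw message (assumes single-part plain text)."""
    return message_from_string(raw).get_payload()


def invariant_form(body: str) -> str:
    """Reduce a body to a form unaffected by quoting, case and punctuation."""
    lines = [re.sub(r"^[>\s]+", "", line) for line in body.splitlines()]
    text = " ".join(lines).lower()
    return re.sub(r"[^a-z0-9]+", " ", text).strip()


def scan(raw: str, signatures: List[str]) -> List[Tuple[str, int]]:
    """Return (signature, offset) for each signature found in the message."""
    text = invariant_form(extract_body(raw))
    return [(s, text.find(s)) for s in signatures if s in text]
```

Because forwarding typically adds quote markers and reflows lines, scanning the invariant form rather than the raw body lets a single stored signature match forwarded copies as well.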
The taking step preferably comprises the step of taking appropriate action upon discovering the presence of the at least one specific electronic message or variants thereof. The taking step can comprise the step of labeling the at least one specific electronic message or variants thereof as undesirable or confidential. The taking step also can comprise the step of removing the at least one specific electronic message or variants thereof.
The taking step preferably comprises the step of taking appropriate action for each determined level of match, responsive to one or more user preferences and the determining step preferably comprises the steps of: finding the longest regional matches for each match; computing hashblock similarities between a hashblock of the scanned message and hashblocks of each of the extracted detection data; receiving one or more user preferences; and determining a level of match responsive to the finding, computing and receiving steps.
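How the three inputs might combine into a level of match, and how a level might map to a response, is sketched below in Python. Everything specific here is hypothetical: the level names, the preference keys, the threshold values, the assumption that hashblock similarity is a score between 0 and 1, and the action names.

```python
from typing import Dict

# Hypothetical per-user preferences and level-to-action mapping.
DEFAULT_PREFS: Dict[str, float] = {
    "verbatim_similarity": 0.95,  # hashblock similarity treated as verbatim
    "min_region_length": 200,     # characters needed for a regional match
}
DEFAULT_ACTIONS: Dict[str, str] = {
    "verbatim": "delete",
    "region": "move_to_folder",
    "signature": "flag",
}


def level_of_match(longest_region: int, hash_similarity: float,
                   prefs: Dict[str, float] = DEFAULT_PREFS) -> str:
    """Classify a signature hit from weakest ("signature") to strongest."""
    if hash_similarity >= prefs["verbatim_similarity"]:
        return "verbatim"
    if longest_region >= prefs["min_region_length"]:
        return "region"
    return "signature"


def choose_action(level: str, actions: Dict[str, str] = DEFAULT_ACTIONS) -> str:
    """Look up the response configured for a level of match."""
    return actions[level]
```

Tying stronger responses to stricter levels lets a cautious user reserve deletion for near-verbatim copies while only flagging weaker matches.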
The present invention also includes a program storage device, readable by a machine, tangibly embodying a program of instructions executable by the machine to perform method steps for hindering an undesirable transmission or receipt of electronic messages within a network of users, the method comprising the steps of: determining that transmission or receipt of at least one specific electronic message is undesirable; automatically extracting detection data that permits detection of the at least one specific electronic message or variants thereof; scanning one or more inbound and/or outbound messages from at least one user for the presence of the at least one specific electronic message or variants thereof; and taking appropriate action, responsive to the scanning step.
Finally, the present invention also includes a system for hindering an undesirable transmission or receipt of electronic messages within a network of users, comprising: means for determining that transmission or receipt of at least one specific electronic message is undesirable; means for automatically extracting detection data that permits detection of the at least one specific electronic message or variants thereof; means for scanning one or more inbound and/or outbound messages from at least one user for the presence of the at least one specific electronic message or variants thereof; and means for taking appropriate action, responsive to the scanning means. Otherwise, the preferable embodiments of the system match those of the method of the present invention.