Well-publicized vulnerabilities of public and private packet communications networks to a variety of malicious activity by computer hackers and others have been reported in the technical and business literature, as well as in the popular press. Computer viruses, computer worms and widely experienced e-mail spam are among the most prevalent and potentially most injurious forms of malicious activity perpetrated on unsuspecting computer users.
Perhaps because e-mail spam so universally confronts computer users, this plague of unwanted and often offensive communications has been the subject of much research and experimentation—with a good degree of success. By recognizing the occurrence of certain words and phrases in e-mail header information and message content, it has been possible to intercept and neutralize a wide range of unwanted e-mail. Some particular anti-spam techniques have proven to be of limited use over extended periods of time, however, because of the resourcefulness of increasingly capable spammers in deceiving these anti-spam efforts.
Thus, as in many technological battles, an increase in spamming is met with an increasing number of tools to combat the flood of e-mail spam. Spammers, in turn, seek to evade or work around these tools by varying spam formats, content phrasings, addressing and other techniques. The point-counterpoint battle continues. It is highly desirable, therefore, that anti-spam and other anti-malware efforts be self-adapting to the changing strategies of spammers and other practitioners of malware distribution.
One of the many available anti-spam filtering approaches is described in Soonthornphiasaj, et al., “Anti-Spam Filtering: A Centroid-Based Classification Approach,” IEEE ICSP '02 Proceedings, June, 2002. This technique applies weighted word term frequency vector operations to spam e-mail samples and legitimate e-mail samples to yield a centroid vector representative of each class of e-mail. These centroid vectors are then used for similarity comparisons with newly arrived e-mail messages.
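By way of illustration only, the centroid comparison described above may be sketched as follows. The sketch is not taken from the cited paper: it assumes plain normalized term frequencies in place of whatever weighting the reference employs, assumes cosine similarity as the similarity measure, and uses hypothetical function names (tf_vector, centroid, cosine, classify) chosen for this example.

```python
import math
from collections import Counter


def tf_vector(text):
    """Normalized term-frequency vector for one message (assumed weighting)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def centroid(texts):
    """Average the term-frequency vectors of all sample messages in a class."""
    acc = Counter()
    for text in texts:
        for word, weight in tf_vector(text).items():
            acc[word] += weight
    return {word: weight / len(texts) for word, weight in acc.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors (assumed measure)."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, spam_centroid, legit_centroid):
    """Label a new message by whichever class centroid it is closer to."""
    vec = tf_vector(text)
    if cosine(vec, spam_centroid) > cosine(vec, legit_centroid):
        return "spam"
    return "legitimate"


# Hypothetical training samples, for illustration only.
spam_centroid = centroid(["buy cheap pills now", "cheap pills cheap offer"])
legit_centroid = centroid(["meeting agenda for monday",
                           "please review the project agenda"])
print(classify("cheap pills offer", spam_centroid, legit_centroid))
```

In this sketch a message that resembles neither class is labeled legitimate by default; a practical filter might instead apply a similarity threshold.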
Another classifier for determining the likelihood that a received e-mail should be deemed spam or legitimate is described in U.S. Pat. No. 6,161,130 issued Dec. 12, 2000 to E. Horvitz, et al.
Address-based anti-spamming activity is described in U.S. Pat. No. 6,052,709 issued Apr. 18, 2000 to S. Paul, where broadcast alert signals are used to disseminate potential sources of spam once detected at distributed sites in a network. Such address-based anti-spam approaches are sometimes combined with content-based approaches (e.g., using lists of character strings) to provide a multiple-filter technique, as described in U.S. Pat. No. 6,023,732 issued Feb. 8, 2000 to W. B. McCormick, et al.
Often, e-mail spam, when recognized, is removed without actually opening such e-mail, by using a combination of user techniques, as described in U.S. Pat. No. 6,493,007 B1, issued Dec. 10, 2002 to S. P. Pang. Such techniques often require considerable recipient participation, however.
Recently, a new approach has been applied in a number of anti-spamming efforts. This approach is based on long-known Bayesian statistical techniques that are widely used in a variety of statistical applications. In the context of anti-spamming activities, incoming e-mail is filtered using a Bayesian classifier that has learned characteristics of both unsolicited (spam) and legitimate (non-spam) e-mail. Each received e-mail is classified by the Bayesian filter as probably spam or probably legitimate, and the learning of the classifier is then updated.
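The classify-then-update cycle of such a filter may be illustrated, under stated assumptions, by the following minimal naive Bayes sketch. It is not the method of any particular reference or product cited herein; the class name NaiveBayesFilter, its method names, the whitespace tokenization and the add-one smoothing are all assumptions made for this example.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Hypothetical word-based naive Bayes spam filter (illustrative only)."""

    LABELS = ("spam", "legitimate")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.msg_counts = {label: 0 for label in self.LABELS}

    def learn(self, text, label):
        """Update the learned statistics with one labeled message."""
        self.msg_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def spam_probability(self, text):
        """Posterior probability that the message is spam."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["legitimate"])
        total_msgs = sum(self.msg_counts.values())
        log_score = {}
        for label in self.LABELS:
            # Add-one smoothing keeps priors and likelihoods non-zero.
            score = math.log((self.msg_counts[label] + 1) / (total_msgs + 2))
            n_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                seen = self.word_counts[label][word]
                score += math.log((seen + 1) / (n_words + len(vocab) + 1))
            log_score[label] = score
        # Normalize in log space to avoid underflow on long messages.
        top = max(log_score.values())
        weights = {k: math.exp(v - top) for k, v in log_score.items()}
        return weights["spam"] / (weights["spam"] + weights["legitimate"])


# Hypothetical training data, for illustration only.
bayes = NaiveBayesFilter()
bayes.learn("win cheap pills now", "spam")
bayes.learn("cheap offer win prize", "spam")
bayes.learn("meeting agenda monday", "legitimate")
bayes.learn("project review agenda", "legitimate")

message = "cheap pills"
label = "spam" if bayes.spam_probability(message) > 0.5 else "legitimate"
bayes.learn(message, label)  # the classifier's learning is updated
```

Feeding the filter's own decision back into learn, as in the last line, is only one possible update policy; a deployed filter might instead update only on labels confirmed by the recipient.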
Another application of Bayesian filter techniques is described in U.S. Pat. No. 6,732,157 B1 issued May 4, 2004 to B. P. Gordon, et al. Bayesian techniques for diagnosis of actual or potential faults in communications networks are described in U.S. Pat. No. 6,076,083, issued Jun. 13, 2000 to M. Baker. Using techniques for deriving and manipulating conditional probabilities for classes of events, the Baker technique seeks to determine probable cause of network faults.
More generally, a class of Bayesian filtering techniques has been developed that is known as Learning Bayesian Networks, as described, for example, in a book by that name by R. E. Neapolitan, Prentice-Hall, 2003. A useful tutorial on Learning Bayesian Networks is provided in D. Heckerman, “A Tutorial on Learning With Bayesian Networks,” March, 1995, revised November, 1996, available at www.microsoft.com as MSR-TR-95-06 and in numerous other printed publications.
Such Learning Bayesian Networks and related techniques have been widely discussed in the literature, including: I. Androutsopoulos, et al., “An Experimental Comparison of Naïve Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-Mail Messages,” Proc. ACM SIGIR 2000, Athens, Greece, July, 2000, pp. 160-167; and C. O'Brien, et al., “Invited workshop on conceptual information retrieval and clustering of documents: Spam Filters: Bayes vs. Chi-Squared; Letters vs. Words,” Proceedings of the 1st international symposium on Information and communication technologies ISICT '03, September, 2003, pp. 291-296. Other Bayesian anti-spam approaches are described in Gary Robinson, “A Statistical Approach to the Spam Problem,” Linux J. vol. 2003, Issue 107, March, 2003; Sara Sinclair, “Adapting Bayesian Statistical Spam Filters to the Server Side,” Journal of Computing Sciences in Colleges, Volume 19 Issue 5, May, 2004, pp. 344-346; Le Zhang, Jingbo Zhu, Tianshun Yao, “An Evaluation of Statistical Spam Filtering Techniques,” ACM Transactions on Asian Language Information Processing (TALIP), Volume 3 Issue 4, December, 2004, pp. 243-269; and Stefan Axelsson, “VizSEC Innovative Visualizations Session: Combining a Bayesian Classifier with Visualisation: Understanding the IDS,” Proceedings of the 2004 ACM workshop on Visualization and data mining for computer security, October, 2004, pp. 99-108.
Commercial products employing Bayesian filtering techniques to detect spam include McAfee SpamKiller, as described at http://www.mcafeesecurity.com. Other such products are offered by GFI, as disclosed in a White Paper entitled “Why Bayesian filtering is the most effective anti-spam technology,” at http://www.GFI.COM.
Other malicious activities encountered by users of communications networks include computer viruses and worms. A representative discussion of mechanisms associated with viruses is presented, for example, in M. M. Williamson, “Throttling Viruses: Restricting propagation to defeat malicious mobile code,” Proc. IEEE 18th Computer Security Applications Conference (ACSAC '02), 2002. Computer worms and some of their characteristics are described, for example, in N. Weaver, et al., “A Taxonomy of Computer Worms,” Proc. ACM WORM '03, Oct. 27, 2003. A common source of infection of computers by viruses and/or worms is received e-mail.
While many of the anti-virus and anti-worm techniques previously employed have been effective to varying degrees in many circumstances, they have not realized the full potential of Bayesian filtering. Nor have prior anti-virus and anti-worm techniques employed the full learning power of Learning Bayesian Networks in exploiting many of the characteristics of viruses and worms that make such malware so potentially devastating.