1. Field of the Invention
The present invention is related to anti-spam technology, and more particularly, to a method for analyzing statistics collected for botnets sending out SPAM.
2. Description of the Related Art
SPAM emails have become a veritable scourge of modern email systems. It has been estimated that as much as 60-80% of Internet email traffic today is of a SPAM nature. SPAM, in addition to being annoying and wasteful of the recipient's time, places a considerable burden on large email service providers and on corporate networks. For a “regular user,” the “cost” of SPAM that gets through is the few clicks that it takes to delete the offending message.
For large-scale email providers, such as Google, Yahoo, and Microsoft, as well as for large corporations that have their own server-based solutions for SPAM filtering, handling SPAM is a problem that needs to be solved on an industrial scale. For example, such large mail service providers need to filter millions of SPAM messages every hour.
One phenomenon observed recently is the increasing professionalism of SPAM generators. Many of the techniques used by SPAM generators closely mirror, and borrow from, techniques used by professional virus writers. It has been estimated that at any given moment, millions of computers connected to the Internet are zombified (i.e., compromised). In other words, these computers spew out vast numbers of SPAM emails, and emails with malicious attachments, even though the owners of these computers are unaware of this.
Large quantities of SPAM are now being sent by networks of compromised computers, known as botnets. The activity of these networks is of serious concern to security professionals all over the world, and the problem of tracking botnets is receiving considerable attention. However, large distributed networks of computers, most of which have dynamic IP addresses, are hard to track and to separate from each other, which makes it difficult to identify a computer sending out SPAM in real time.
Another SPAM-related problem is the filtering out of false positives. Generally, in the industry, a false positive is regarded as a much greater evil than letting through some number of SPAM messages, since very often, an email that was falsely identified as SPAM by the SPAM filter will never be seen by its intended recipient, or, at best, will be seen much later by the intended recipient.
In general, many present methods for SPAM identification have not been fully successful. For example, attempts to work with filters for sorting out source addresses of bulk email distributors have not proven successful. A network of compromised computers under a common control infrastructure is a powerful tool for managing various kinds of illegal activity. Such networks are commonly used for Distributed Denial-of-Service (DDoS) attacks, sending out SPAM, spreading malware or other malicious purposes.
Significant surges in SPAM activity are now being linked to the increasing use of botnets by spammers. Large computer networks distributed over several countries and continents are hard to track, and since most of the computers in such networks have dynamic IP addresses that can change every time the computer is started, deploying traditional blacklisting services is of almost no use.
To resist massive SPAM attacks from these bot networks, an instrument that is able to keep track of them in real time is needed. The lists of IP addresses of computers that constitute botnets need to be updated frequently to keep current with the changes in the set of active hosts. This is necessary to ensure a quick response to new SPAM sources and to exclude false positives when IP addresses drop out of the botnet.
One possible approach involves an in-depth examination of the MIME structure and an analysis of the message content. The similarities between large numbers of messages, and hence the likelihood of their sources belonging to the same botnet, could be established even if changes were made to some of the messages, a practice that is widely used by spammers. However, this approach is both technically and administratively infeasible because the email destinations are mailboxes distributed all over the world.
Downloading millions of email messages from the destination mailboxes every hour to one location and analyzing them is not feasible. It can be extremely costly. Besides, this method would be inadvisable for security and privacy protection reasons, as it would mean private emails are being relayed to a third party for analysis.
Yet another possible approach is to analyze the fingerprints of the messages, rather than the messages themselves. A mail agent receiving a message can take its fingerprint and send it, along with the IP address of the source host, to a specific location for analysis. Then, hundreds or thousands of source hosts sending out messages with the same fingerprint would be indicative of a botnet. This approach can be implemented, since the size of a fingerprint is just a few bytes (typically 16 bytes) and it is impossible to recover a message from its fingerprint. Thus, clients' privacy can be protected.
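The fingerprint-based approach described above can be illustrated by the following minimal sketch. It is not the method of the invention; the digest function (BLAKE2b truncated to 16 bytes), the class name FingerprintCollector, and the threshold MIN_DISTINCT_HOSTS are all assumptions introduced only for illustration.

```python
import hashlib
from collections import defaultdict

# Hypothetical threshold; the text speaks of hundreds or thousands of hosts.
MIN_DISTINCT_HOSTS = 3


def fingerprint(message: bytes) -> bytes:
    """Return a 16-byte one-way digest of the message.

    BLAKE2b is an assumption; the source does not name a hash function.
    """
    return hashlib.blake2b(message, digest_size=16).digest()


class FingerprintCollector:
    """Central location that receives (fingerprint, source IP) reports."""

    def __init__(self):
        # Maps each fingerprint to the set of distinct source IPs seen.
        self._hosts_by_fp = defaultdict(set)

    def report(self, fp: bytes, source_ip: str) -> None:
        """Record that a mail agent saw this fingerprint from source_ip."""
        self._hosts_by_fp[fp].add(source_ip)

    def suspected_botnets(self, min_hosts: int = MIN_DISTINCT_HOSTS) -> dict:
        """Return fingerprints reported from at least min_hosts distinct IPs."""
        return {
            fp: set(hosts)
            for fp, hosts in self._hosts_by_fp.items()
            if len(hosts) >= min_hosts
        }
```

Only the digest and the source IP address leave the mail agent in this sketch, so the message body itself is never relayed to the collector.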
The disadvantage of this approach is the need to construct a fingerprint that remains invariant under transformations of the message. This would require a thorough examination of the message content, including attached pictures, documents, etc. Spammers use various techniques to distort messages even within the same distribution. The message parts and attributes that can be altered include the text; the number, size and formats of attachments; images; text and transfer encodings; etc. As a result, even sufficiently “fuzzy” fingerprints do not guarantee detection. Moreover, increasing the “fuzziness” of the fingerprint is likely to result in a large number of false positives.
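The invariance problem can be illustrated by the following sketch, which compares an exact fingerprint with a deliberately simple "fuzzy" one. The normalization rule (lowercase, letters only) and the function names are hypothetical and chosen only for illustration; practical fuzzy fingerprints are considerably more elaborate.

```python
import hashlib
import re


def exact_fingerprint(text: str) -> bytes:
    """16-byte digest of the raw text (hash choice is an assumption)."""
    return hashlib.blake2b(text.encode("utf-8"), digest_size=16).digest()


def fuzzy_fingerprint(text: str) -> bytes:
    """Digest of a normalized form of the text.

    Hypothetical normalization: lowercase the text and drop everything
    except the letters a-z, so case, spacing, punctuation and digits
    no longer affect the result.
    """
    normalized = re.sub(r"[^a-z]", "", text.lower())
    return exact_fingerprint(normalized)


original = "Buy cheap watches now! Offer 1234"
distorted = "BUY  cheap watches NOW!!! offer 9876"   # same distribution, lightly altered
reworded = "Purchase inexpensive watches today"      # same distribution, rewritten

# Any small alteration changes the exact fingerprint.
exact_survives = exact_fingerprint(original) == exact_fingerprint(distorted)

# The fuzzy fingerprint tolerates the light alteration...
fuzzy_survives = fuzzy_fingerprint(original) == fuzzy_fingerprint(distorted)

# ...but not a rewording of the same advertisement.
fuzzy_survives_rewording = fuzzy_fingerprint(original) == fuzzy_fingerprint(reworded)
```

Making the normalization coarser would catch more variants, but it would also map more unrelated messages to the same fingerprint, which is the false-positive risk noted above.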
False positives can be eliminated by analyzing botnet statistics and by keeping the botnet membership lists current, adding hosts that send out SPAM to the existing botnets. Accordingly, an efficient method for collecting and analyzing botnet statistics is desired. It is also desired to have a method for maintaining and updating botnets based on statistical data collected from the botnets and from individual hosts sending out potential SPAM.